# Extracting QIF features

The script `lna_parallel_n.sh` is provided to simplify extracting features from a batch of nodules. The requirements and details for using this script are presented below.

## Software requirements:
The script expects Octave (tested with version 4.2.0) with the _statistics_ and _image_ packages installed. The code itself will also run under MATLAB, but you will need to modify `lna_batch_driver.sh` and `lna_by_file.sh` to change the Octave commands to the corresponding MATLAB commands.

## Directory and file naming conventions
For the `lna_parallel_n.sh` script to associate the files correctly, use the following directory structure and naming conventions.

* You will need a directory for the original CT scan images in Analyze format (referred to as "grey" images here), and a second directory for segmentation mask images (referred to as "binary" images here).
    - For this example, assume the name `grey_dir` is used for the "grey" images, and `mask_dir` is used for the "binary" segmentation masks.
* Beneath the "grey" directory, create one directory for each patient.
    - For example, in `grey_dir` create a directory for each patient, using a unique patient ID for each, like `grey_dir/PATIENT-0001`, `grey_dir/PATIENT-0002`, etc.
* Beneath the "binary" directory, create one directory for each nodule, where the directory name contains the patient ID followed by a tilde (`~`) symbol followed by a nodule ID.
    - For example, in `mask_dir`, if `PATIENT-0001` has two nodules `Nodule1` and `Nodule2`, you would create the directories `mask_dir/PATIENT-0001~Nodule1` and `mask_dir/PATIENT-0001~Nodule2`.

To summarize, every patient must have a unique ID, and that patient's files must be stored in a directory named with this ID. Every nodule must also have a unique ID, and it must be combined with the patient ID in the format `PATIENT-ID~NODULE-ID` (separated by a tilde character); nodule files must be placed in a directory with the combined patient/nodule ID as its name.

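The naming convention above can be sketched as a small Python helper. This is only an illustration and is not part of the provided scripts: the function name `split_nodule_id` is hypothetical, and the sketch assumes that neither the patient ID nor the nodule ID contains a tilde.

```python
def split_nodule_id(combined_id):
    """Split 'PATIENT-ID~NODULE-ID' into (patient_id, nodule_id).

    Assumption: the tilde appears exactly once, as the separator.
    """
    patient_id, sep, nodule_id = combined_id.partition("~")
    if not sep or not patient_id or not nodule_id or "~" in nodule_id:
        raise ValueError("expected PATIENT-ID~NODULE-ID, got %r" % combined_id)
    return patient_id, nodule_id


patient, nodule = split_nodule_id("PATIENT-0001~Nodule1")
print(patient)  # PATIENT-0001
print(nodule)   # Nodule1
```

A check like this may help catch directory names that are missing the tilde before a long extraction run is started.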
## Prepare the "grey" and "binary" images
The extraction code requires a matching "grey" (original CT scan) image and "binary" (segmentation mask) image for each nodule to be examined.

The "binary" segmentation mask file must have the same X, Y, and Z dimensions as the "grey" file. All of its voxel values must be zero except for voxels belonging to the nodule, which must all be set to 1.

You can use the same "grey" file for all nodules of a given patient, but you will need a separate "binary" segmentation mask file for each nodule. The input files must be in Analyze (.img) format.

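The two mask requirements (matching dimensions, voxel values restricted to 0 and 1) could be verified with a check along the following lines. This is a hedged sketch, not code from this repository: `check_mask` is a hypothetical name, and the sketch assumes the image dimensions and the mask voxel values have already been read from the Analyze files by a tool of your choice.

```python
def check_mask(grey_shape, mask_shape, mask_voxels):
    """Return a list of problems found; an empty list means the mask passes.

    grey_shape, mask_shape: (X, Y, Z) dimensions of the two images.
    mask_voxels: any iterable over the mask's voxel values.
    """
    problems = []
    if tuple(grey_shape) != tuple(mask_shape):
        problems.append(
            "dimension mismatch: grey %s vs mask %s"
            % (tuple(grey_shape), tuple(mask_shape))
        )
    # Collect the distinct values; only 0 and 1 are allowed.
    unexpected = set(mask_voxels) - {0, 1}
    if unexpected:
        problems.append(
            "mask contains values other than 0 and 1: %s" % sorted(unexpected)
        )
    return problems


# A tiny 2x2x1 example mask flattened to a list of voxel values.
print(check_mask((2, 2, 1), (2, 2, 1), [0, 1, 1, 0]))  # []
```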
For example, if you have a patient PATIENT-0001 with two nodules Nodule1 and Nodule2, you would need to prepare the following:

* `grey_dir/PATIENT-0001/PATIENT-0001.img`, `grey_dir/PATIENT-0001/PATIENT-0001.hdr` - Analyze format "grey" (original CT image) for PATIENT-0001.
* `mask_dir/PATIENT-0001~Nodule1/PATIENT-0001~Nodule1.img`, `mask_dir/PATIENT-0001~Nodule1/PATIENT-0001~Nodule1.hdr` - Analyze format "binary" segmentation mask for PATIENT-0001, Nodule1.
* `mask_dir/PATIENT-0001~Nodule2/PATIENT-0001~Nodule2.img`, `mask_dir/PATIENT-0001~Nodule2/PATIENT-0001~Nodule2.hdr` - Analyze format "binary" segmentation mask for PATIENT-0001, Nodule2.

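The file layout in this example follows a regular pattern, which the sketch below reproduces for any combined ID. It is illustrative only: `expected_files` and `missing_files` are hypothetical helpers, and the pattern is inferred from the example paths above rather than taken from the extraction scripts.

```python
import os


def expected_files(grey_dir, mask_dir, combined_id):
    """List the four Analyze files expected for one nodule.

    combined_id has the form 'PATIENT-ID~NODULE-ID'.
    """
    patient_id = combined_id.split("~", 1)[0]
    grey_base = os.path.join(grey_dir, patient_id, patient_id)
    mask_base = os.path.join(mask_dir, combined_id, combined_id)
    return [
        base + ext
        for base in (grey_base, mask_base)
        for ext in (".img", ".hdr")
    ]


def missing_files(grey_dir, mask_dir, combined_id):
    """Return the expected files that do not exist on disk."""
    return [
        path
        for path in expected_files(grey_dir, mask_dir, combined_id)
        if not os.path.isfile(path)
    ]


for path in expected_files("grey_dir", "mask_dir", "PATIENT-0001~Nodule1"):
    print(path)
```

Calling `missing_files` for every ID in your nodule lists before starting the extraction is one way to find incomplete inputs early.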
## Create one or more nodule list files
You can run the extraction process in parallel. The number of parallel instances is determined by the number of nodule list files you prepare, because each list is run in parallel with the others.

For example, given the files detailed above for two nodules belonging to PATIENT-0001, you can create two nodule lists, and run two processes in parallel. Create the following files:

* `nodule-list.1`
* `nodule-list.2`

With the following contents:

**`nodule-list.1`**:

```
PATIENT-0001~Nodule1
```

**`nodule-list.2`**:

```
PATIENT-0001~Nodule2
```

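For larger batches, the list files could be generated rather than written by hand. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the IDs are distributed round-robin (one possible choice, not something the scripts require), and each list file is assumed to hold one `PATIENT-ID~NODULE-ID` entry per line, as in the single-entry examples above.

```python
import os


def split_round_robin(nodule_ids, n_lists):
    """Distribute nodule IDs over n_lists lists, one ID at a time."""
    if n_lists < 1:
        raise ValueError("n_lists must be at least 1")
    return [nodule_ids[i::n_lists] for i in range(n_lists)]


def write_nodule_lists(nodule_ids, n_lists, out_dir=".", prefix="nodule-list"):
    """Write nodule-list.1 .. nodule-list.N and return the paths written.

    Assumption: one combined ID per line. Empty lists are skipped.
    """
    paths = []
    chunks = split_round_robin(nodule_ids, n_lists)
    for index, chunk in enumerate(chunks, start=1):
        if not chunk:
            continue
        path = os.path.join(out_dir, "%s.%d" % (prefix, index))
        with open(path, "w") as handle:
            handle.write("\n".join(chunk) + "\n")
        paths.append(path)
    return paths


ids = ["PATIENT-0001~Nodule1", "PATIENT-0001~Nodule2", "PATIENT-0002~Nodule1"]
print(split_round_robin(ids, 2))
```

With the three IDs shown, the first list receives the first and third IDs and the second list receives the second ID.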
## Running in parallel
With the list files prepared as detailed above, you can run as follows:

```
./lna_parallel_n.sh \
    -g grey_dir \
    -b mask_dir \
    -o output_dir \
    nodule-list.1 nodule-list.2
```

Where `grey_dir` and `mask_dir` match the directories where you have stored the "grey" and "binary" files, and `output_dir` is the location where you want the output to be written.

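If the extraction is launched from a Python workflow, the same command line can be assembled programmatically. This is a sketch, not a provided wrapper: `build_command` and `run_extraction` are hypothetical names, and only the `-g`, `-b`, and `-o` options shown in the command above are used.

```python
import subprocess


def build_command(grey_dir, mask_dir, output_dir, list_files,
                  script="./lna_parallel_n.sh"):
    """Build the argument list for lna_parallel_n.sh."""
    if not list_files:
        raise ValueError("at least one nodule list file is required")
    return [script, "-g", grey_dir, "-b", mask_dir, "-o", output_dir] + list(list_files)


def run_extraction(grey_dir, mask_dir, output_dir, list_files):
    """Run the script and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(
        build_command(grey_dir, mask_dir, output_dir, list_files),
        check=True,
    )


cmd = build_command("grey_dir", "mask_dir", "output_dir",
                    ["nodule-list.1", "nodule-list.2"])
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with the tilde in nodule IDs and with directory names that contain spaces.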
## Convert output to CSV for analysis
The output (.hd5) file is not very user-friendly for downstream analysis. Scripts are provided to help convert it to a more usable form:

* `batch_matlab_features_to_csv.sh` - Used to extract the features from the HDF5 (.hd5) files into easy-to-use CSV (comma-separated) format.
    - This script will produce one CSV file per nodule, matching the one HDF5 file per nodule produced by the Octave/MATLAB code.

**Example**
```
./batch_matlab_features_to_csv.sh \
    --output_dir /csv_output_dir \
    /output_dir/*.hd5
```

* `collect_features_to_matrix.py` - A Python script used to combine the individual nodule-level CSV files into a single CSV file with one column per nodule.
    - For the "Highly accurate model for prediction of lung nodule malignancy with CT scans" paper, we used the following options:
        + `--remove-rows 26 43 44`

**Example**
```
python collect_features_to_matrix.py \
    /combined_output_dir/combined2d.csv \
    /csv_output_dir/*features2d.csv \
    --remove-rows 26 43 44
```

# Author Info
The code in the `Matlab_Source` subdirectory was mainly written by David Politte and David Gierada. The driver scripts in this directory were mostly written by Jason L Causey; some code in this directory was written by Justin Porter, and the utility _cell2csv.m_ was written by Sylvain Fiedler. Other code authors and attribution information are listed in header comments in the code.