The script lna_parallel_n.sh
is provided to simplify extracting features from a batch of nodules. The requirements and details for using this script are presented below.
The script expects Octave (tested with version 4.2.0) with the statistics and image packages installed. The code itself will also run under MATLAB, but you will need to modify lna_batch_driver.sh
and lna_by_file.sh
to change the Octave commands to the corresponding MATLAB commands.
In order for the lna_parallel_n.sh
script to associate the files correctly, it is important that the following directory structure and naming conventions be used.
grey_dir
is used for the "grey" images, and mask_dir
is used for the "binary" segmentation masks.grey_dir
create a directory for each patient, using a unique patient ID for each, like grey_dir/PATIENT-0001
, grey_dir/PATIENT-0002
, etc.~
) symbol followed by a nodule ID.mask_dir
, if PATIENT-0001
has two nodules Nodule1
and Nodule2
, you would create the directories mask_dir/PATIENT-0001~Nodule1
and mask_dir/PATIENT-0001~Nodule2
.To summarize, patients must all have a unique ID and that their files will be stored in a directory named with this ID. Nodules must also have a unique ID, and it must be combined with the patient ID in the format PATIENT-ID~NODULE-ID
(separated by a tilde character); nodule files should be placed into a directory with the combined patient/nodule ID as its name.
The extraction code requires that a matching "grey" (original CT scan) and "binary" (segmentation mask) image exists for each nodule to be examined.
The "binary" segmentation mask file must be the same size (in terms of X,Y,Z dimension) as the "grey" file, and must have all voxel values set to zero except for voxels belonging to the nodule, which should all be set to 1.
You can use the same "grey" file for all nodules for a specific patient, but you will need a separate "binary" segmentation mask file for each nodule. The input files must be in Analyze (.img) format.
For example, if you have a patient PATIENT-0001 with two nodules Nodule1 and Nodule2, you would need to prepare the following:
grey_dir/PATIENT-0001/PATIENT-0001.img , grey_dir/PATIENT-0001/PATIENT-0001.hdr
Analyze format "grey" (original CT image) for PATIENT-0001mask_dir/PATIENT-0001~Nodule1/PATIENT-0001~Nodule1.img , mask_dir/PATIENT-0001~Nodule1/PATIENT-0001~Nodule1.hdr
- Analyze format "binary" segmentation mask for PATIENT-0001, Nodule1.mask_dir/PATIENT-0001~Nodule2/PATIENT-0001~Nodule2.img , mask_dir/PATIENT-0001~Nodule2/PATIENT-0001~Nodule2.hdr
- Analyze format "binary" segmentation mask for PATIENT-0001, Nodule2.You can run the extraction process in parallel -- the number of parallel instances is determined by the number of nodule list files you prepare (each will be run in parallel with the others).
For example, given the files detailed above for two nodules belonging to PATIENT-0001, you can create two nodule lists, and run two processes in parallel. Create the following files:
nodule-list.1
nodule-list.2
With the following contents:
nodule-list.1
:
PATIENT-0001~Nodule1
nodule-list.2
PATIENT-0001~Nodule2
With the list files prepared as detailed above, you can run as follows:
./lna_parallel_n.sh \
-g grey_dir \
-b mask_dir \
-o output_dir \
nodule-list.1 nodule-list.2
Where grey_dir
and mask_dir
match the directories where you have stored the "grey" and "binary" files, and output_dir
is the location where you want the output to be written.
The output (.hd5) file is not very user-friendly for downstream analysis. Scripts are provided to help convert it to a more usable form:
batch_matlab_features_to_csv.sh
- Used to extract the features from the HDF5 (.hd5) files into easy-to-use CSV (comma-separated) format.Example
./batch_matlab_features_to_csv.sh \
--output_dir /csv_output_dir \
/output_dir/*.hd5
collect_features_to_matrix.py
- A Python script used to combine the individual nodule-level CSV files into a single CSV file with one column per nodule.--remove-rows 26 43 44
Example
python collect_features_to_matrix.py \
/combined_output_dir/combined2d.csv \
/csv_output_dir/*features2d.csv \
--remove-rows 26 43 44
The code in the Matlab_Source
subdirectory was mainly written by David Politte and David Gierada. The driver scripts in this directory were mostly written by Jason L Causey; some code in this directory was written by Justin Porter, and the utility cell2csv.m was written by Sylvain Fiedler. Other code authors and attribution information is listed in header comments in code.