# NoduleX
Supporting code for the paper _"Highly accurate model for prediction of lung nodule malignancy with CT scans"_.

## Instructions
After cloning or downloading this repository, extract the files from http://bioinformatics.astate.edu/NoduleX/NoduleX_data.tar.gz into the `data` directory.
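
For example, one way to download and extract the archive (assuming `curl` and `tar` are available; adjust if the archive is obtained another way):

```bash
# Download the data archive and unpack it into the data directory.
curl -L -o NoduleX_data.tar.gz http://bioinformatics.astate.edu/NoduleX/NoduleX_data.tar.gz
tar -xzf NoduleX_data.tar.gz -C data
```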

Many of the scripts in this repository accept several command-line options.  Run a script with the `--help` option to see its usage listing.

### Requirements
* Python 2.7 and `pip`
    - A requirements file `NoduleX_python_requirements.txt` is provided, listing the required Python packages.  You can install them using:
        - `pip install -r NoduleX_python_requirements.txt`
    - Setting up a virtual environment is recommended (see the sketch following this list).
* QIF feature extraction requires Octave (version 4.2.0 was tested) or MATLAB (with some modifications to the helper scripts; see `QIF_extraction/README.md`).
* A POSIX-compatible system (Linux, macOS, or a Linux shell under Windows) is assumed; many of the scripts are written in Bash.
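
A minimal environment setup sketch, assuming the `virtualenv` tool and a Python 2.7 interpreter are installed:

```bash
# Create and activate an isolated Python 2.7 environment,
# then install the pinned dependencies.
virtualenv -p python2.7 venv
source venv/bin/activate
pip install -r NoduleX_python_requirements.txt
```
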
### Running the CNN models against validation data
Use the script `keras_CNN/keras_evaluate.py`, providing a model .json file (from `data/CNN_models`), the matching weights .hd5 file (from `data/CNN_weights`), and a dataset .hd5 file (from `data/CNN_datasets`).  For example:

```bash
python keras_CNN/keras_evaluate.py \
    --window \
    data/CNN_models/CNN47.json \
    data/CNN_weights/12v45_weights.hd5 \
    data/CNN_datasets/S12vS45_s47_VALIDATION.hd5
```

### Training CNN models with training data
Use the script `keras_CNN/keras_retrain_model.py`, providing the correct model .json file (from `data/CNN_models`) and a training dataset .hd5 file (from `data/CNN_datasets`).  For example:

```bash
python keras_CNN/keras_retrain_model.py \
    --window \
    -s 47 \
    data/CNN_models/CNN47.json \
    data/CNN_datasets/S12vS45_s47_TRAIN.hd5 \
    /tmp/CNN47_retrain_checkpoint_dir
```

The final model weights will be saved in the working directory (as a .hd5 file), and checkpoints will be placed in the directory `/tmp/CNN47_retrain_checkpoint_dir` (you can, of course, choose a different checkpoint directory).

Depending on the number of epochs and batch size you choose (the defaults are 200 and 64, respectively), the model may overfit.  Examine the checkpoint models as well as the final model to determine the best overall performance; typically the best will be one of the last 3 checkpoints or the final model.  Training is stochastic, so repeated training runs will yield different results.  One way to compare checkpoints is sketched below.
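
A sketch for comparing checkpoints, assuming the checkpoints are written to the checkpoint directory as .hd5 files (the filename pattern is an assumption; adjust the glob to match your checkpoints):

```bash
# Evaluate each checkpoint against the validation set so the
# best-performing weights can be identified.
for w in /tmp/CNN47_retrain_checkpoint_dir/*.hd5; do
    echo "== $w =="
    python keras_CNN/keras_evaluate.py \
        --window \
        data/CNN_models/CNN47.json \
        "$w" \
        data/CNN_datasets/S12vS45_s47_VALIDATION.hd5
done
```
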
### Building datasets from LIDC-IDRI
Start by downloading the data files for LIDC-IDRI (https://wiki.cancerimagingarchive.net/display/Public/LIDC-IDRI) and extract the DOI folder into `data`.

Run the script `dicom_and_image_tools/simplify_doi_structure.sh` against the DOI directory (you may choose to create symlinks with the `-s` option).  This produces a flattened directory structure with naming based on patient identifiers.  It can also be useful to run `dicom_and_image_tools/rename_dicom_by_position.py` against each of the patient directories to name the .dcm files themselves in order of 'sliceNo' (not necessary, but it makes the files easier to reason about).  Both steps are sketched below.
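
A sketch of both steps; the exact arguments of these scripts are assumptions here (check each script's `--help` or source for its real usage), and `data/DOI` and `data/flattened` are placeholders for your actual paths:

```bash
# Flatten the DOI tree (with -s to create symlinks instead of copies).
dicom_and_image_tools/simplify_doi_structure.sh -s data/DOI

# Optionally rename the .dcm files in each patient directory by slice order.
for d in data/flattened/*/; do
    python dicom_and_image_tools/rename_dicom_by_position.py "$d"
done
```
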
#### Extracting CNN Cubes
Extract input volumes for the CNN using the script `dicom_and_image_tools/create_data_file_from_seed_points.py`.  Candidate nodule lists are in the directory `data/nodule_lists`.  Run the script using the top level of your flattened data directory from the previous step as `dicom_dir`; a sketch follows.
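
A sketch only (the argument order is an assumption; run the script with `--help` for its actual usage), using placeholder names for the candidate list and the flattened directory:

```bash
# Build a CNN input-volume data file from a candidate seed-point list
# (argument order is an assumption; see --help for the actual usage).
python dicom_and_image_tools/create_data_file_from_seed_points.py \
    data/nodule_lists/<candidate-list-file> \
    data/<flattened-dicom-dir>
```
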
#### Creating Segmentation Masks for QIF Feature Computation
The QIF feature extraction code requires the original image (the "grey image") and an image in which every pixel inside the ROI is set to 1 and every other pixel is set to 0 (the "binary image").  To create the binary images, use `dicom_and_image_tools/segment_to_binary_image.py` as follows (both invocations are sketched after this list):

* **For "nodules"**:  For any nodule with a malignancy rating 1-5, segmentations are provided by LIDC-IDRI.  Run the `segment_to_binary_image.py` script with the `--candidates` option pointing to the candidates file (from `data/nodule_lists`) and the `--segmented-only` option.  This works for the "S12vS45", "S1vS45" and "NvNN_nodule-only" candidate lists.
57
* **For "non-nodules"**: For the "non-nodule" dataset (the "NvNN_non-nodule-only" candidate list), a segmentation must be algorithmically generated for each "non-nodule" seed point.  Run the `segment_to_binary_image.py` script with the `--candidates` option pointing to the candidates file (from `data/nodule_lists`).  The segmentation process will take some time.
58
59
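
A sketch of both invocations; the `--candidates` and `--segmented-only` options come from the steps above, but the positional arguments are assumptions (run with `--help` for the actual usage), and the output path here feeds the conversion step below:

```bash
# Nodules: use the LIDC-provided segmentations.
python dicom_and_image_tools/segment_to_binary_image.py \
    --candidates data/nodule_lists/<nodule-candidate-list> \
    --segmented-only \
    data/<flattened-dicom-dir> \
    data/binary_dicom

# Non-nodules: segment each seed point algorithmically (slow).
python dicom_and_image_tools/segment_to_binary_image.py \
    --candidates data/nodule_lists/<non-nodule-candidate-list> \
    data/<flattened-dicom-dir> \
    data/binary_dicom
```
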
#### Converting to Analyze format for QIF Feature Computation
The QIF feature extraction code requires the Analyze format for its input files; LIDC-IDRI data is in DICOM format.  To convert the DICOM files (both "grey" and "binary"; see above) to Analyze, the tool `dicom_and_image_tools/dicom_to_analyze.py` is provided.  Run it for each patient's scan (producing the "grey" images) and for each nodule segmentation ("binary image") file.

For example, if you placed your binary DICOM images in a directory `data/binary_dicom`, you could run something similar to the following to convert all nodules for a single patient:

```bash
p=<YOUR-LIDC-IDRI-PATIENT-ID>
for n in data/binary_dicom/"$p"/*; do
    echo "Converting nodule $n"
    python dicom_and_image_tools/dicom_to_analyze.py \
        "$n" \
        "data/binary_analyze/$p/$(basename "$n")" \
        && echo "OK" \
        || echo "FAILED converting nodule $n"
done
```

Here `<YOUR-LIDC-IDRI-PATIENT-ID>` is the patient ID for the patient whose nodules you are converting.  Adjust paths according to your local directory layout as necessary.

#### Extracting QIF Features
See the file `README.md` in the `QIF_extraction` directory for the steps required to compute QIF features given the "grey" and "binary" images in Analyze format as described above.