# README #

## Analysis and experiments with early cancer detection in lung CT scans ##

### What is this repository for? ###

* Experimenting with approaches for processing the lung CT scans that are publicly available from the Kaggle competition Data Science Bowl 2017.
* Experimenting with deep neural networks for training a model that helps with early cancer detection.

### How do I get set up? ###

1. The project is written in Python. The easiest way to set up the environment is using Anaconda, which can be downloaded from https://www.continuum.io/downloads. Python 3.x is required; the preferred version of Anaconda is the one using Python 3.6.

2. Required Python modules:

   * **pydicom**
   * **scikit-image**
   * **scikit-learn**
   * **scipy**
   * **pandas**
   * **imutils**
   * **tensorflow**
   * **google-api-python-client**
   * **google-cloud-storage**
   * **opencv** module for Python

   All of the modules except for tensorflow, opencv (cv2) and scikit-image can be easily installed using either

```
      pip install <module_name>
```

   or

```
      conda install <module_name>
```

   Some of the modules, however, should be installed using conda, since Anaconda handles some of the environment issues and provides precompiled binaries for them.

   Installing scikit-image:

```
      conda install scikit-image
```

   Installing opencv:

```
      conda install -c menpo opencv3
```

   Before installing any modules, start by setting up an Anaconda environment suitable for installing the tensorflow library.

   Creating CPU tensorflow environment:

```
      conda create --name tensorflow python=3.5
      activate tensorflow
      conda install jupyter
      conda install scipy
      pip install tensorflow
```

   You should specify the Python version as 3.5 when creating the environment.

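   To check that the installation works, you can, for example, print the installed tensorflow version:

```
      python -c "import tensorflow as tf; print(tf.__version__)"
```
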
   Creating GPU tensorflow environment:

```
      conda create --name tensorflow-gpu python=3.5
      activate tensorflow-gpu
      conda install jupyter
      conda install scipy
      pip install tensorflow-gpu
```

  To switch between environments, simply use ***deactivate*** to deactivate the current one and then activate the desired one, e.g. ***activate tensorflow-gpu*** if you want to switch to the tensorflow environment with GPU support.
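
  For instance, to move from the CPU environment to the GPU one:

```
      deactivate
      activate tensorflow-gpu
```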

  In order to run tensorflow with GPU support you must also install the CUDA Toolkit and cuDNN:

  *   Installing on Windows - https://www.tensorflow.org/install/install_windows
  *   Installing on Ubuntu - https://www.tensorflow.org/install/install_linux
  *   Installing on Mac OS X - https://www.tensorflow.org/install/install_mac

  After installing tensorflow, you can simply use the requirements.txt file provided in the project. Execute the following line:

```
      pip install -r requirements.txt
```

  If you are unable to install some of the modules listed in the requirements file, remove them from it and try installing them with:

```
      conda install <module-name>
```

### How to start data preprocessing? ###

To start processing the DICOM files, run:

```
      python preprocess_dicoms.py
```

  This step is not required, however. Since the original images are large and the preprocessing is time consuming, the first stage of image preprocessing has already been executed and the resulting data is stored in Google Cloud in several buckets:

   *  Baseline - https://console.cloud.google.com/storage/browser/baseline-preprocess/baseline_preprocessing/?project=lung-cancer-tests

   *  Morphological operations segmentation - https://console.cloud.google.com/storage/browser/segmented-lungs/segmented_morph_op/?project=lung-cancer-tests

   *  Watershed segmentation - https://console.cloud.google.com/storage/browser/segmented-lungs-watershed/segmented_watershed/?project=lung-cancer-tests

  To download the data required for training the model, simply execute:

```
      python data_collector.py
```

  Compressed 3D patient images will be downloaded and, by default, stored under the ***./fetch_data/*** directory.
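
  Since each patient image is stored as a compressed numpy archive, a downloaded file can be inspected directly with numpy. The snippet below is only a sketch: the file name is a placeholder, and the ***arr_0*** key (numpy's default for arrays saved without an explicit keyword) is an assumption about how the volumes are stored.

```
import numpy as np

# '<patient_id>.npz' is a placeholder file name; 'arr_0' is an assumed key.
with np.load('./fetch_data/<patient_id>.npz') as archive:
    volume = archive['arr_0']

print(volume.shape)  # a 3D array, e.g. (slices, height, width)
```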

  To select which model will be trained, change the value of SELECTED_MODEL in ***config.py*** (simply choose one of the predefined model names and the other configuration values will be changed correspondingly; see the illustrative sketch after the list below).

  Source and destination directories are configurable via ***config.py***:

   * ALL_IMGS points to the directory with the original DICOM files for each patient (you do not need to edit it if you have used the download script to store the preprocessed data, as mentioned in the previous step)

   * SEGMENTED_LUNGS_DIR points to the directory where the segmented lungs will be stored in a .npz file (a compressed numpy array) for each patient. The directory is configured automatically depending on the selected model to be trained.
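
  For illustration only, the relevant ***config.py*** entries might look roughly like the sketch below; the concrete model names and paths are assumptions, so consult the actual file for the real values:

```
# Illustrative sketch of config.py -- names and values are assumptions.
SELECTED_MODEL = 'baseline'        # one of the predefined model names

ALL_IMGS = './fetch_data/'         # directory with the patient images
SEGMENTED_LUNGS_DIR = './segmented_' + SELECTED_MODEL + '/'  # set automatically per model
```
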
### How to start model training? ###

 To start model training, simply execute:

```
      python model_train.py
```

 Configuration and definitions of the layers for the CNN are described in Python files located under ***model_definition***. Three configurations are currently available:

  * baseline.py - Baseline configuration with three convolutional layers and two fully connected ones.

  * additional_layers.py - A deeper network with four convolutional and three fully connected layers.

  * default.py - The default configuration also has seven layers in total, but some of the filters have different sizes from those in the previous configuration.

### How to evaluate a stored model? ###

 You can evaluate the results on the training set for an already stored model. Network states are stored under the directory pointed to by the MODELS_STORE_DIR configuration.
 For each of the trained models a directory with the name of the model is created, and all stored states are saved there. To evaluate the network at one of the saved states,
 change RESTORE_MODEL_CKPT to point to the checkpoint file with the desired name. Then simply execute from the command line:

```
      python trained_model_loader.py
```

 The solution will be stored in a csv file whose location and name are configured via SOLUTION_FILE_PATH; by default it is ***'./solution_last.csv'***. In the command line you will also see an evaluation of the solution: a confusion matrix for the training set, logarithmic loss, accuracy, sensitivity / recall and specificity. A csv report will also be generated with the predicted results and the exact labels of the test data. The report name is constructed from the solution file name by prepending report_, and the report is located in the same directory as the solution file.
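
 As a reference for reading that output, the reported metrics follow the usual definitions over a binary confusion matrix. The sketch below is a generic illustration, not the project's evaluation code:

```
# Generic illustration of the reported metrics -- not the project's code.
def evaluation_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)    # also reported as recall
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity

# Example: 30 TP, 10 FP, 50 TN, 10 FN
print(evaluation_metrics(30, 10, 50, 10))  # (0.8, 0.75, 0.8333...)
```
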
### Other configurations? ###

Other properties that might be configured are related to storing model states and to the summaries exported during training.

  * SUMMARIES_DIR - points to the directory where summaries for the error, accuracy and sensitivity are exported during training. The data can be viewed using Tensorboard, as shown below.
  * RESTORE_MODEL_CKPT - points to the checkpoint file you might want to resume training from, or simply use for evaluating test examples with the saved state of the network (if you want to resume training, set RESTORE to True and set START_STEP so the epochs are counted properly).
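
For example, assuming the summaries were exported to a ***./summaries*** directory (the actual path is whatever SUMMARIES_DIR points to), they can be viewed by starting Tensorboard and opening the printed URL (usually http://localhost:6006) in a browser:

```
      tensorboard --logdir ./summaries
```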