# DeepSlide: A Sliding Window Framework for Classification of High Resolution Microscopy Images (Whole-Slide Images)

This repository is a sliding window framework for classification of high-resolution whole-slide images, often called microscopy or histopathology images. This is also the code for the paper [Pathologist-level Classification of Histologic Patterns on Resected Lung Adenocarcinoma Slides with Deep Neural Networks](https://www.nature.com/articles/s41598-019-40041-7). For a practical guide and implementation tips, see the Medium post [Classification of Histopathology Images with Deep Learning: A Practical Guide](https://medium.com/health-data-science/classification-of-histopathology-images-with-deep-learning-a-practical-guide-2e3ffd6d59c5).

We have made publicly available 143 digitized high-resolution histology slides of lung adenocarcinoma from the test set, along with their predominant subtypes according to the consensus opinion of three pathologists at Dartmouth-Hitchcock Medical Center. More information about this dataset and instructions for downloading it are provided on the [dataset webpage](https://bmirds.github.io/LungCancer).

*For questions about our code, please open an issue on this code repository.*

![alt text](figures/figure-2-color.jpeg)
## Requirements

- [imageio](https://pypi.org/project/imageio/)
- [NumPy 1.16+](https://numpy.org/)
- [OpenCV](https://opencv.org/)
- [OpenSlide](https://openslide.org/)
- [OpenSlide Python](https://openslide.org/api/python/)
- [pandas](https://pandas.pydata.org/)
- [PIL](https://pillow.readthedocs.io/en/5.3.x/)
- [Python 3.7+](https://www.python.org/downloads/release/python-360/)
- [PyTorch](https://pytorch.org/)
- [scikit-image](https://scikit-image.org/)
- [scikit-learn](https://scikit-learn.org/stable/install.html)
- [SciPy](https://www.scipy.org/)
- [NVIDIA GPU](https://www.nvidia.com/en-us/)
- [Ubuntu](https://ubuntu.com/)
## Installing Dependencies (Recommended method)

`conda env create --file setup/conda_env.yaml`

This command creates a conda environment called `deepslide_env` with Python 3.9 and PyTorch with CUDA 11.3. Please modify the environment file(s) for other versions.

In addition, `install_openslide.sh` installs the dependencies of the OpenSlide package on Ubuntu. For other platforms, please visit the official OpenSlide website for more information.
# Usage

Take a look at `code/config.py` before you begin to get a feel for what parameters can be changed.
## 1. Train-Val-Test Split:

Splits the data into a validation and test set. The default number of validation whole-slide images (WSI) per class is 20, and the default number of test WSI per class is 30. You can change these numbers with the `--val_wsi_per_class` and `--test_wsi_per_class` flags at runtime. You can skip this step if you have already done a custom split (for example, if you need to split by patient).

```
python code/1_split.py
```

If you do not want to duplicate the data, append `--keep_orig_copy False` to the above command.

**Inputs**: `all_wsi`

**Outputs**: `wsi_train`, `wsi_val`, `wsi_test`, `labels_train.csv`, `labels_val.csv`, `labels_test.csv`

Note that `all_wsi` must contain subfolders of images labeled by class. For instance, if your two classes are `a` and `n`, you must have `a/*.jpg` with the images in class `a` and `n/*.jpg` with the images in class `n`.

If you already have a patch-based preprocessed dataset, you may skip to Stage 3 for model training. Please make sure that at least:

1. `all_wsi` has a folder for each class as a placeholder (they can be empty).

2. Both `train_folder/train` and `train_folder/val` contain a folder for each class and, inside each class folder, a folder for each slide that belongs to that partition. Each slide folder should contain at least one patch extracted from the slide (e.g., `train_folder/train/<class_name>/<slide_name>/<patch_file>`). See the example layout at the end of this section.

3. Review `code/config.py` and make appropriate/necessary changes for your dataset.

### Example
```
python code/1_split.py --val_wsi_per_class 10 --test_wsi_per_class 20
```
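For reference, the block below shows a minimal example of the directory layout that the prerequisites above describe, assuming two classes `a` and `n`; the slide and patch file names are hypothetical placeholders.

```
all_wsi/
    a/slide_001.jpg
    a/slide_002.jpg
    n/slide_003.jpg
    n/slide_004.jpg

train_folder/
    train/a/slide_001/patch_0.jpg
    train/n/slide_003/patch_0.jpg
    val/a/slide_002/patch_0.jpg
    val/n/slide_004/patch_0.jpg
```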
## 2. Data Processing

- Generate patches for the training set.
- Balance the class distribution for the training set.
- Generate patches for the validation set.
- Generate patches by folder for WSI in the validation set.
- Generate patches by folder for WSI in the testing set.

```
python code/2_process_patches.py
```

Note that this will take up a significant amount of space. Change `--num_train_per_class` to a smaller value if you do not wish to generate as many windows. If your histopathology images are H&E-stained, whitespace will automatically be filtered; turn this off with the option `--type_histopath False`. The default overlapping area for test slides is 1/3. Use 1 or 2 if your images are very large; you can change this with the `--slide_overlap` option (see the sliding-window sketch at the end of this section).

**Inputs**: `wsi_train`, `wsi_val`, `wsi_test`

**Outputs**: `train_folder` (fed into model for training), `patches_eval_val` (for validation, sorted by WSI), `patches_eval_test` (for testing, sorted by WSI)

### Example
```
python code/2_process_patches.py --num_train_per_class 20000 --slide_overlap 2
```
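To make the patch-generation step concrete, here is a minimal sketch of a sliding window over a single slide image. It is not the repository's exact implementation (`code/2_process_patches.py` also handles class balancing, whitespace filtering, and the output folder layout), and the default patch size and the overlap arithmetic shown here are illustrative assumptions.

```
from PIL import Image

def extract_patches(image_path, patch_size=224, overlap_fraction=1/3):
    """Yield (x, y, patch) tuples from a sliding window over one slide image.

    overlap_fraction is an illustrative assumption: adjacent windows share
    that fraction of their width/height, so the stride is
    patch_size * (1 - overlap_fraction).
    """
    image = Image.open(image_path)
    width, height = image.size
    stride = max(1, int(patch_size * (1 - overlap_fraction)))

    for y in range(0, height - patch_size + 1, stride):
        for x in range(0, width - patch_size + 1, stride):
            # Crop one window; a real pipeline would also filter background here.
            patch = image.crop((x, y, x + patch_size, y + patch_size))
            yield x, y, patch

# Example usage (hypothetical file names):
# for x, y, patch in extract_patches("wsi_val/a/slide_002.jpg"):
#     patch.save(f"patches_eval_val/a/slide_002/{x}_{y}.jpg")
```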
## 3. Model Training

```
CUDA_VISIBLE_DEVICES=0 python code/3_train.py
```

We recommend using ResNet-18 if you are training on a relatively small histopathology dataset. You can change hyperparameters using the `argparse` flags. There is an option to retrain from a previous checkpoint. Model checkpoints are saved by default every epoch in `checkpoints`.

**Inputs**: `train_folder`

**Outputs**: `checkpoints`, `logs`

### Example
```
CUDA_VISIBLE_DEVICES=0 python code/3_train.py --batch_size 32 --num_epochs 100 --save_interval 5
```
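For readers new to PyTorch, the per-epoch checkpointing and resume option mentioned above generally follow the standard pattern below; this is a generic sketch with assumed names (`model`, `optimizer`, the checkpoint file), not the exact training loop in `code/3_train.py`.

```
import torch
import torchvision

# Generic save/resume pattern (illustrative; not the exact deepslide code).
model = torchvision.models.resnet18(num_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def save_checkpoint(path, model, optimizer, epoch):
    # Persist everything needed to resume training from this epoch.
    torch.save({"epoch": epoch,
                "model_state": model.state_dict(),
                "optimizer_state": optimizer.state_dict()}, path)

def load_checkpoint(path, model, optimizer):
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state"])
    optimizer.load_state_dict(checkpoint["optimizer_state"])
    return checkpoint["epoch"]

# e.g. save_checkpoint("checkpoints/resnet18_e5.pt", model, optimizer, epoch=5)
```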
## 4. Testing on WSI

Run the model on all the patches for each WSI in the validation and test set.

```
CUDA_VISIBLE_DEVICES=0 python code/4_test.py
```

We automatically choose the model with the best validation accuracy. You can also specify your own. You can change the thresholds used in the grid search by specifying the `threshold_search` variable in `code/config.py`.

**Inputs**: `patches_eval_val`, `patches_eval_test`

**Outputs**: `preds_val`, `preds_test`

### Example
```
CUDA_VISIBLE_DEVICES=0 python code/4_test.py --auto_select False
```
## 5. Searching for Best Thresholds

The simplest way to make a whole-slide inference is to choose the class with the most patch predictions. We can also apply thresholding at the patch level to throw out noise. To find the best thresholds, we perform a grid search. This script generates CSV files for each WSI with the predictions for each patch (see the sketch at the end of this section).

```
python code/5_grid_search.py
```

**Inputs**: `preds_val`, `labels_val.csv`

**Outputs**: `inference_val`

### Example
```
python code/5_grid_search.py --preds_val different_labels_val.csv
```
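As an illustration of the idea behind this step, here is a hedged sketch of threshold-based whole-slide aggregation over per-patch predictions; the CSV column names and the threshold grid are assumptions for the example, not the exact format produced by `code/5_grid_search.py`.

```
from collections import Counter

import pandas as pd

def predict_slide(patch_csv, thresholds):
    """Aggregate per-patch predictions into one whole-slide label.

    Assumes (hypothetically) that the CSV has 'prediction' and 'confidence'
    columns; patches below their class threshold are discarded as noise, and
    the most common remaining class wins.
    """
    patches = pd.read_csv(patch_csv)
    kept = patches[patches.apply(
        lambda row: row["confidence"] >= thresholds.get(row["prediction"], 0.0),
        axis=1)]
    if kept.empty:
        return "unknown"
    return Counter(kept["prediction"]).most_common(1)[0][0]

# A grid search then tries many threshold combinations and keeps the one with
# the best validation accuracy, e.g.:
# for t in (0.5, 0.6, 0.7, 0.8, 0.9):
#     accuracy = evaluate({"a": t, "n": t})  # hypothetical helper
```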
## 6. Visualization

A good way to see what the network is looking at is to visualize the predictions for each class.

```
python code/6_visualize.py
```

**Inputs**: `wsi_val`, `preds_val`

**Outputs**: `vis_val`

You can change the colors in `colors` in `code/config.py`.

![alt text](figures/sample.jpeg)

### Example
```
python code/6_visualize.py --vis_test different_vis_test_directory
```
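To give a sense of how such visualizations can be produced, here is a hedged sketch that draws colored patch outlines onto a slide image with Pillow; the per-patch CSV columns (`x`, `y`, `prediction`) and the color mapping are assumptions for the example, not the exact behavior of `code/6_visualize.py`.

```
import pandas as pd
from PIL import Image, ImageDraw

# Hypothetical color mapping; the real mapping lives in `colors` in code/config.py.
CLASS_COLORS = {"a": "red", "n": "blue"}

def draw_patch_predictions(slide_path, patch_csv, out_path, patch_size=224):
    """Outline each predicted patch on the slide in its class color."""
    slide = Image.open(slide_path).convert("RGB")
    draw = ImageDraw.Draw(slide)
    for _, row in pd.read_csv(patch_csv).iterrows():
        color = CLASS_COLORS.get(row["prediction"], "white")
        draw.rectangle((row["x"], row["y"],
                        row["x"] + patch_size, row["y"] + patch_size),
                       outline=color, width=5)
    slide.save(out_path)

# e.g. draw_patch_predictions("wsi_val/a/slide_002.jpg",
#                             "preds_val/slide_002.csv",
#                             "vis_val/slide_002.jpg")
```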
## 7. Final Testing

Do the final testing to compute the confusion matrix on the test set.

```
python code/7_final_test.py
```

**Inputs**: `preds_test`, `labels_test.csv`, `inference_val` and `labels_val` (for the best thresholds)

**Outputs**: `inference_test` and confusion matrix to stdout

### Example
```
python code/7_final_test.py --labels_test different_labels_test.csv
```
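If you want to recompute or inspect the confusion matrix yourself, a minimal sketch using scikit-learn (already listed in the requirements) looks like the following; the file names and column names are assumptions for the example.

```
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical file/column names: each CSV maps an image id to a class label.
truth = pd.read_csv("labels_test.csv").set_index("image_id")["class"]
preds = pd.read_csv("inference_test.csv").set_index("image_id")["prediction"]

labels = sorted(truth.unique())
# Rows are true classes, columns are predicted classes.
print(pd.DataFrame(confusion_matrix(truth.loc[preds.index], preds, labels=labels),
                   index=labels, columns=labels))
```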
Best of luck.

# Quick Run

If you want to run the entire pipeline, run

```
sh code/run_all.sh
```

and change the desired flags on each line of the `code/run_all.sh` script to override the default parameters in `code/config.py`.
# Pre-Processing Scripts

See `code/z_preprocessing` for some code to convert images from SVS to JPEG. This uses OpenSlide and takes a while. How much you want to compress the images will depend on the resolution at which they were originally scanned, but a guideline that has worked for us is 3-5 MB per WSI.
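As a rough illustration of what such a conversion looks like, here is a hedged sketch using OpenSlide Python; the chosen pyramid level, output quality, and file names are assumptions, and `code/z_preprocessing` remains the reference implementation.

```
import openslide

def svs_to_jpeg(svs_path, jpeg_path, level=2, quality=90):
    """Read one pyramid level of an SVS slide and save it as a JPEG.

    Higher `level` values are more downsampled; pick the level (and JPEG
    quality) that brings each WSI to roughly 3-5 MB.
    """
    slide = openslide.OpenSlide(svs_path)
    level = min(level, slide.level_count - 1)
    region = slide.read_region((0, 0), level, slide.level_dimensions[level])
    region.convert("RGB").save(jpeg_path, "JPEG", quality=quality)
    slide.close()

# e.g. svs_to_jpeg("raw_slides/slide_001.svs", "all_wsi/a/slide_001.jpg")
```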
# Known Issues and Limitations

- Only 1 GPU is supported.
- Should work on Windows, but this has not been tested.
- In cases where no crops are found for an image, empty directories are created. The current workaround uses `try` and `except` statements to catch errors.
- The image reading code expects colors to be in the RGB space. The current workaround is to keep the first 3 channels.
- This code will likely work better when the labels are at the tissue level. It will still work for the entire WSI, but results may vary.
# Still not working? Consider the following...

- Ask a pathologist to look at your visualizations.
- Make your own heuristic for aggregating patch predictions to determine the WSI-level classification. Often, a slide that's 20% abnormal and 80% normal should be classified as abnormal (see the sketch after this list).
- If each WSI can have multiple types of lesions/labels, you may need to annotate bounding boxes around these.
- Did you pre-process your images? If you used raw .svs files that are more than 1 GB in size, it's likely that the patches are way too zoomed in to see any cell structures.
- If you have fewer than 10 WSI per class in the training set, obtain more.
- Feel free to view our end-to-end attention-based model in JAMA Network Open: [https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2753982](https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2753982).
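As one concrete example of such a heuristic, here is a hedged sketch that flags a slide as abnormal once the fraction of abnormal patches crosses a chosen cutoff; the class names and the 20% cutoff are illustrative assumptions.

```
def classify_slide(patch_predictions, abnormal_class="a", cutoff=0.2):
    """Label a slide abnormal if enough of its patches look abnormal.

    `patch_predictions` is a list of per-patch class labels; with the default
    cutoff, a slide that is 20% abnormal and 80% normal is called abnormal.
    """
    if not patch_predictions:
        return "unknown"
    abnormal_fraction = patch_predictions.count(abnormal_class) / len(patch_predictions)
    return abnormal_class if abnormal_fraction >= cutoff else "n"

# e.g. classify_slide(["n"] * 80 + ["a"] * 20) returns "a"
```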
# Future Work

- Contributions to this repository are welcome.
- Code for generating patches on the fly, instead of storing them on disk for training and testing, would save a lot of disk space.
- If you have issues, please post in the issues section and we will do our best to help.
# Citations

DeepSlide is an open-source library and is licensed under the [GNU General Public License (v3)](https://www.gnu.org/licenses/gpl-3.0.en.html). If you are using this library, please cite:

```Jason Wei, Laura Tafe, Yevgeniy Linnik, Louis Vaickus, Naofumi Tomita, Saeed Hassanpour, "Pathologist-level Classification of Histologic Patterns on Resected Lung Adenocarcinoma Slides with Deep Neural Networks", Scientific Reports;9:3358 (2019).```