TOAD 🐸 <img src="UNP.png" width="325px" align="right" />
===========
### AI-based Pathology Predicts Origins for Cancers of Unknown Primary
*Nature*

[Read Link](https://t.co/HTkIdg55Lw?amp=1) | [Journal Link](https://dx.doi.org/10.1038/s41586-021-03512-4) | [Interactive Demo](http://toad.mahmoodlab.org) | [Cite](#reference)

*TL;DR: In this work we propose to use weakly-supervised, multi-task computational pathology to aid the differential diagnosis of cancers of unknown primary (CUP). CUPs represent 1-2% of all cancers and carry a poor prognosis because modern cancer treatment is specific to the primary site. We present TOAD (Tumor Origin Assessment via Deep-learning) for predicting the primary origin of these tumors from H&E images, without using immunohistochemistry, molecular testing or clinical correlation. Our model is trained on 22,833 gigapixel diagnostic whole slide images (WSIs) spanning 18 primary cancer origins, and tested on a held-out set of 6,499 WSIs and an external set of 682 WSIs from 200+ institutions. Furthermore, we curated a large multi-institutional dataset of 743 CUP cases originating from 150+ different medical centers and validated our model against a subset of 317 cases for which a primary differential was assigned based on evidence from extensive IHC testing, radiologic and/or clinical correlation.*

© This code is made available for non-commercial academic purposes.

## TOAD: Tumor Origin Assessment via Deep-learning

## Pre-requisites:
* Linux (Tested on Ubuntu 18.04)
* NVIDIA GPU (Tested on Nvidia GeForce RTX 2080 Ti x 16)
* Python (3.7.7), h5py (2.10.0), matplotlib (3.1.1), numpy (1.18.1), opencv-python (4.1.1), openslide-python (1.1.1), openslide (3.4.1), pandas (1.0.3), pillow (7.0.0), PyTorch (1.5.1), scikit-learn (0.22.1), scipy (1.3.1), tensorflow (1.14.0), tensorboardx (1.9), torchvision (0.6).

### Installation Guide for Linux (using anaconda)
[Installation Guide](https://github.com/mahmoodlab/CLAM/blob/master/docs/INSTALLATION.md)

### Data Preparation
We chose to encode each tissue patch with a 1024-dim feature vector using a truncated, pretrained ResNet50. For each WSI, these features are expected to be saved as a torch tensor of size N x 1024, where N is the number of patches extracted from the WSI (this varies from slide to slide). The following folder structure is assumed:
```bash
DATA_ROOT_DIR/
    └──DATASET_DIR/
         ├── h5_files
                ├── slide_1.h5
                ├── slide_2.h5
                └── ...
         └── pt_files
                ├── slide_1.pt
                ├── slide_2.pt
                └── ...
```
DATA_ROOT_DIR is the base directory of all datasets (e.g. the directory to your SSD). DATASET_DIR is the name of the folder containing data specific to one experiment; the features from each slide are stored as .pt files.

Please refer to [CLAM](https://github.com/mahmoodlab/CLAM) for examples of how to perform this feature extraction step.
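
Once the features have been extracted, each slide's N x 1024 matrix just needs to be saved into the layout above. A minimal sketch, assuming the features are already in memory as a torch tensor (the helper name and example call are hypothetical):
```python
import os
import torch

def save_slide_features(features: torch.Tensor, slide_id: str, dataset_dir: str):
    """Save an N x 1024 patch-feature matrix where the dataset object expects it."""
    assert features.ndim == 2 and features.shape[1] == 1024
    pt_dir = os.path.join(dataset_dir, 'pt_files')
    os.makedirs(pt_dir, exist_ok=True)
    # The file name must match the slide_id used in the dataset csv.
    torch.save(features, os.path.join(pt_dir, slide_id + '.pt'))

# e.g. save_slide_features(features, 'slide_1', 'DATA_ROOT_DIR/DATASET_DIR')
```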

### Datasets
Datasets are expected to be prepared in a csv format containing at least 5 columns: **case_id**, **slide_id**, **sex**, and two columns for the slide-level labels: **label** and **site**. Each **case_id** is a unique identifier for a patient, while each **slide_id** is a unique identifier for a slide and corresponds to the name of an extracted feature .pt file. This distinction is necessary because one patient often has multiple slides, which might also have different labels. When train/val/test splits are created, we also make sure that slides from the same patient do not end up in different splits. The slide ids should be consistent with those used during the feature extraction step. We provide a dummy example of a dataset csv file in the **dataset_csv** folder, named **dummy_dataset.csv**. You are free to input the labels for your data in any way, as long as you specify the appropriate dictionary maps under the **label_dicts** argument of the dataset object's constructor (see below). For demonstration purposes, we used 'M' and 'F' for sex and 'Primary' and 'Metastatic' for the site. Our 18 classes of tumor origins are labeled 'Lung', 'Breast', 'Colorectal', 'Ovarian', 'Pancreatobiliary', 'Adrenal', 'Skin', 'Prostate', 'Renal', 'Bladder', 'Esophagogastric', 'Thyroid', 'Head Neck', 'Glioma', 'Germ Cell', 'Endometrial', 'Cervix', and 'Liver'.
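
As a reference for the expected layout, the sketch below writes a tiny csv in this format; the patient/slide ids and label values are hypothetical placeholders (see **dummy_dataset.csv** for the real example):
```python
import pandas as pd

# Hypothetical rows; each slide_id must match the name of a .pt feature file.
df = pd.DataFrame({
    'case_id':  ['patient_0', 'patient_0', 'patient_1'],
    'slide_id': ['slide_1', 'slide_2', 'slide_3'],
    'sex':      ['F', 'F', 'M'],
    'label':    ['Lung', 'Lung', 'Breast'],           # one of the 18 origin classes
    'site':     ['Primary', 'Metastatic', 'Primary'],
})
df.to_csv('dataset_csv/my_dataset.csv', index=False)
```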

Dataset objects used for actual training/validation/testing can be constructed using the **Generic_MIL_MTL_Dataset** class (defined in **datasets/dataset_mtl_concat.py**). Examples of such dataset objects passed to the models can be found in both **main_mtl_concat.py** and **eval_mtl_concat.py**.

For training, look under **main_mtl_concat.py**:
```python 
if args.task == 'dummy_mtl_concat':
    args.n_classes=18
    dataset = Generic_MIL_MTL_Dataset(csv_path = 'dataset_csv/dummy_dataset.csv',
                            data_dir= os.path.join(args.data_root_dir, 'DATASET_DIR'),
                            shuffle = False, 
                            seed = args.seed, 
                            print_info = True,
                            label_dicts = [{'Lung':0, 'Breast':1, 'Colorectal':2, 'Ovarian':3,
                                            'Pancreatobiliary':4, 'Adrenal':5,
                                            'Skin':6, 'Prostate':7, 'Renal':8, 'Bladder':9,
                                            'Esophagogastric':10, 'Thyroid':11,
                                            'Head Neck':12, 'Glioma':13,
                                            'Germ Cell':14, 'Endometrial':15,
                                            'Cervix':16, 'Liver':17},
                                           {'Primary':0, 'Metastatic':1},
                                           {'F':0, 'M':1}],
                            label_cols = ['label', 'site', 'sex'],
                            patient_strat= False)
```
In addition to the number of classes (args.n_classes), the following arguments need to be specified:
* csv_path (str): Path to the dataset csv file
* data_dir (str): Path to saved .pt features for the dataset
* label_dicts (list of dict): List of dictionaries with key, value pairs for converting str labels to int for each label column
* label_cols (list of str): List of column headings to use as labels and map with label_dicts (see the sketch after this list)
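
As an illustration of how **label_cols** and **label_dicts** line up, the sketch below mirrors the mapping the dataset class performs internally; the truncated origin dictionary and the standalone mapping loop are illustrative only, not the class's actual code:
```python
import pandas as pd

label_cols = ['label', 'site', 'sex']
label_dicts = [{'Lung': 0, 'Breast': 1},   # truncated: the real dict has all 18 origins
               {'Primary': 0, 'Metastatic': 1},
               {'F': 0, 'M': 1}]

df = pd.read_csv('dataset_csv/dummy_dataset.csv')
# The i-th dictionary in label_dicts converts the i-th column named in label_cols to ints.
for col, mapping in zip(label_cols, label_dicts):
    df[col] = df[col].map(mapping)
```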

Finally, the 'task' defined by this dataset object should be added as one of the choices of the --task argument, as shown below:

```python
parser.add_argument('--task', type=str, choices=['dummy_mtl_concat'])
```

### Training Splits
For evaluating the algorithm's performance, we randomly partitioned our dataset into training, validation and test splits. Example 70/10/20 splits for the dummy dataset can be found in **splits/dummy_mtl_concat**. These splits can be automatically generated using the **create_splits.py** script with minimal modification, just like with **main_mtl_concat.py**. For example, the dummy splits were created by calling:
``` shell
python create_splits.py --task dummy_mtl_concat --seed 1 --k 1
```
The script uses the **Generic_WSI_MTL_Dataset** class, whose constructor expects the same arguments as **Generic_MIL_MTL_Dataset** (without the data_dir argument). For details, please refer to the dataset definition in **datasets/dataset_mtl_concat.py**.
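
As a quick sanity check that slides from the same patient stayed in the same split, something like the following can be run; the per-fold file name (splits_0.csv) and its train/val/test column layout are assumptions based on the dummy example, so adjust them to match your generated files:
```python
import pandas as pd

# Assumed layout: one csv per fold with 'train', 'val' and 'test' columns of slide ids.
splits = pd.read_csv('splits/dummy_mtl_concat/splits_0.csv')
dataset = pd.read_csv('dataset_csv/dummy_dataset.csv')
slide_to_case = dict(zip(dataset['slide_id'], dataset['case_id']))

# Map each split's slide ids back to patient ids and check for overlap.
cases = {col: {slide_to_case[s] for s in splits[col].dropna()}
         for col in ['train', 'val', 'test']}
assert not (cases['train'] & cases['val']), 'patient leakage between train and val'
assert not (cases['train'] & cases['test']), 'patient leakage between train and test'
assert not (cases['val'] & cases['test']), 'patient leakage between val and test'
```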

### Training
``` shell
CUDA_VISIBLE_DEVICES=0 python main_mtl_concat.py --drop_out --early_stopping --lr 2e-4 --k 1 --exp_code dummy_mtl_sex --task dummy_mtl_concat --log_data --data_root_dir DATA_ROOT_DIR
```
The GPU to use for training can be specified using CUDA_VISIBLE_DEVICES; in the example command, GPU 0 is used. Other arguments such as --drop_out, --early_stopping, --lr, --reg, and --max_epochs can be specified to customize your experiments.

For information on each argument, see:
``` shell
python main_mtl_concat.py -h
```

By default, results will be saved to **results/exp_code** corresponding to the exp_code input argument from the user. If tensorboard logging is enabled (with the argument toggle --log_data), the user can go into the results folder for the particular experiment and run:
``` shell
tensorboard --logdir=.
```
Opening the address printed by tensorboard in a browser will show the logged training/validation statistics in real time.

### Evaluation
The user also has the option of using the evaluation script to test the performance of trained models. An example corresponding to the model trained above is provided below:
``` shell
CUDA_VISIBLE_DEVICES=0 python eval_mtl_concat.py --drop_out --k 1 --models_exp_code dummy_mtl_sex_s1 --save_exp_code dummy_mtl_sex_s1_eval --task dummy_mtl_concat --results_dir results --data_root_dir DATA_ROOT_DIR
```

For information on each command-line argument, see:
``` shell
python eval_mtl_concat.py -h
```

To test trained models on your own custom datasets, you can add them to **eval_mtl_concat.py** the same way as you would for **main_mtl_concat.py**.
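
A minimal sketch of what such an addition might look like; the task name, csv path, directory name and label dictionaries are hypothetical placeholders that mirror the 'dummy_mtl_concat' branch shown earlier (remember to also add the new task to the --task choices):
```python
# Hypothetical custom task, structured like the 'dummy_mtl_concat' example above.
if args.task == 'my_custom_task':
    args.n_classes = 2
    dataset = Generic_MIL_MTL_Dataset(csv_path='dataset_csv/my_dataset.csv',
                                      data_dir=os.path.join(args.data_root_dir, 'MY_DATASET_DIR'),
                                      shuffle=False,
                                      seed=args.seed,
                                      print_info=True,
                                      label_dicts=[{'Lung': 0, 'Breast': 1},
                                                   {'Primary': 0, 'Metastatic': 1},
                                                   {'F': 0, 'M': 1}],
                                      label_cols=['label', 'site', 'sex'],
                                      patient_strat=False)
```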

<img src="github_heatmap.jpg" width="1000px" align="center" />

## Issues
- Please report all issues on the public forum.

## License
© [Mahmood Lab](http://www.mahmoodlab.org) - This code is made available under the GPLv3 License and is available for non-commercial academic purposes.

## Reference
If you find our work useful in your research, or if you use parts of this code, please consider citing our paper:

Lu, M.Y., Chen, T.Y., Williamson, D.F.K. et al. AI-based pathology predicts origins for cancers of unknown primary. Nature 594, 106–110 (2021). https://doi.org/10.1038/s41586-021-03512-4

```
@article{lu2021ai,
  title={AI-based pathology predicts origins for cancers of unknown primary},
  author={Lu, Ming Y and Chen, Tiffany Y and Williamson, Drew FK and Zhao, Melissa and Shady, Maha and Lipkova, Jana and Mahmood, Faisal},
  journal={Nature},
  volume={594},
  number={7861},
  pages={106--110},
  year={2021},
  publisher={Nature Publishing Group}
}
```