a b/data/TCGA_GBMLGG/README.md
1
# Reproducibility of "Pathomic Fusion: An Integrated Framework for Fusing Histopathology and Genomic Features for Cancer Diagnosis and Prognosis"
2
3
## Setup
4
### 1. Processed Dataset
5
Processed data can be downloaded from our [Google Drive](https://drive.google.com/drive/folders/1swiMrz84V3iuzk8x99vGIBd5FCVncOlf?usp=sharing).
6
7
The data directory structure for TCGA-GBM + TCGA-LGG validation is listed below. 
8
- **all_datasets.csv**: Contains survival time, censor status, IDH mutation status, and CNV data for 769 TCGA IDs.
9
- **grade_data.csv**: Contains age, gender, histologic grade and subtype data for 769 TCGA IDs.
10
- **mRNA_Expression_z-Scores_RNA_Seq_RSEM.txt**: Contains mRNAseq data for the TCGA-GBM project (obtained from the top differentially expressed genes from [cBioPortal](https://www.cbioportal.org/)).
11
- **mRNA_Expression_Zscores_RSEM.txt**: Contains mRNAseq data for the TCGA-LGG project (obtained from the top differentially expressed genes from [cBioPortal](https://www.cbioportal.org/)).
12
- **pnas_splits.csv**: Splits from [Mobadersany et al.](https://github.com/CancerDataScience/SCNN) used for 15-fold cross-validation.
13
- **all_st**: 1505 1024 X 1024 histology ROIs for the 769 TCGA IDs (Stain Normalized) used for training Histology CNN
14
- **all_st_cpc_img/pt_bi/**: Graph features for the 1505 histology ROIs used for training Histology GCN
15
- **all_st_patches_512**: 13545 512 X 512 patches (9 overlapping (stride = 256) patches extracted per image in all_st) used for testing Histology CNN, and training + testing Pathomic Fusion. Instead of random cropping, **all_st_patches_512** can be interpretted as fixed crops per image.
16
- **all_st_patches_512_cpc**: Graph features for the 13545 histology ROIs used for training Histology GCN. Since we did not need to use a patch-based strategy for training the GCN, these .pt files are .pt files duplicated from **all_st_cpc_img** to align the graph and image input before loading it in the PyTorch Dataset Loader.
17
- **splits**: Pickle files containing the data splits for 15-fold cross-validation. Depending on the task (grade vs. survival) or model being trained (CNN, GCN, SNN, Pathomic Fusion), missing data was excluded. In the pickle filename,  the string "all_st" vs. "all_st_patches_512" indicates that the genomic data was aligned with the 1024 X 1024 images in **all_st/all_st_cpc** or 512 X 512 images in **all_st_patches_512 / all_st_patches_512_cpc**. The ending string with pattern "INT_INT_INT_STR" indicates: 0/1 for if we should ignore patients with missing molecular subtype, 0/1 for if we should ignore patients with missing histology subtype, 0/1 for we should ignore patients with missing molecular subtype, 0/1 for if we should use extracted VGG19 embeddings from **all_st_patches_512** for Pathomic Fusion, and "rnaseq" for if we should use RNAseq. Additional details can be found in **make_splits.py**.
18
  
19
```bash
20
./
21
└── data
22
      └── TCGA_GBMLGG
23
            ├── all_datasets.csv
24
            ├── grade_data.csv
25
            ├── mRNA_Expression_z-Scores_RNA_Seq_RSEM.txt
26
            ├── mRNA_Expression_Zscores_RSEM.txt
27
            ├── pnas_splits.csv
28
            ├── gbmlgg
29
                  ├── all_st
30
                        ├── TCGA-02-0001-01Z-00-DX1.83fce43e-42ac-4dcd-b156-2908e75f2e47_1.png
31
                        ├── TCGA-02-0001-01Z-00-DX2.b521a862-280c-4251-ab54-5636f20605d0_1.png
32
                        ├── ...
33
                  ├── all_st_cpc
34
                        └── pt
35
                            ├── TCGA-02-0001-01Z-00-DX1.83fce43e-42ac-4dcd-b156-2908e75f2e47_1.pt 
36
                            ├── TCGA-02-0001-01Z-00-DX2.b521a862-280c-4251-ab54-5636f20605d0_1.pt
37
                            ├── ...
38
                  ├── all_st_patches_512
39
                        ├── TCGA-02-0001-01Z-00-DX1.83fce43e-42ac-4dcd-b156-2908e75f2e47_1_0_0.png
40
                        ├── TCGA-02-0001-01Z-00-DX1.83fce43e-42ac-4dcd-b156-2908e75f2e47_1_0_256.png
41
                        ├── ...
42
                  ├── all_st_patches_512_cpc
43
                        └── pt
44
                            ├── TCGA-02-0001-01Z-00-DX1.83fce43e-42ac-4dcd-b156-2908e75f2e47_1_0_0.pt 
45
                            ├── TCGA-02-0001-01Z-00-DX1.83fce43e-42ac-4dcd-b156-2908e75f2e47_1_0_256.pt 
46
                            ├── ...
47
            └── splits
48
                  ├── gbmlgg15cv_all_st_0_0_0.pkl
49
                  ├── gbmlgg15cv_all_st_0_1_0.pkl
50
                  ├── ...
51
      ├──  Other (Paired) Datasets :) 
52
```
53
54
### 2. Pretrained Models
55
All pretrained models and predictions can be downloaded from our [Google Drive](https://drive.google.com/drive/folders/1swiMrz84V3iuzk8x99vGIBd5FCVncOlf?usp=sharing), and are organized as follows below.
56
```bash
57
./
58
└── checkpoints
59
    ├── surv_15
60
        ├── path
61
            ├── path_1.pt
62
            ├── path_1_pred_train.pkl
63
            ├── path_1_pred_test.pkl
64
            ├── ...
65
        ├── ...
66
    └── grad_15
67
        ├── path
68
            ├── ...
69
        ├── ...
70
```
71
where "surv_15" and "grad_15" refers to the 15-fold cross-validation on Pathomic Fusion for survival outcome prediction and grade classification respectively.
72
73
### Training
74
Commands for training each model:
75
76
##### Histology CNN
77
```
78
python train_cv.py  --exp_name surv_15_rnaseq --task surv --mode path --model_name path --niter 0 --niter_decay 50 --batch_size 8 --lr 0.0005 --reg_type none --lambda_reg 0 --gpu_ids 0
79
python test_cv.py  --exp_name surv_15_rnaseq --task surv --mode path --model_name path --niter 0 --niter_decay 50 --batch_size 8 --lr 0.0005 --reg_type none --lambda_reg 0 --gpu_ids 0 --use_vgg_features 1
80
python train_cv.py  --exp_name grad_15 --task grad --mode path --model_name path --niter 0 --niter_decay 50 --batch_size 8 --lr 0.0005 --reg_type none --lambda_reg 0 --act LSM --label_dim 3 --gpu_ids 0
81
python test_cv.py  --exp_name grad_15 --task grad --mode path --model_name path --niter 0 --niter_decay 50 --batch_size 8 --lr 0.0005 --reg_type none --lambda_reg 0 --act LSM --label_dim 3 --gpu_ids 0 --use_vgg_features 1
82
```
83
84
##### Histology GCN
85
```
86
python train_cv.py  --exp_name surv_15_rnaseq --task surv --mode graph --model_name graph --niter 0 --niter_decay 50 --lr 0.002 --init_type max --reg_type none --lambda_reg 0 -use_vgg_features 1 --gpu_ids 0
87
python train_cv.py  --exp_name grad_15 --task grad --mode graph --model_name graph --niter 0 --niter_decay 50 --lr 0.002 --init_type max --reg_type none --lambda_reg 0 -use_vgg_features 1 --act LSM --label_dim 3 --gpu_ids 0
88
```
89
90
##### Genomic SNN
91
```
92
python train_cv.py --exp_name surv_15_rnaseq --task surv --mode omic --model_name omic --niter 0 --niter_decay 50 --batch_size 64 --reg_type all --init_type max --lr 0.002 --weight_decay 5e-4 --gpu_ids 0 --use_rnaseq 1 --input_size_omic 320 --verbose 1
93
python train_cv.py --exp_name grad_15 --task grad --mode omic --model_name omic --niter 0 --niter_decay 50 --batch_size 64 --reg_type all --init_type max --lr 0.002 --weight_decay 5e-4 --act LSM --label_dim 3 --gpu_ids 0
94
```
95
96
##### Pathomic Fusion (CNN+SNN)
97
```
98
python train_cv.py --exp_name surv_15_rnaseq --task surv --mode pathomic --model_name pathomic_fusion --niter 10 --niter_decay 20 --lr 0.0001 --beta1 0.5 --fusion_type pofusion --mmhid 64 --use_bilinear 1 --use_vgg_features 1 --gpu_ids 0 --omic_gate 0 --use_rnaseq 1 --input_size_omic 320
99
python train_cv.py --exp_name grad_15 --task grad --mode pathomic --model_name pathomic_fusion --niter 10 --niter_decay 20 --lr 0.0001 --beta1 0.5 --fusion_type pofusion --mmhid 64 --use_bilinear 1 --use_vgg_features 1 --gpu_ids 0 --path_gate 0 --omic_scale 2 --act LSM --label_dim 3
100
```
101
102
##### Pathomic Fusion (GCN+SNN)
103
```
104
python train_cv.py --exp_name surv_15_rnaseq --task surv --mode graphomic --model_name graphomic_fusion --niter 10 --niter_decay 20 --lr 0.0001 --beta1 0.5 --fusion_type pofusion --mmhid 64 --use_bilinear 1 --use_vgg_features 1 --gpu_ids 0 --omic_gate 0 --grph_scale 2 --use_rnaseq 1 --input_size_omic 320
105
python train_cv.py --exp_name grad_15 --task grad --mode graphomic --model_name graphomic_fusion --niter 10 --niter_decay 20 --lr 0.0001 --beta1 0.5 --fusion_type pofusion --mmhid 64 --use_bilinear 1 --use_vgg_features 1 --gpu_ids 0 --grph_gate 0 --omic_scale 2 --act LSM --label_dim 3
106
```
107
108
##### Pathomic Fusion (CNN+GCN+SNN)
109
```
110
python train_cv.py --exp_name surv_15_rnaseq --task surv --mode pathgraphomic --model_name pathgraphomic_fusion --niter 10 --niter_decay 20 --lr 0.0001 --beta1 0.5 --fusion_type pofusion_A --mmhid 64 --use_bilinear 1 --use_vgg_features 1 --gpu_ids 0 --omic_gate 0 --grph_scale 2 --use_rnaseq 1 --input_size_omic 320
111
python train_cv.py --exp_name grad_15 --task grad --mode pathgraphomic --model_name pathgraphomic_fusion --niter 10 --niter_decay 20 --lr 0.0001 --beta1 0.5 --fusion_type pofusion_B --mmhid 64 --use_bilinear 1 --use_vgg_features 1 --gpu_ids 0 --path_gate 0 --act LSM --label_dim 3
112
```
113
114
115
### Working with the Raw TCGA Data
116
 Raw histology region-of-interests for the TCGA-GBM and TCGA-LGG projects can be downloaded from [Mobadersany et al.](https://github.com/CancerDataScience/SCNN). For stain normalization, we used a python implementation of Sparse Stain Normalization from [Vahdane et al.](https://github.com/abhishekvahadane/CodeRelease_ColorNormalization) implemented in [StainTools](https://github.com/Peter554/StainTools). 
117
 
118
### Issues
119
- Please open new threads or report issues to richardchen@g.harvard.edu.
120
121
## License
122
This project is licensed under the GNU GPLv3 License - see the [LICENSE.md](LICENSE.md) file for details
123
124
## Acknowledgments
125
- This code is inspired by [SALMON](https://github.com/huangzhii/SALMON), [pytorch-CycleGAN-and-pix2pix](https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix), and [SCNN](https://github.com/CancerDataScience/SCNN).
126
* Subsidized computing resources were provided by Nvidia and Google Cloud.
127
128
## Reference
129
If you find our work useful in your research, please consider citing our paper at:
130
```
131
@article{chen2020pathomic,
132
  title={Pathomic Fusion: An Integrated Framework for Fusing Histopathology and Genomic Features for Cancer Diagnosis and Prognosis},
133
  author={Chen, Richard J and Lu, Ming Y and Wang, Jingwen and Williamson, Drew FK and Rodig, Scott J and Lindeman, Neal I and Mahmood, Faisal},
134
  journal={IEEE Transactions on Medical Imaging},
135
  year={2020},
136
  publisher={IEEE}
137
}
138
```
139
© Mahmood Lab - This code is made available under the GPLv3 License and is available for non-commercial academic purposes.