# Case study: Analyzing TCGA HCC dataset

In this example, we use the RNA-Seq, miRNA, and DNA methylation datasets from the TCGA HCC cancer dataset to perform subtype detection, identify subtype-specific features, and fit a supervised model that we then use to project the HCC samples using only the RNA-Seq omic layer. This real dataset is available directly in the `data` folder of the package.

## Dataset preparation

First, locate the data folder and the compressed matrices:

```bash
data
├── meth.tsv.gz
├── mir.tsv.gz
├── rna.tsv.gz
└── survival.tsv
```

Go to that folder (**cd ./data/**) and decompress these files using `gzip -d *.gz`.
Then go back to the main folder (**cd ../**). We are now ready to set up the parameters of a DeepProg instance.

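If you would rather stay in Python than use the shell, the decompression step can be sketched with the standard library alone. The helper name `decompress_gz_files` is hypothetical (it is not part of DeepProg), and unlike `gzip -d` this sketch keeps the original `.gz` files in place.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gz_files(folder):
    """Decompress every *.gz file found in `folder`, keeping the originals.

    Hypothetical helper, not part of DeepProg.
    Returns the list of decompressed file paths.
    """
    written = []
    for gz_path in sorted(Path(folder).glob("*.gz")):
        # Drop the final suffix only: meth.tsv.gz -> meth.tsv
        out_path = gz_path.with_suffix("")
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(out_path)
    return written
```

Calling `decompress_gz_files("./data")` from the main folder should leave `meth.tsv`, `mir.tsv`, and `rna.tsv` next to their compressed counterparts.
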
```python
from simdeep.simdeep_boosting import SimDeepBoosting
from simdeep.config import PATH_THIS_FILE

from collections import OrderedDict
from os.path import isfile

# Specify your data path
path_data = './data/'

# Check that the decompressed matrices are present
assert isfile(path_data + "meth.tsv")
assert isfile(path_data + "rna.tsv")
assert isfile(path_data + "mir.tsv")

# One entry per omic layer: layer name -> file name inside path_data
tsv_files = OrderedDict([
    ('MIR', 'mir.tsv'),
    ('METH', 'meth.tsv'),
    ('RNA', 'rna.tsv'),
])

# The survival file is located in the same folder
survival_tsv = 'survival.tsv'

assert isfile(path_data + survival_tsv)

# More attributes
PROJECT_NAME = 'HCC_dataset'  # Project name
EPOCHS = 10  # Number of epochs used to fit each autoencoder
SEED = 10045  # Random seed
nb_it = 10  # Number of submodels to be fitted
nb_threads = 2  # Number of Python threads used to fit the survival model
```

We also need to specify which columns of the survival file to use:

```bash
head data/survival.tsv

Samples days    event
TCGA.2V.A95S.01 0       0
TCGA.2Y.A9GS.01 724     1
TCGA.2Y.A9GT.01 1624    1
TCGA.2Y.A9GU.01 1939    0
TCGA.2Y.A9GV.01 2532    1
TCGA.2Y.A9GW.01 1271    1
TCGA.2Y.A9GX.01 2442    0
TCGA.2Y.A9GY.01 757     1
TCGA.2Y.A9GZ.01 848     1
```

```python
# Map the roles expected by DeepProg to the column names of survival.tsv
survival_flag = {
    'patient_id': 'Samples',
    'survival': 'days',
    'event': 'event'}
```

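Before fitting, it can help to confirm that the column names given in `survival_flag` really appear in the header of the survival file. The following is a minimal sketch using only the standard library; `check_survival_columns` is a hypothetical helper (not a DeepProg function), and it assumes the file is tab-separated, as the `.tsv` extension suggests.

```python
import csv


def check_survival_columns(path, survival_flag, sep="\t"):
    """Return the column names from `survival_flag` missing in the header of `path`.

    Hypothetical helper, not part of DeepProg. Assumes a delimited text
    file whose first line is the header.
    """
    with open(path, newline="") as handle:
        header = next(csv.reader(handle, delimiter=sep))
    return [column for column in survival_flag.values() if column not in header]
```

With the file shown above, `check_survival_columns("data/survival.tsv", survival_flag)` is expected to return an empty list; any name it returns is a column that the survival file does not contain.
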
Now we define a Ray instance to distribute the fitting of the submodels:

```python
import ray

# `webui_host` is the argument name used by older Ray releases;
# more recent releases call it `dashboard_host`.
ray.init(webui_host='0.0.0.0', num_cpus=3)
```

## Model fitting

We are now ready to instantiate a DeepProg instance and fit a model:

```python
# Instantiate a DeepProg instance
boosting = SimDeepBoosting(
    nb_threads=nb_threads,
    nb_it=nb_it,
    split_n_fold=3,
    survival_tsv=survival_tsv,
    training_tsv=tsv_files,
    path_data=path_data,
    project_name=PROJECT_NAME,
    path_results=path_data,
    epochs=EPOCHS,
    survival_flag=survival_flag,
    distribute=True,
    seed=SEED)

boosting.fit()

# Predict the labels of the training samples
boosting.predict_labels_on_full_dataset()
boosting.compute_clusters_consistency_for_full_labels()
boosting.evalutate_cluster_performance()
boosting.collect_cindex_for_test_fold()
boosting.collect_cindex_for_full_dataset()

# Score the features of each cluster and write the scores to disk
boosting.compute_feature_scores_per_cluster()
boosting.write_feature_score_per_cluster()
```

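As their names suggest, the `collect_cindex_*` methods report a concordance index (C-index). As a reminder of what this metric measures, here is a pure-Python sketch of Harrell's C-index. It is an illustration under simplifying assumptions (pairs with tied survival times are ignored) and is not the implementation used by DeepProg; the function name and arguments are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data (illustrative sketch).

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk scores (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            # A censored sample cannot be the earlier member of a comparable pair
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means the predicted risks order every comparable pair correctly, 0.5 corresponds to random ordering, and 0.0 to a fully inverted ordering.
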
## Visualisation and analysis

We should obtain subtypes with significantly different survival, as shown by the Kaplan-Meier plot written to the results folder (`path_results`, set here to `path_data`):

![HCC KM plot](./img/HCC_dataset_KM_plot_boosting_full.png)

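Each curve in a Kaplan-Meier plot is the estimated survival function of one subtype. The sketch below shows how such a curve can be computed from the `days` and `event` columns of one group of samples. It is a pure-Python illustration with hypothetical names, not the plotting code used by DeepProg.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # samples leaving the risk set at each time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        if deaths[t]:
            # Multiply by the fraction of at-risk samples surviving past t
            survival *= 1.0 - deaths[t] / at_risk
            curve.append((t, survival))
        # Both events and censored samples leave the risk set after t
        at_risk -= leaving[t]
    return curve
```

Computing one such curve per predicted subtype and drawing them as step functions gives a plot of the same kind as the figure above.
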
Now we might want to project the training samples using only the RNA-Seq layer:

```python
boosting.load_new_test_dataset(
    {'RNA': 'rna.tsv'},
    'test_RNA_only',
    survival_tsv,
)

boosting.predict_labels_on_test_dataset()
boosting.compute_c_indexes_for_test_dataset()
boosting.compute_clusters_consistency_for_test_labels()
```

We can use the visualisation functions to project our samples into a 2D space:

```python
# Experimental method to plot the test dataset amongst the class kernel densities
boosting.plot_supervised_kernel_for_test_sets()
boosting.plot_supervised_predicted_labels_for_test_sets()
```

Results for unsupervised projection:

![Unsupervised KDE plot](./img/HCC_dataset_KM_plot_boosting_full_kde_unsupervised.png)

Results for supervised projection:

![Supervised KDE plot](./img/HCC_dataset_KM_plot_boosting_full_kde_supervised.png)