|
# Tutorial: Simple DeepProg model
|
|
The principle of DeepProg can be summarized as follows:
|
|
* Loading of multiple samples x OMIC matrices
* Preprocessing, normalisation, and sub-sampling of the input matrices
* Matrix transformation using an autoencoder
* Detection of survival features
* Survival feature agglomeration and clustering
* Creation of supervised models to predict the output of new samples
|
|
## Input parameters
|
|
All the default parameters are defined in the config file `./simdeep/config.py` but can be passed dynamically. Three types of parameters must be defined:

* The training dataset (omics + survival input files)
  * In addition, the parameters of the test set, i.e. the omic dataset and the survival file
* The parameters of the autoencoder (the default parameters work but can be fine-tuned)
* The parameters of the classification procedures (the defaults are usually good)
|
|
## Input matrices
|
|
As examples, we included two datasets:

* A dummy example dataset in the `example/data/` folder:
|
|
```bash
examples
├── data
│   ├── meth_dummy.tsv
│   ├── mir_dummy.tsv
│   ├── rna_dummy.tsv
│   ├── rna_test_dummy.tsv
│   ├── survival_dummy.tsv
│   └── survival_test_dummy.tsv
```
|
|
* And a real dataset in the `data` folder. This dataset derives from the TCGA HCC cancer dataset and needs to be decompressed before processing:
|
|
```bash
data
├── meth.tsv.gz
├── mir.tsv.gz
├── rna.tsv.gz
└── survival.tsv
```
|
|
An input matrix file should follow this format:
|
|
```bash
head mir_dummy.tsv

Samples dummy_mir_0 dummy_mir_1 dummy_mir_2 dummy_mir_3 ...
sample_test_0 0.469656032287 0.347987447237 0.706633335508 0.440068758445 ...
sample_test_1 0.0453108219657 0.0234642968791 0.593393816691 0.981872970341 ...
sample_test_2 0.908784043793 0.854397550009 0.575879144667 0.553333958713 ...
...
```
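Such a matrix is a plain tab-separated file with sample IDs in the first column and feature names in the header row. A minimal sketch of parsing this format with the Python standard library (illustrative only; DeepProg's own loader is shown later in this tutorial, and the in-memory string stands in for a real file):

```python
import csv
from io import StringIO

# In-memory stand-in for a file such as mir_dummy.tsv (values truncated)
tsv = (
    "Samples\tdummy_mir_0\tdummy_mir_1\n"
    "sample_test_0\t0.4696\t0.3479\n"
    "sample_test_1\t0.0453\t0.0234\n"
)

reader = csv.reader(StringIO(tsv), delimiter="\t")
header = next(reader)
features = header[1:]  # feature names come from the header row
# Map each sample ID (first column) to its vector of float values
matrix = {row[0]: [float(v) for v in row[1:]] for row in reader}

print(features)        # ['dummy_mir_0', 'dummy_mir_1']
print(sorted(matrix))  # ['sample_test_0', 'sample_test_1']
```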
|
|
Also, if multiple matrices are used as input, they must keep the same sample order. For example:
|
|
```bash
head rna_dummy.tsv

Samples dummy_gene_0 dummy_gene_1 dummy_gene_2 dummy_gene_3 ...
sample_test_0 0.69656032287 0.47987447237 0.06633335508 0.40068758445 ...
sample_test_1 0.53108219657 0.234642968791 0.93393816691 0.81872970341 ...
sample_test_2 0.8784043793 0.54397550009 0.75879144667 0.53333958713 ...
...
```
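Because the matrices must share the same sample order, a quick sanity check before training is to compare the first column of each file. The helper below is a sketch and not part of the DeepProg API; the in-memory strings stand in for the real TSV files:

```python
def sample_order(tsv_text):
    """Return the sample IDs (first column), skipping the header row."""
    return [line.split("\t")[0] for line in tsv_text.strip().splitlines()[1:]]

# In-memory stand-ins for two omic matrices (values truncated)
rna = "Samples\tdummy_gene_0\nsample_test_0\t0.69\nsample_test_1\t0.53\n"
mir = "Samples\tdummy_mir_0\nsample_test_0\t0.46\nsample_test_1\t0.04\n"

# The orders match, so the two matrices can be used together
print(sample_order(rna) == sample_order(mir))  # True
```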
|
|
The arguments `training_tsv` and `path_data` from the `extract_data` module are used to define the input matrices.
|
|
```python
# The keys/values of this dict represent the name of the omic and the corresponding input matrix
training_tsv = {
    'GE': 'rna_dummy.tsv',
    'MIR': 'mir_dummy.tsv',
    'METH': 'meth_dummy.tsv',
}
```
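Each file name in `training_tsv` is looked up relative to `path_data`. A sketch of how the two combine into full paths (illustrative only, not the actual `LoadData` internals):

```python
import os

path_data = './examples/data/'
training_tsv = {
    'GE': 'rna_dummy.tsv',
    'MIR': 'mir_dummy.tsv',
    'METH': 'meth_dummy.tsv',
}

# Resolve each omic name to the full path of its input matrix
full_paths = {omic: os.path.join(path_data, fname)
              for omic, fname in training_tsv.items()}

print(full_paths['GE'])  # ./examples/data/rna_dummy.tsv
```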
|
|
A survival file must have this format:
|
|
```bash
head survival_dummy.tsv

barcode days recurrence
sample_test_0 134 1
sample_test_1 291 0
sample_test_2 125 1
sample_test_3 43 0
...
```
|
|
In addition, the fields corresponding to the patient IDs, the survival time, and the event should be defined using the `survival_flag` argument:
|
|
```python
# Default value
survival_flag = {
    'patient_id': 'barcode',
    'survival': 'days',
    'event': 'recurrence',
}
```
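The flags simply name which columns of the survival file hold each piece of information. A sketch of the mapping applied to one row of the survival file shown above (illustrative only; DeepProg performs this lookup internally):

```python
survival_flag = {'patient_id': 'barcode', 'survival': 'days', 'event': 'recurrence'}

# One parsed row of the survival file, keyed by column name
row = {'barcode': 'sample_test_0', 'days': '134', 'recurrence': '1'}

# The flags translate column names into the three quantities DeepProg needs
patient = row[survival_flag['patient_id']]
time = float(row[survival_flag['survival']])
event = int(row[survival_flag['event']])

print(patient, time, event)  # sample_test_0 134.0 1
```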
|
|
## Creating a simple DeepProg model with one autoencoder for each omic
|
|
First, we will build a model using the example dataset from `./examples/data/` (these example files are set as defaults in the `config.py` file). We will use them to show how to construct a single DeepProg model inferring an autoencoder for each omic.
|
|
```python
# The SimDeep class can be used to build one model with one autoencoder for each omic
from simdeep.simdeep_analysis import SimDeep
from simdeep.extract_data import LoadData

help(SimDeep)   # to see all the functions
help(LoadData)  # to see all the functions related to loading datasets

# Defining training datasets
from simdeep.config import TRAINING_TSV
from simdeep.config import SURVIVAL_TSV
# Location of the input matrices and survival file
from simdeep.config import PATH_DATA

dataset = LoadData(training_tsv=TRAINING_TSV,
                   survival_tsv=SURVIVAL_TSV,
                   path_data=PATH_DATA)

# Defining the result path in which the output folder will be created
PATH_RESULTS = "./TEST_DUMMY/"

# Instantiate the model with the dummy example training dataset defined in the config file
simDeep = SimDeep(
    dataset=dataset,
    path_results=PATH_RESULTS,
    path_to_save_model=PATH_RESULTS,  # This result path can be used to save the autoencoder
)

simDeep.load_training_dataset()  # load the training dataset
simDeep.fit()                    # fit the model
```
|
|
At this point, the model is fitted and some output files are available in the output folder:
|
|
```bash
TEST_DUMMY
├── test_dummy_dataset_KM_plot_training_dataset.png
└── test_dummy_dataset_training_set_labels.tsv
```
|
|
The tsv file contains the label and the label probability for each sample:
|
|
```bash
sample_test_0 1 7.22678272919e-12
sample_test_1 1 4.48594196888e-09
sample_test_4 1 1.53363205571e-06
sample_test_5 1 6.72170409655e-08
sample_test_6 0 0.9996581662
sample_test_7 1 3.38139255666e-08
```
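Since the label file is a plain TSV, downstream scripts can consume it directly, for example to count how many samples were assigned to each cluster. A small illustrative sketch, using an in-memory stand-in for the file:

```python
from collections import Counter

# In-memory stand-in for test_dummy_dataset_training_set_labels.tsv (truncated)
labels_tsv = (
    "sample_test_0\t1\t7.22678272919e-12\n"
    "sample_test_6\t0\t0.9996581662\n"
    "sample_test_7\t1\t3.38139255666e-08\n"
)

# The second column holds the cluster label; count samples per cluster
counts = Counter(line.split("\t")[1] for line in labels_tsv.strip().splitlines())
print(dict(counts))  # {'1': 2, '0': 1}
```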
|
|
And we also have the visualisation of a Kaplan-Meier curve:

![KM plot](./img/km_plot_training_dataset.png)
|
|
Now we are ready to use a test dataset and to infer the class labels for the test samples.
The test dataset does not need to have the same input omic matrices as the training dataset, nor even the same features for a given omic. However, it needs to have at least some features in common.
|
|
```python
# Defining test datasets
from simdeep.config import TEST_TSV
from simdeep.config import SURVIVAL_TSV_TEST

simDeep.load_new_test_dataset(
    TEST_TSV,
    fname_key='dummy',
    path_survival_file=SURVIVAL_TSV_TEST,  # [OPTIONAL] test survival file, useful to compute the accuracy on the test dataset
)

# The test set is a dummy rna expression (generated randomly)
print(simDeep.dataset.test_tsv)  # Defined in the config file
# The data type of the test set is also defined to match an existing type
print(simDeep.dataset.data_type)  # Defined in the config file
simDeep.predict_labels_on_test_dataset()  # Perform the classification analysis and label the test dataset

print(simDeep.test_labels)
print(simDeep.test_labels_proba)
```
|
|
The assigned class and class probabilities for the test samples are now available in the output folder:
|
|
```bash
TEST_DUMMY
├── test_dummy_dataset_dummy_KM_plot_test.png
├── test_dummy_dataset_dummy_test_labels.tsv
├── test_dummy_dataset_KM_plot_training_dataset.png
└── test_dummy_dataset_training_set_labels.tsv

head test_dummy_dataset_training_set_labels.tsv
```
|
|
And a KM plot is also constructed using the test labels:

![KM plot test](./img/km_plot_test_dataset.png)
|
|
Finally, it is possible to save the Keras model:

```python
simDeep.save_encoders('dummy_encoder.h5')
```