# Tutorial: Simple DeepProg model

The principle of DeepProg can be summarized as follows:

* Loading of multiple samples x OMIC matrices
* Preprocessing, normalisation, and sub-sampling of the input matrices
* Matrix transformation using autoencoders
* Detection of survival features
* Survival feature agglomeration and clustering
* Creation of supervised models to predict the output of new samples

## Input parameters

All the default parameters are defined in the config file `./simdeep/config.py` but can be passed dynamically. Three types of parameters must be defined:

  * The training dataset (omics + survival input files)
    * In addition, the parameters of the test set, i.e. the omic dataset and the survival file
  * The parameters of the autoencoder (the default parameters work, but they may need fine-tuning)
  * The parameters of the classification procedures (the defaults are usually good)
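
For instance, the current defaults can be inspected directly from the config module before overriding them (a minimal sketch; only constants that appear later in this tutorial are used):

```python
# Inspect the defaults defined in ./simdeep/config.py
from simdeep import config

print(config.TRAINING_TSV)  # default training matrices (omic name -> file name)
print(config.SURVIVAL_TSV)  # default survival file
print(config.PATH_DATA)     # default folder containing the input files
```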

## Input matrices

As examples, we included two datasets:

* A dummy example dataset in the `examples/data/` folder:

```bash
examples
├── data
│   ├── meth_dummy.tsv
│   ├── mir_dummy.tsv
│   ├── rna_dummy.tsv
│   ├── rna_test_dummy.tsv
│   ├── survival_dummy.tsv
│   └── survival_test_dummy.tsv
```

* And a real dataset in the `data` folder. This dataset derives from the TCGA HCC (hepatocellular carcinoma) dataset and needs to be decompressed before processing (e.g. with `gzip -d data/*.gz`):

```bash
data
├── meth.tsv.gz
├── mir.tsv.gz
├── rna.tsv.gz
└── survival.tsv
```

An input matrix file should follow this format:

```bash
head mir_dummy.tsv

Samples        dummy_mir_0     dummy_mir_1     dummy_mir_2     dummy_mir_3 ...
sample_test_0  0.469656032287  0.347987447237  0.706633335508  0.440068758445 ...
sample_test_1  0.0453108219657 0.0234642968791 0.593393816691  0.981872970341 ...
sample_test_2  0.908784043793  0.854397550009  0.575879144667  0.553333958713 ...
...
```

Also, if multiple matrices are used as input, they must keep the same sample order. For example:

```bash
head rna_dummy.tsv

Samples        dummy_gene_0     dummy_gene_1     dummy_gene_2     dummy_gene_3 ...
sample_test_0  0.69656032287  0.47987447237  0.06633335508  0.40068758445 ...
sample_test_1  0.53108219657 0.234642968791 0.93393816691  0.81872970341 ...
sample_test_2  0.8784043793  0.54397550009  0.75879144667  0.53333958713 ...
...
```
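
A quick way to verify this, outside of DeepProg, is to compare the sample indexes directly, for instance with pandas (a minimal sketch, assuming the dummy matrices from `examples/data/` and tab-separated files with samples as the first column):

```python
import pandas as pd

# Location of the dummy matrices (adjust to your own data)
path_data = "examples/data/"

rna = pd.read_csv(path_data + "rna_dummy.tsv", sep="\t", index_col=0)
mir = pd.read_csv(path_data + "mir_dummy.tsv", sep="\t", index_col=0)
meth = pd.read_csv(path_data + "meth_dummy.tsv", sep="\t", index_col=0)

# All training matrices should list the samples in the same order
assert list(rna.index) == list(mir.index) == list(meth.index), \
    "Sample order differs between the input matrices"
```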

The arguments `training_tsv` and `path_data` from the `extract_data` module are used to define the input matrices.

```python
# The keys/values of this dict represent the name of the omic and the corresponding input matrix
training_tsv = {
    'GE': 'rna_dummy.tsv',
    'MIR': 'mir_dummy.tsv',
    'METH': 'meth_dummy.tsv',
}
```
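
These arguments can then be handed to `LoadData` together with the survival file (a short sketch mirroring the full example below; `path_data` and the dummy survival file name refer to `examples/data/`):

```python
from simdeep.extract_data import LoadData

# Folder containing the matrices listed in training_tsv
path_data = "examples/data/"

dataset = LoadData(training_tsv=training_tsv,
                   survival_tsv="survival_dummy.tsv",
                   path_data=path_data)
```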

A survival file must have this format:

```bash
head survival_dummy.tsv

barcode        days recurrence
sample_test_0  134  1
sample_test_1  291  0
sample_test_2  125  1
sample_test_3  43   0
...
```

In addition, the fields corresponding to the patient IDs, the survival time, and the event should be defined using the `survival_flag` argument:

```python
# Default value
survival_flag = {'patient_id': 'barcode',
                 'survival': 'days',
                 'event': 'recurrence'}
```
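
If your survival file uses different column names, the same dictionary can be adapted and passed when loading the data (a hypothetical sketch: the column names and folder below are made up, and `survival_flag` is assumed to be accepted by `LoadData`, as suggested by `./simdeep/config.py`):

```python
from simdeep.extract_data import LoadData

# Hypothetical column names for a survival file with a different header
survival_flag = {'patient_id': 'sample_id',
                 'survival': 'overall_survival_days',
                 'event': 'death_event'}

dataset = LoadData(training_tsv=training_tsv,
                   survival_tsv='survival.tsv',      # hypothetical file
                   path_data='./my_data/',           # hypothetical folder
                   survival_flag=survival_flag)      # assumed keyword argument
```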

## Creating a simple DeepProg model with one autoencoder for each omic

First, we will build a model using the example dataset from `./examples/data/` (these example files are set as the defaults in the `config.py` file). We will use them to show how to construct a single DeepProg model inferring one autoencoder for each omic.

```python
# The SimDeep class can be used to build one model with one autoencoder for each omic
from simdeep.simdeep_analysis import SimDeep
from simdeep.extract_data import LoadData

help(SimDeep) # to see all the functions
help(LoadData) # to see all the functions related to loading datasets

# Defining the training datasets
from simdeep.config import TRAINING_TSV
from simdeep.config import SURVIVAL_TSV
# Location of the input matrices and survival file
from simdeep.config import PATH_DATA

dataset = LoadData(training_tsv=TRAINING_TSV,
                   survival_tsv=SURVIVAL_TSV,
                   path_data=PATH_DATA)

# Defining the result path in which the output folder will be created
PATH_RESULTS = "./TEST_DUMMY/"

# Instantiate the model with the dummy example training dataset defined in the config file
simDeep = SimDeep(
    dataset=dataset,
    path_results=PATH_RESULTS,
    path_to_save_model=PATH_RESULTS, # This result path can also be used to save the autoencoder
)

simDeep.load_training_dataset() # load the training dataset
simDeep.fit() # fit the model
```

At that point, the model is fitted and some output files are available in the output folder:

```bash
TEST_DUMMY
├── test_dummy_dataset_KM_plot_training_dataset.png
└── test_dummy_dataset_training_set_labels.tsv
```

The tsv file contains the label and the label probability for each sample:

```bash
sample_test_0   1       7.22678272919e-12
sample_test_1   1       4.48594196888e-09
sample_test_4   1       1.53363205571e-06
sample_test_5   1       6.72170409655e-08
sample_test_6   0       0.9996581662
sample_test_7   1       3.38139255666e-08
```
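
For downstream analyses, this file can be read back, for instance with pandas (a minimal sketch; the column names are only illustrative, since the file itself has no header row in the excerpt above):

```python
import pandas as pd

# Illustrative column names for the training_set_labels.tsv file
labels = pd.read_csv("TEST_DUMMY/test_dummy_dataset_training_set_labels.tsv",
                     sep="\t",
                     header=None,
                     names=["sample", "label", "label_proba"],
                     index_col=0)

# Number of samples assigned to each label
print(labels["label"].value_counts())
```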

We also obtain a visualisation of the Kaplan-Meier curve:

![KM plot](./img/test_dummy_dataset_KM_plot_training_dataset.png)

Now we are ready to use a test dataset and to infer the class labels of the test samples.
The test dataset does not need to have the same input omic matrices as the training dataset, nor even the same features for a given omic. However, it needs to have at least some features in common.
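
The overlap can be checked by hand before loading the test set, for instance with pandas on the dummy RNA matrices (a minimal sketch outside of DeepProg):

```python
import pandas as pd

path_data = "examples/data/"  # adjust to your own data

train_rna = pd.read_csv(path_data + "rna_dummy.tsv", sep="\t", index_col=0)
test_rna = pd.read_csv(path_data + "rna_test_dummy.tsv", sep="\t", index_col=0)

# Features (columns) shared between the training and test RNA matrices
common_features = set(train_rna.columns) & set(test_rna.columns)
print("{0} features in common".format(len(common_features)))
```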

```python
# Defining the test datasets
from simdeep.config import TEST_TSV
from simdeep.config import SURVIVAL_TSV_TEST

simDeep.load_new_test_dataset(
    TEST_TSV,
    fname_key='dummy',
    path_survival_file=SURVIVAL_TSV_TEST, # [OPTIONAL] test survival file, useful to compute the accuracy on the test dataset
)

# The test set is a dummy RNA expression dataset (generated randomly)
print(simDeep.dataset.test_tsv) # Defined in the config file
# The data type of the test set is also defined to match an existing type
print(simDeep.dataset.data_type) # Defined in the config file
simDeep.predict_labels_on_test_dataset() # Perform the classification analysis and label the test dataset

print(simDeep.test_labels)
print(simDeep.test_labels_proba)
```
191
192
The assigned class and class probabilities for the test samples are now available in the output folder:
193
194
```bash
195
TEST_DUMMY
196
├── test_dummy_dataset_dummy_KM_plot_test.png
197
├── test_dummy_dataset_dummy_test_labels.tsv
198
├── test_dummy_dataset_KM_plot_training_dataset.png
199
└── test_dummy_dataset_training_set_labels.tsv
200
201
head test_dummy_dataset_training_set_labels.tsv
202
203
204
205
```

A KM plot is also constructed using the test labels:

![KM plot test](./img/test_dummy_dataset_dummy_KM_plot_test.png)

Finally, it is possible to save the Keras models (the encoders):

```python
simDeep.save_encoders('dummy_encoder.h5')
```
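
If the saved file is a standard Keras HDF5 model (an assumption; check the files actually written under `path_to_save_model`), it can presumably be reloaded with Keras directly:

```python
from keras.models import load_model  # or tensorflow.keras.models, depending on your install

# Assumes 'dummy_encoder.h5' was written as a standard Keras HDF5 model;
# adjust the path to wherever path_to_save_model points (here ./TEST_DUMMY/)
encoder = load_model('dummy_encoder.h5')
encoder.summary()
```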