# Tutorial: Ensemble of DeepProg models

In this second tutorial, we will build a more complex DeepProg model consisting of an ensemble of sub-models, each trained on a subset of the data. For that purpose, we need to use the `SimDeepBoosting` class:

```python
from simdeep.simdeep_boosting import SimDeepBoosting

help(SimDeepBoosting)
```

Similarly to the `SimDeep` class, we define our training dataset:

```python
# Location of the input matrices and survival file
from simdeep.config import PATH_DATA

from collections import OrderedDict

# Example tsv files
tsv_files = OrderedDict([
          ('MIR', 'mir_dummy.tsv'),
          ('METH', 'meth_dummy.tsv'),
          ('RNA', 'rna_dummy.tsv'),
])

# File with survival event
survival_tsv = 'survival_dummy.tsv'
```

## Instantiation

Then, we define the arguments specific to DeepProg and instantiate the class:

```python
project_name = 'stacked_TestProject'
epochs = 10 # Autoencoder epochs. Other hyperparameters can be fine-tuned. See the example files
seed = 3 # random seed used for reproducibility
nb_it = 5 # Number of sub-models to fit, each using only a subset of the training data
nb_threads = 2 # Number of threads used to compute the survival functions
PATH_RESULTS = "./"

boosting = SimDeepBoosting(
    nb_threads=nb_threads,
    nb_it=nb_it,
    split_n_fold=3,
    survival_tsv=survival_tsv,
    training_tsv=tsv_files,
    path_data=PATH_DATA,
    project_name=project_name,
    path_results=PATH_RESULTS,
    epochs=epochs,
    seed=seed)
```

Here, we define a DeepProg model that will create 5 SimDeep instances, each based on a subset of the original training dataset. The number of instances is defined by the `nb_it` argument. Other arguments related to the autoencoder construction, such as `epochs`, can be defined during the class instantiation.

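For illustration, here is a hedged sketch passing two extra hyperparameters at instantiation; the parameter names `nb_clusters` and `cluster_method` are assumptions taken from the DeepProg example scripts and should be verified with `help(SimDeepBoosting)`:

```python
# A sketch only: `nb_clusters` and `cluster_method` are assumed parameter
# names; verify them with help(SimDeepBoosting) before use.
boosting = SimDeepBoosting(
    nb_threads=nb_threads,
    nb_it=nb_it,
    split_n_fold=3,
    survival_tsv=survival_tsv,
    training_tsv=tsv_files,
    path_data=PATH_DATA,
    project_name=project_name,
    path_results=PATH_RESULTS,
    epochs=epochs,
    seed=seed,
    nb_clusters=2,             # number of inferred subtypes (assumed name)
    cluster_method='mixture')  # clustering of hidden activations (assumed name)
```
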
## Fitting

Once the model is defined, we can fit it:

```python
# Fit the model
boosting.fit()
# Predict and write the labels
boosting.predict_labels_on_full_dataset()
```

Some output files are generated in the output folder:

```bash
stacked_TestProject
├── stacked_TestProject_full_labels.tsv
├── stacked_TestProject_KM_plot_boosting_full.png
├── stacked_TestProject_proba_KM_plot_boosting_full.png
├── stacked_TestProject_test_fold_labels.tsv
└── stacked_TestProject_training_set_labels.tsv
```

The inferred labels, label probabilities, survival times, and censoring events are written to the `stacked_TestProject_full_labels.tsv` file:

```bash
sample_test_48  1       0.474781026865  332.0   1.0
sample_test_49  1       0.142554926379  120.0   0.0
sample_test_46  1       0.355333486034  355.0   1.0
sample_test_47  0       0.618825352398  48.0    1.0
sample_test_44  1       0.346797097671  179.0   0.0
sample_test_45  1       0.0254692404734 360.0   0.0
sample_test_42  1       0.441997226254  346.0   1.0
sample_test_43  1       0.0783603292911 335.0   0.0
sample_test_40  1       0.380182410315  149.0   0.0
sample_test_41  0       0.953659261728  155.0   1.0
```

Note that the label probability corresponds to the probability of belonging to the subtype with the lowest survival rate.

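For downstream analyses, this file can be loaded with pandas. A minimal sketch, assuming the file is headerless with the five columns shown above; the column names are ours:

```python
import pandas as pd

# Load the inferred labels; column names are assumptions based on the
# description above (sample ID, cluster label, label probability,
# survival time, censoring event).
labels = pd.read_csv(
    'stacked_TestProject/stacked_TestProject_full_labels.tsv',
    sep='\t', header=None,
    names=['sample', 'label', 'proba', 'days', 'event'])

print(labels['label'].value_counts())  # samples per inferred subtype
```
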
Two KM plots are also generated, one using the cluster labels:
![KM plot 3](./img/stacked_TestProject_KM_plot_boosting_full.png)
and one using the dichotomized cluster label probabilities:
![KM plot 4](./img/stacked_TestProject_proba_KM_plot_boosting_full.png)
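
DeepProg generates this plot itself; the sketch below only illustrates what "dichotomized" means here, splitting samples at a probability threshold of 0.5 (our assumption) and drawing the two Kaplan-Meier curves with lifelines, reusing the `labels` DataFrame loaded above:

```python
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

# Dichotomize the label probability at 0.5 (the threshold is an assumption)
labels['high_risk'] = labels['proba'] >= 0.5

ax = plt.subplot(111)
kmf = KaplanMeierFitter()
for is_high, group in labels.groupby('high_risk'):
    kmf.fit(group['days'], group['event'],
            label='high risk' if is_high else 'low risk')
    kmf.plot_survival_function(ax=ax)
plt.savefig('km_proba_dichotomized.png')
```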
We can also compute the feature importance per cluster:
```python
# Compute the feature importance
boosting.compute_feature_scores_per_cluster()
# Write the feature importance
boosting.write_feature_score_per_cluster()
```

The results are updated in the output folder:

```bash
stacked_TestProject
├── stacked_TestProject_features_anticorrelated_scores_per_clusters.tsv
├── stacked_TestProject_features_scores_per_clusters.tsv
├── stacked_TestProject_full_labels.tsv
├── stacked_TestProject_KM_plot_boosting_full.png
├── stacked_TestProject_proba_KM_plot_boosting_full.png
├── stacked_TestProject_test_fold_labels.tsv
└── stacked_TestProject_training_set_labels.tsv
```
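
Both feature score files are plain TSV tables, so they can be inspected with pandas. A minimal sketch; the exact column layout is not shown in this tutorial, so check the header and adjust names before ranking features:

```python
import pandas as pd

# Load the per-cluster feature scores; the column layout depends on the
# file header, so inspect it before any downstream ranking.
scores = pd.read_csv(
    'stacked_TestProject/stacked_TestProject_features_scores_per_clusters.tsv',
    sep='\t')
print(scores.head())
```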

## Evaluate the models

DeepProg can compute specific metrics relative to the ensemble of models:
128
129
```python
# Compute internal metrics
boosting.compute_clusters_consistency_for_full_labels()

# Collect c-index
boosting.compute_c_indexes_for_full_dataset()
# Evaluate cluster performance
boosting.evalutate_cluster_performance()
# Collect more c-indexes
boosting.collect_cindex_for_test_fold()
boosting.collect_cindex_for_full_dataset()
boosting.collect_cindex_for_training_dataset()

# Average number of significant features per omic across the sub-models
boosting.collect_number_of_features_per_omic()
```
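
The c-index reported by these functions measures how well the inferred labels rank the observed survival times (0.5 is random, 1.0 is perfect concordance). For intuition only, a hedged sketch computing a c-index by hand with lifelines, reusing the `labels` DataFrame loaded earlier and using the label probability as a risk score:

```python
from lifelines.utils import concordance_index

# A higher predicted score should mean longer survival, so negate the
# probability of belonging to the low-survival subtype.
c_index = concordance_index(labels['days'], -labels['proba'], labels['event'])
print(c_index)
```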

## Predicting on a test dataset

We can then load and evaluate a first test dataset:
```python
boosting.load_new_test_dataset(
    {'RNA': 'rna_dummy.tsv'}, # OMIC file of the test set. It doesn't have to be the same as for training
    'TEST_DATA_1', # Name of the test dataset
    'survival_dummy.tsv', # [OPTIONAL] Survival file of the test set. Useful to compute accuracy metrics on the test dataset
)

# Predict the labels on the test dataset
boosting.predict_labels_on_test_dataset()
# Compute C-index
boosting.compute_c_indexes_for_test_dataset()
# See cluster consistency
boosting.compute_clusters_consistency_for_test_labels()
```

We can load and evaluate a second test dataset:

```python
boosting.load_new_test_dataset(
    {'MIR': 'mir_dummy.tsv'}, # OMIC file of the test set. It doesn't have to be the same as for training
    'TEST_DATA_2', # Name of the test dataset
    'survival_dummy.tsv', # Survival file of the test set
)

# Predict the labels on the test dataset
boosting.predict_labels_on_test_dataset()
# Compute C-index
boosting.compute_c_indexes_for_test_dataset()
# See cluster consistency
boosting.compute_clusters_consistency_for_test_labels()
```

The output folder is updated with the new output files:

```bash
stacked_TestProject
├── stacked_TestProject_features_anticorrelated_scores_per_clusters.tsv
├── stacked_TestProject_features_scores_per_clusters.tsv
├── stacked_TestProject_full_labels.tsv
├── stacked_TestProject_KM_plot_boosting_full.png
├── stacked_TestProject_proba_KM_plot_boosting_full.png
├── stacked_TestProject_TEST_DATA_1_KM_plot_boosting_test.png
├── stacked_TestProject_TEST_DATA_1_proba_KM_plot_boosting_test.png
├── stacked_TestProject_TEST_DATA_1_test_labels.tsv
├── stacked_TestProject_TEST_DATA_2_KM_plot_boosting_test.png
├── stacked_TestProject_TEST_DATA_2_proba_KM_plot_boosting_test.png
├── stacked_TestProject_TEST_DATA_2_test_labels.tsv
├── stacked_TestProject_test_fold_labels.tsv
├── stacked_TestProject_test_labels.tsv
└── stacked_TestProject_training_set_labels.tsv
```

file: stacked_TestProject_TEST_DATA_1_KM_plot_boosting_test.png

![test KM plot 1](./img/stacked_TestProject_TEST_DATA_1_KM_plot_boosting_test.png)

file: stacked_TestProject_TEST_DATA_2_KM_plot_boosting_test.png

![test KM plot 2](./img/stacked_TestProject_TEST_DATA_2_KM_plot_boosting_test.png)

## Distributed computation

Because SimDeepBoosting constructs an ensemble of models, it is well suited to distributing the construction of each individual SimDeep instance. To do so, we rely on the ray framework, which allows DeepProg to distribute the creation of each submodel across different clusters, nodes, or CPUs. The nodes, clusters, or local CPUs to use are configured when instantiating a new ray object with the ray [API](https://ray.readthedocs.io/en/latest/). It is, however, quite straightforward to define the number of instances launched on a local machine, as in the example below, in which 3 instances are used.
```python
# Instantiate a ray object that will create multiple workers
import ray
ray.init(webui_host='0.0.0.0', num_cpus=3)
# More options can be used (e.g. remote clusters, AWS, memory,...etc...)
# ray can be used locally to maximize the use of CPUs on the local machine
# See ray API: https://ray.readthedocs.io/en/latest/index.html

boosting = SimDeepBoosting(
    ...
    distribute=True, # Additional option to use ray cluster scheduler
    ...
)
...
# Processing
...

# Close clusters and free memory
ray.shutdown()
```
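
For larger runs, the same code can attach to an existing ray cluster instead of starting local workers; a minimal sketch, assuming a cluster was previously launched with the `ray start --head` CLI:

```python
import ray

# Attach to an already running ray cluster (started with `ray start --head`)
ray.init(address='auto')

# ... fit the SimDeepBoosting model as above ...

ray.shutdown()
```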

## More examples

More example scripts are available in `./examples/`; they will help you build a model from scratch with test and real data:
```bash
examples
├── create_autoencoder_from_scratch.py # Construct a simple DeepProg model on the dummy example dataset
├── example_with_dummy_data_distributed.py # Process the dummy example dataset using ray
├── example_with_dummy_data.py # Process the dummy example dataset
└── load_3_omics_model.py # Process the example HCC dataset
```