# Tutorial: Ensemble of DeepProg model

In this second tutorial, we will build a more complex DeepProg model consisting of an ensemble of sub-models, each trained on a subset of the data. For that purpose, we need to use the `SimDeepBoosting` class:

```python
from simdeep.simdeep_boosting import SimDeepBoosting

help(SimDeepBoosting)
```

Similarly to the `SimDeep` class, we define our training dataset:

```python
# Location of the input matrices and survival file
from simdeep.config import PATH_DATA

from collections import OrderedDict

# Example tsv files
tsv_files = OrderedDict([
    ('MIR', 'mir_dummy.tsv'),
    ('METH', 'meth_dummy.tsv'),
    ('RNA', 'rna_dummy.tsv'),
])

# File with survival event
survival_tsv = 'survival_dummy.tsv'
```

## Instantiation

Then, we define the arguments specific to DeepProg and instantiate the class:

```python
project_name = 'stacked_TestProject'
epochs = 10 # Autoencoder epochs. Other hyperparameters can be fine-tuned. See the example files
seed = 3 # random seed used for reproducibility
nb_it = 5 # Number of models to be fitted, each using only a subset of the training data
nb_threads = 2 # Number of threads used to compute the survival functions
PATH_RESULTS = "./"

boosting = SimDeepBoosting(
    nb_threads=nb_threads,
    nb_it=nb_it,
    split_n_fold=3,
    survival_tsv=survival_tsv,
    training_tsv=tsv_files,
    path_data=PATH_DATA,
    project_name=project_name,
    path_results=PATH_RESULTS,
    epochs=epochs,
    seed=seed)
```

Here, we define a DeepProg model that will create 5 SimDeep instances, each based on a subset of the original training dataset. The number of instances is defined by the `nb_it` argument. Other arguments related to the autoencoder construction, such as `epochs`, can be set during class instantiation.

## Fitting

Once the model is defined, we can fit it:

```python
# Fit the model
boosting.fit()
# Predict and write the labels
boosting.predict_labels_on_full_dataset()
```

Some output files are generated in the output folder:

```bash
stacked_TestProject
├── stacked_TestProject_full_labels.tsv
├── stacked_TestProject_KM_plot_boosting_full.png
├── stacked_TestProject_proba_KM_plot_boosting_full.png
├── stacked_TestProject_test_fold_labels.tsv
└── stacked_TestProject_training_set_labels.tsv
```

The inferred labels, label probabilities, survival times, and events are written to the `stacked_TestProject_full_labels.tsv` file:

```bash
sample_test_48 1 0.474781026865 332.0 1.0
sample_test_49 1 0.142554926379 120.0 0.0
sample_test_46 1 0.355333486034 355.0 1.0
sample_test_47 0 0.618825352398 48.0 1.0
sample_test_44 1 0.346797097671 179.0 0.0
sample_test_45 1 0.0254692404734 360.0 0.0
sample_test_42 1 0.441997226254 346.0 1.0
sample_test_43 1 0.0783603292911 335.0 0.0
sample_test_40 1 0.380182410315 149.0 0.0
sample_test_41 0 0.953659261728 155.0 1.0
```

Note that the label probability corresponds to the probability of belonging to the subtype with the lowest survival rate.
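
For downstream analyses, this file can be loaded directly, for instance with pandas. A minimal sketch: the file shown above is tab-separated without a header row, and the column names below are our own, not part of the DeepProg output:

```python
import pandas as pd

# Load the labels written by predict_labels_on_full_dataset();
# the column names are ours (hypothetical), the file has no header
labels = pd.read_csv(
    'stacked_TestProject/stacked_TestProject_full_labels.tsv',
    sep='\t', header=None,
    names=['sample', 'label', 'proba', 'days', 'event'])

# e.g. samples confidently assigned to the low-survival subtype
print(labels[labels['proba'] > 0.9])
```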

Two KM plots are also generated, one using the cluster labels:

![KM plot BoostingFull](./img/stacked_TestProject_KM_plot_boosting_full.png)

and one using the dichotomized cluster label probabilities:

![KM plot proba BoostingFull](./img/stacked_TestProject_proba_KM_plot_boosting_full.png)

We can also compute the feature importance per cluster:

```python
# Compute the feature importance
boosting.compute_feature_scores_per_cluster()
# Write the feature importance
boosting.write_feature_score_per_cluster()
```

The results are added to the output folder:

```bash
stacked_TestProject
├── stacked_TestProject_features_anticorrelated_scores_per_clusters.tsv
├── stacked_TestProject_features_scores_per_clusters.tsv
├── stacked_TestProject_full_labels.tsv
├── stacked_TestProject_KM_plot_boosting_full.png
├── stacked_TestProject_proba_KM_plot_boosting_full.png
├── stacked_TestProject_test_fold_labels.tsv
└── stacked_TestProject_training_set_labels.tsv
```
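
The score files can then be browsed to extract, say, the top features of each cluster. A minimal sketch: we assume one row per (cluster, feature) pair with a score column and no header, which may not match the actual layout, so verify against your file first:

```python
import pandas as pd

# Hypothetical layout: cluster <tab> feature <tab> score, no header row
scores = pd.read_csv(
    'stacked_TestProject/stacked_TestProject_features_scores_per_clusters.tsv',
    sep='\t', header=None, names=['cluster', 'feature', 'score'])

# Print the 10 highest-scoring features for each cluster
for cluster, group in scores.groupby('cluster'):
    print(cluster)
    print(group.nlargest(10, 'score'))
```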

## Evaluate the models

DeepProg can compute metrics specific to the ensemble of models:

```python
# Compute internal metrics
boosting.compute_clusters_consistency_for_full_labels()

# Collect c-index
boosting.compute_c_indexes_for_full_dataset()
# Evaluate cluster performance
boosting.evalutate_cluster_performance()
# Collect more c-indexes
boosting.collect_cindex_for_test_fold()
boosting.collect_cindex_for_full_dataset()
boosting.collect_cindex_for_training_dataset()

# See the average number of significant features per omic across the submodels
boosting.collect_number_of_features_per_omic()
```
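
The C-index can also be computed by hand from the label file, for instance with `lifelines` (a sketch reusing the `labels` dataframe loaded earlier; using the label probability as the risk score is our choice for illustration, not necessarily what DeepProg does internally):

```python
from lifelines.utils import concordance_index

# Concordance between predicted risk and observed survival.
# concordance_index expects higher scores for longer survival, and a higher
# label probability means higher risk here, hence the minus sign.
ci = concordance_index(labels['days'], -labels['proba'], labels['event'])
print('C-index: {0:.3f}'.format(ci))
```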

## Predicting on test dataset

We can then load and evaluate a first test dataset:

```python
boosting.load_new_test_dataset(
    {'RNA': 'rna_dummy.tsv'}, # OMIC file of the test set; it does not have to be the same as for training
    'TEST_DATA_1', # Name of the test dataset
    'survival_dummy.tsv', # [OPTIONAL] Survival file of the test set; useful to compute accuracy metrics on the test dataset
)

# Predict the labels on the test dataset
boosting.predict_labels_on_test_dataset()
# Compute C-index
boosting.compute_c_indexes_for_test_dataset()
# See cluster consistency
boosting.compute_clusters_consistency_for_test_labels()
```

We can then load and evaluate a second test dataset:

```python
boosting.load_new_test_dataset(
    {'MIR': 'mir_dummy.tsv'}, # OMIC file of the test set; it does not have to be the same as for training
    'TEST_DATA_2', # Name of the test dataset
    'survival_dummy.tsv', # Survival file of the test set
)

# Predict the labels on the test dataset
boosting.predict_labels_on_test_dataset()
# Compute C-index
boosting.compute_c_indexes_for_test_dataset()
# See cluster consistency
boosting.compute_clusters_consistency_for_test_labels()
```

The output folder is updated with the new output files:

```bash
stacked_TestProject
├── stacked_TestProject_features_anticorrelated_scores_per_clusters.tsv
├── stacked_TestProject_features_scores_per_clusters.tsv
├── stacked_TestProject_full_labels.tsv
├── stacked_TestProject_KM_plot_boosting_full.png
├── stacked_TestProject_proba_KM_plot_boosting_full.png
├── stacked_TestProject_TEST_DATA_1_KM_plot_boosting_test.png
├── stacked_TestProject_TEST_DATA_1_proba_KM_plot_boosting_test.png
├── stacked_TestProject_TEST_DATA_1_test_labels.tsv
├── stacked_TestProject_TEST_DATA_2_KM_plot_boosting_test.png
├── stacked_TestProject_TEST_DATA_2_proba_KM_plot_boosting_test.png
├── stacked_TestProject_TEST_DATA_2_test_labels.tsv
├── stacked_TestProject_test_fold_labels.tsv
├── stacked_TestProject_test_labels.tsv
└── stacked_TestProject_training_set_labels.tsv
```

file: stacked_TestProject_TEST_DATA_1_KM_plot_boosting_test.png

![](./img/stacked_TestProject_TEST_DATA_1_KM_plot_boosting_test.png)

file: stacked_TestProject_TEST_DATA_2_KM_plot_boosting_test.png

![](./img/stacked_TestProject_TEST_DATA_2_KM_plot_boosting_test.png)

## Distributed computation

Because SimDeepBoosting constructs an ensemble of models, it is well suited to distributing the construction of the individual SimDeep instances. For this task, DeepProg uses the ray framework, which allows it to distribute the creation of each submodel across clusters, nodes, or CPUs. The nodes, clusters, or local CPUs to use are configured when instantiating a new ray object with the ray [API](https://ray.readthedocs.io/en/latest/). It is, however, quite straightforward to define the number of instances launched on a local machine, as in the example below, which uses 3 instances.

```python
# Instantiate a ray object that will create multiple workers
import ray
ray.init(webui_host='0.0.0.0', num_cpus=3)
# More options can be used (e.g. remote clusters, AWS, memory, etc.)
# ray can also be used locally to maximize the use of CPUs on the local machine
# See the ray API: https://ray.readthedocs.io/en/latest/index.html

boosting = SimDeepBoosting(
    ...
    distribute=True, # Additional option to use the ray cluster scheduler
    ...
)
...
# Processing
...

# Close the cluster and free memory
ray.shutdown()
```
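
If a ray cluster is already running (for example, one started with `ray start` on a head node), ray can attach to it instead of spawning local workers. A minimal sketch, assuming a head node that ray can discover; check the ray documentation for the options available in your ray version:

```python
import ray

# Connect to an existing ray cluster instead of starting local workers;
# 'auto' lets ray discover the running head node (assumes `ray start` was run)
ray.init(address='auto')
```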

## More examples

More example scripts are available in `./examples/`; they will help you build a model from scratch with test and real data:

```bash
examples
├── create_autoencoder_from_scratch.py # Construct a simple DeepProg model on the dummy example dataset
├── example_with_dummy_data_distributed.py # Process the dummy example dataset using ray
├── example_with_dummy_data.py # Process the dummy example dataset
└── load_3_omics_model.py # Process the example HCC dataset
```