# Survival Integration of Multi-omics using Deep-Learning (DeepProg)

This package allows combining multi-omics data with survival data. Using autoencoders, the pipeline creates new features and identifies those linked with survival using Cox-PH regression.
The omic data used in the original study are RNA-Seq, miRNA, and methylation. However, this approach can be extended to any combination of omic data.

The current package contains the omic data used in the study and a copy of the computed model. However, it is very easy to recreate a new model from scratch using any combination of omic data.
The omic data and the survival files should be in TSV (tab-separated values) format, and examples are provided. The deep-learning framework uses Keras, which runs on top of Theano, TensorFlow, or CNTK.

A more complete documentation with an API description is also available at [https://deepprog-garmires-lab.readthedocs.io/en/latest/index.html](https://deepprog-garmires-lab.readthedocs.io/en/latest/index.html)

### Documentation section
* [Installation](https://deepprog-garmires-lab.readthedocs.io/en/latest/installation.html)
* [Tutorial: Simple DeepProg model](https://deepprog-garmires-lab.readthedocs.io/en/latest/usage.html)
* [Tutorial: Ensemble of DeepProg model](https://deepprog-garmires-lab.readthedocs.io/en/latest/usage_ensemble.html)
* [Tutorial: Advanced usage of DeepProg model](https://deepprog-garmires-lab.readthedocs.io/en/latest/usage_advanced.html)
* [Tutorial: use DeepProg from the docker image](https://deepprog-garmires-lab.readthedocs.io/en/latest/usage_with_docker.html)
* [Case study: Analyzing TCGA HCC dataset](https://deepprog-garmires-lab.readthedocs.io/en/latest/case_study.html)
* [Tutorial: Tuning DeepProg](https://deepprog-garmires-lab.readthedocs.io/en/latest/usage_tuning.html)

## Requirements
* Python 2 or 3 (Python 3 is recommended)
* Either theano, tensorflow, or CNTK (tensorflow is recommended)
* [theano](http://deeplearning.net/software/theano/install.html) (the version used for the manuscript was 0.8.2)
* [tensorflow](https://www.tensorflow.org/) as a more robust alternative to theano
* [cntk](https://github.com/microsoft/CNTK), another DL library that can present some advantages compared to tensorflow or theano. See [https://docs.microsoft.com/en-us/cognitive-toolkit/](https://docs.microsoft.com/en-us/cognitive-toolkit/)
* scikit-learn (>=0.18)
* numpy, scipy
* lifelines
* (if using python3) scikit-survival
* (for distributed computing) the ray framework (ray >= 0.8.4)
* (for hyperparameter tuning) scikit-optimize

```bash
pip3 install tensorflow

# Alternative to tensorflow, the original backend used
pip3 install theano

# If you want to use theano or CNTK, edit the Keras configuration
nano ~/.keras/keras.json
```

## Tested python package versions
Python 3.8 (tested on Linux and OSX. For Windows, Visual C++ is required and LongPathsEnabled should be set to 1 in the Windows registry)
* tensorflow == 2.4.1 (2.4.1 currently doesn't seem to work with python3.9)
* keras == 2.4.3
* ray == 0.8.4
* scikit-learn == 0.23.2
* scikit-survival == 0.14.0 (currently doesn't seem to work with python3.9)
* lifelines == 0.25.5
* scikit-optimize == 0.8.1 (currently doesn't seem to work with python3.9)
* mpld3 == 0.5.1

Since ray and tensorflow are rapidly evolving libraries, newer versions might unfortunately break DeepProg's API. To avoid any dependency issues, we recommend working inside a Python 3 [virtual environment](https://docs.python.org/3/tutorial/venv.html) (`virtualenv`) and installing the tested package versions, as sketched below.

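A minimal setup sketch, assuming the repository has already been cloned so that `requirements_tested.txt` is available (the environment name is arbitrary, and the standard-library `venv` module is used here in place of `virtualenv`):

```bash
# Create and activate an isolated Python 3 environment
python3 -m venv deepprog-env
source deepprog-env/bin/activate

# Install the tested package versions pinned in the repository
pip install -r requirements_tested.txt
```
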
### Installation (local)

```bash
# The download can take a few minutes due to the size of the git project
git clone https://github.com/lanagarmire/DeepProg.git
cd DeepProg

# Install with conda
conda env create -n deepprog -f ./environment.yml python=3.8
conda activate deepprog
pip install -e . -r requirements_tested.txt

# (RECOMMENDED) to install the tested python library versions
pip install -e . -r requirements_tested.txt

## Alternative installations

# Basic installation
pip3 install -e . -r requirements.txt
# To install the distributed frameworks
pip3 install -r requirements_distributed.txt
# Installing scikit-survival (python3 only)
pip3 install -r requirements_pip3.txt

# DeepProg also works with python2/pip2; however, there is no support for scikit-survival in python2
pip2 install -e . -r requirements.txt
pip2 install -e . -r requirements_distributed.txt

# Install ALL required dependencies with the most up-to-date packages
pip install -e . -r requirements_all.txt
```

### Installation with docker
We have created a docker image (`opoirion/deepprog_docker:v1`) with all the dependencies already installed. For the docker (and singularity) instructions, please refer to the docker [tutorial](https://deepprog-garmires-lab.readthedocs.io/en/latest/usage_with_docker.html) (see above).

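Pulling the image is a single command; the exact way to run DeepProg inside the container (mounted volumes, entry point) is described in the docker tutorial linked above, so the run line below is only an illustrative sketch:

```bash
# Download the pre-built image with all dependencies installed
docker pull opoirion/deepprog_docker:v1

# Illustrative only: start a container with the current directory mounted
# (see the docker tutorial for the exact docker/singularity run commands)
docker run -it --rm -v "$(pwd)":/data opoirion/deepprog_docker:v1
```
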
### Support for CNTK / tensorflow
* We originally used Keras with theano as the backend platform. However, [Tensorflow](https://www.tensorflow.org/) (currently used as the default) and [CNTK](https://docs.microsoft.com/en-us/cognitive-toolkit/) are more recent DL frameworks that can be faster or more stable than theano. Because keras supports these 3 backends, it is possible to use them as alternatives to theano. To change the backend, please configure the `$HOME/.keras/keras.json` file (see the official instructions [here](https://keras.io/backend/)).

The default configuration file looks like this:

```json
{
    "image_data_format": "channels_last",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "tensorflow"
}
```

### Distributed computation
* It is possible to use the python ray framework [https://github.com/ray-project/ray](https://github.com/ray-project/ray) to control the parallel computation of the multiple models. To use this framework, it first needs to be installed: `pip install ray`
* Alternatively, it is also possible to create the models one by one without the ray framework

### Visualisation module (Experimental)
* To visualise test sets projected into the multi-omic survival space, the `mpld3` module is required: `pip install mpld3`
* Note that the pip version of mpld3 presented a [bug](https://github.com/mpld3/mpld3/issues/434): `TypeError: array([1.]) is not JSON serializable`. However, the [newest](https://github.com/mpld3/mpld3) version of mpld3, available from github, solves this issue. It is therefore recommended to install the newest version, for instance as sketched below.

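One possible way to install mpld3 directly from its github repository (pip can install straight from a git URL):

```bash
# Install the development version of mpld3 from github to avoid the JSON serialization bug
pip install git+https://github.com/mpld3/mpld3
```
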
## Usage
* Test if simdeep is functional (i.e. that all the dependencies are correctly installed):

```bash
  python3 test/test_simdeep.py -v

  # Individual examples
  python3 examples/example_with_dummy_data.py
  python3 examples/example_with_dummy_data_distributed.py
  python3 examples/example_with_precomputed_labels.py
  python3 examples/example_hyperparameters_tuning.py
  python3 examples/example_hyperparameters_tuning_with_test_dataset.py
  ```

* All the default parameters are defined in the config file `./simdeep/config.py` but can be passed dynamically when instantiating the models (see the sketch after this list). Three types of parameters must be defined:
  * The training dataset (omics + survival input files)
    * In addition, the parameters of the test set, i.e. the omic dataset and the survival file
  * The parameters of the autoencoder (the default parameters work but they might be fine-tuned)
  * The parameters of the classification procedures (the defaults are generally good)

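As an illustration only, here is a hedged sketch of how these three groups of parameters could be passed when building an ensemble model. The constructor arguments are those used in the ensemble example below, except `nb_clusters`, which is an assumption to be checked against `help(SimDeepBoosting)` and `simdeep/config.py`:

```python
from collections import OrderedDict
from simdeep.simdeep_boosting import SimDeepBoosting

boosting = SimDeepBoosting(
    # 1) Training dataset: omic matrices + survival file (dummy example data)
    training_tsv=OrderedDict([('RNA', 'rna_dummy.tsv'), ('METH', 'meth_dummy.tsv')]),
    survival_tsv='survival_dummy.tsv',
    path_data='examples/data/',
    # 2) Autoencoder parameters (here, only the number of epochs)
    epochs=10,
    # 3) Classification parameters (nb_clusters is an assumption; see help(SimDeepBoosting))
    nb_clusters=2,
    project_name='dummy_parameters_example',
    path_results='examples/data/',
    seed=3)
```
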
## Example datasets and scripts
An omic .tsv file must have this format:

```bash
head mir_dummy.tsv

Samples        dummy_mir_0     dummy_mir_1     dummy_mir_2     dummy_mir_3 ...
sample_test_0  0.469656032287  0.347987447237  0.706633335508  0.440068758445 ...
sample_test_1  0.0453108219657 0.0234642968791 0.593393816691  0.981872970341 ...
sample_test_2  0.908784043793  0.854397550009  0.575879144667  0.553333958713 ...
...
```

A survival file must have this format:

```bash
head survival_dummy.tsv

Samples        days event
sample_test_0  134  1
sample_test_1  291  0
sample_test_2  125  1
sample_test_3  43   0
...
```

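Before running DeepProg, it can be useful to check that an omic matrix and its survival file refer to the same samples. A minimal sketch using pandas (not part of the DeepProg API; file names are those of the dummy dataset):

```python
import pandas as pd

# Load an omic matrix and its survival file (tab-separated, samples in the first column)
rna = pd.read_csv('examples/data/rna_dummy.tsv', sep='\t', index_col=0)
survival = pd.read_csv('examples/data/survival_dummy.tsv', sep='\t', index_col=0)

# Samples present in the omic matrix but missing from the survival file, and vice versa
print(set(rna.index) - set(survival.index))
print(set(survival.index) - set(rna.index))
```
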
As examples, we included two datasets:
* A dummy example dataset in the `examples/data/` folder:
```bash
examples
├── data
│   ├── meth_dummy.tsv
│   ├── mir_dummy.tsv
│   ├── rna_dummy.tsv
│   ├── rna_test_dummy.tsv
│   ├── survival_dummy.tsv
│   └── survival_test_dummy.tsv
```

* And a real dataset in the `data` folder. This dataset is derived from the TCGA HCC cancer dataset and needs to be decompressed before processing (see the command after the file listing below):

```bash
data
├── meth.tsv.gz
├── mir.tsv.gz
├── rna.tsv.gz
└── survival.tsv
```

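For example, the compressed matrices can be decompressed in place with gzip (the `-k` flag, which keeps the original archives, is optional):

```bash
# Decompress the TCGA HCC matrices before using them as input
gzip -dk data/meth.tsv.gz data/mir.tsv.gz data/rna.tsv.gz
```
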
## Creating a simple DeepProg model with one autoencoder for each omic
First, we will build a model using the example dataset from `./examples/data/` (these example files are set as the defaults in the config.py file). We will use them to show how to construct a single DeepProg model inferring an autoencoder for each omic.

```python

# The SimDeep class can be used to build one model with one autoencoder for each omic
from simdeep.simdeep_analysis import SimDeep
from simdeep.extract_data import LoadData

help(SimDeep) # to see all the functions
help(LoadData) # to see all the functions related to loading datasets

# Defining training datasets
from simdeep.config import TRAINING_TSV
from simdeep.config import SURVIVAL_TSV

dataset = LoadData(training_tsv=TRAINING_TSV, survival_tsv=SURVIVAL_TSV)

simDeep = SimDeep(dataset=dataset) # instantiate the model with the dummy example training dataset defined in the config file
simDeep.load_training_dataset() # load the training dataset
simDeep.fit() # fit the model

# Defining test datasets
from simdeep.config import TEST_TSV
from simdeep.config import SURVIVAL_TSV_TEST

simDeep.load_new_test_dataset(TEST_TSV, fname_key='dummy', path_survival_file=SURVIVAL_TSV_TEST)

# The test set is a dummy rna expression (generated randomly)
print(simDeep.dataset.test_tsv) # Defined in the config file
# The data type of the test set is also defined to match an existing type
print(simDeep.dataset.data_type) # Defined in the config file
simDeep.predict_labels_on_test_dataset() # Perform the classification analysis and label the test dataset

print(simDeep.test_labels)
print(simDeep.test_labels_proba)

simDeep.save_encoders('dummy_encoder.h5')

```

## Creating a DeepProg model using an ensemble of submodels

Secondly, we will build a more complex DeepProg model consisting of an ensemble of sub-models, each built from a subset of the data. For that purpose, we need to use the `SimDeepBoosting` class:

```python
from simdeep.simdeep_boosting import SimDeepBoosting

help(SimDeepBoosting)

from collections import OrderedDict


path_data = "../examples/data/"
# Example tsv files
tsv_files = OrderedDict([
          ('MIR', 'mir_dummy.tsv'),
          ('METH', 'meth_dummy.tsv'),
          ('RNA', 'rna_dummy.tsv'),
])

# File with survival event
survival_tsv = 'survival_dummy.tsv'

project_name = 'stacked_TestProject'
epochs = 10 # Autoencoder epochs. Other hyperparameters can be fine-tuned. See the example files
seed = 3 # random seed used for reproducibility
nb_it = 5 # This is the number of models to be fitted, each using only a subset of the training data
nb_threads = 2 # Number of threads used to compute the survival functions

boosting = SimDeepBoosting(
    nb_threads=nb_threads,
    nb_it=nb_it,
    split_n_fold=3,
    survival_tsv=survival_tsv,
    training_tsv=tsv_files,
    path_data=path_data,
    project_name=project_name,
    path_results=path_data,
    epochs=epochs,
    seed=seed)

# Fit the model
boosting.fit()
# Predict and write the labels
boosting.predict_labels_on_full_dataset()
# Compute internal metrics
boosting.compute_clusters_consistency_for_full_labels()
# Compute the feature importance
boosting.compute_feature_scores_per_cluster()
# Write the feature importance
boosting.write_feature_score_per_cluster()

boosting.load_new_test_dataset(
    {'RNA': 'rna_dummy.tsv'}, # OMIC file of the test set. It does not have to be the same as for training
    'TEST_DATA_1', # Name of the test set to be used
    'survival_dummy.tsv', # [OPTIONAL] Survival file of the test set
)

# Predict the labels on the test dataset
boosting.predict_labels_on_test_dataset()
# Compute the C-index
boosting.compute_c_indexes_for_test_dataset()
# See cluster consistency
boosting.compute_clusters_consistency_for_test_labels()

# [EXPERIMENTAL] method to plot the test dataset amongst the class kernel densities
boosting.plot_supervised_kernel_for_test_sets()
```

## Creating a distributed DeepProg model using an ensemble of submodels

We can allow DeepProg to distribute the creation of each submodel across different clusters/nodes/CPUs by using the ray framework.
The configuration of the nodes/clusters, or of the local CPUs to be used, needs to be done when instantiating a new ray object with the ray [API](https://ray.readthedocs.io/en/latest/). It is however quite straightforward to define the number of instances launched on a local machine, as in the example below in which 3 instances are used.

```python
# Instantiate a ray object that will create multiple workers
import ray
ray.init(webui_host='0.0.0.0', num_cpus=3)
# More options can be used (e.g. remote clusters, AWS, memory,...etc...)
# ray can be used locally to maximize the use of CPUs on the local machine
# See ray API: https://ray.readthedocs.io/en/latest/index.html

boosting = SimDeepBoosting(
    ...
    distribute=True, # Additional option to use the ray cluster scheduler
    ...
)
...
# Processing
...

# Close clusters and free memory
ray.shutdown()
```

## Hyperparameter search
DeepProg can accept various alternative hyperparameters to fit a model, including alternative clustering, normalisation, and embedding methods, the choice of autoencoder hyperparameters, the use/restriction of embedding and survival feature selection, the size of the holdout samples, and the ensemble model merging criterion. Furthermore, it can accept external methods to perform clustering, normalisation, or embedding. To help find the optimal combinations of hyperparameters for a given dataset, we implemented an optional hyperparameter search module based on sequential model-based optimisation, relying on the [tune](https://docs.ray.io/en/master/tune.html) and [scikit-optimize](https://scikit-optimize.github.io/stable/) python libraries. The optional hyperparameter tuning will perform a non-random iterative grid search and will select each new set of hyperparameters based on the performance of the past iterations. The computation can be entirely distributed thanks to the ray interface (see above).

```python

from simdeep.simdeep_tuning import SimDeepTuning

# AgglomerativeClustering is an external class that can be used as
# a clustering algorithm since it has a fit_predict method
from sklearn.cluster import AgglomerativeClustering

args_to_optimize = {
    'seed': [100, 200, 300, 400],
    'nb_clusters': [2, 3, 4, 5],
    'cluster_method': ['mixture', 'coxPH',
                       AgglomerativeClustering],
    'use_autoencoders': (True, False),
    'class_selection': ('mean', 'max'),
}

tuning = SimDeepTuning(
    args_to_optimize=args_to_optimize,
    nb_threads=nb_threads,
    survival_tsv=SURVIVAL_TSV,
    training_tsv=TRAINING_TSV,
    path_data=PATH_DATA,
    project_name=PROJECT_NAME,
    path_results=PATH_DATA,
)

ray.init(webui_host='0.0.0.0')


tuning.fit(
    # We will use the holdout samples Cox-PH pvalue as the objective
    metric='log_test_fold_pvalue',
    num_samples=25,
    # Experiments are run concurrently using ray as the dispatcher
    max_concurrent=2,
    # In addition, each DeepProg model will be distributed
    distribute_deepprog=True,
    iterations=1)

# We recommend using a large `max_concurrent` and distribute_deepprog=True
# when many CPUs and a large amount of RAM are available

# Results
table = tuning.get_results_table()
print(table)
```

## Save / load models

Two mechanisms exist to save and load models.
First, the models can be entirely saved and loaded using the `dill` (pickle-like) library.

```python
from simdeep.simdeep_utils import save_model
from simdeep.simdeep_utils import load_model

# Save previous boosting model
save_model(boosting, "./test_saved_model")

# Delete previous model
del boosting

# Load model
boosting = load_model("TestProject", "./test_saved_model")
boosting.predict_labels_on_full_dataset()

```

See an example of saving/loading a model in the example file: `examples/load_and_save_models.py`

However, this mechanism presents a major drawback since the saved models can be very large (all the hyperparameters, matrices, etc. are saved). Also, the equivalent dependencies and DL libraries need to be installed both on the machine computing the models and on the machine used to load them, which can lead to various errors.

A second solution is to save only the labels inferred for each submodel instance. These label files can then be loaded into a new DeepProg instance that will be used as a reference for building the classifier.

```python

# Fitting a model
boosting.fit()
# Saving individual labels
boosting.save_test_models_classes(
    path_results=PATH_PRECOMPUTED_LABELS # Where to save the labels
    )

boostingNew = SimDeepBoosting(
        survival_tsv=SURVIVAL_TSV, # Same reference training set as for the `boosting` model
        training_tsv=TRAINING_TSV, # Same reference training set as for the `boosting` model
        path_data=PATH_DATA,
        project_name=PROJECT_NAME,
        path_results=PATH_DATA,
        distribute=False, # Option to use the ray cluster scheduler (True or False)
    )

boostingNew.fit_on_pretrained_label_file(
    labels_files_folder=PATH_PRECOMPUTED_LABELS,
    file_name_regex="*.tsv")

boostingNew.predict_labels_on_full_dataset()
```

See the `examples/example_simdeep_start_from_pretrained_labels.py` example file.

## Example scripts

Example scripts are available in `./examples/`, which will help you build a model from scratch with test or real data:

```bash
examples
├── example_hyperparameters_tuning.py
├── example_hyperparameters_tuning_with_test_dataset.py
├── example_with_dummy_data_distributed.py
├── example_with_dummy_data.py
└── load_3_omics_model.py
```

### R installation (Alternative to Python lifelines)

In its first implementation, DeepProg used the R survival toolkits to fit the survival functions (Cox-PH models) and compute the concordance indexes. These functions have been replaced with the python toolkits lifelines and scikit-survival for more convenience and to avoid any compatibility issues. However, differences exist in the computation of the c-indexes between the python and R libraries. To use the original R functions, it is necessary to install the following R libraries:

* R
* the R "survival" package
* rpy2 3.4.4 (for python2, rpy2 can be installed with `pip install rpy2==2.8.6`; for python3, `pip3 install rpy2==2.8.6`)

```R
install.packages("survival")
install.packages("glmnet")
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("survcomp")
```

Then, when instantiating a `SimDeep` or a `SimDeepBoosting` object, the option `use_r_packages` needs to be set to `True`, for example:

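A minimal sketch, reusing the constructor arguments from the ensemble example above; only `use_r_packages` is new here:

```python
boosting = SimDeepBoosting(
    use_r_packages=True, # Use the R survival functions instead of lifelines / scikit-survival
    survival_tsv=survival_tsv,
    training_tsv=tsv_files,
    path_data=path_data,
    project_name=project_name,
    epochs=epochs,
    seed=seed)
```
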
## License

The project is licensed under the PolyForm Perimeter License 1.0.0.

(See https://polyformproject.org/licenses/)

## Citation

This package refers to our study published in Genome Medicine: [Multi-omics-based pan-cancer prognosis prediction using an ensemble of deep-learning and machine-learning models](https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-021-00930-x)

## Data availability

The matrices and the survival data used to compute the models are available here: [https://doi.org/10.6084/m9.figshare.14832813.v1](https://doi.org/10.6084/m9.figshare.14832813.v1)

## Contact and credentials
* Developer: Olivier Poirion (PhD)
* Contact: opoirion@hawaii.edu, o.poirion@gmail.com