# Tutorial: Tuning DeepProg

DeepProg accepts various alternative hyperparameters to fit a model, including the choice of clustering, normalisation, embedding, autoencoder hyperparameters, the use/restriction of embedding and survival feature selection, the size of the holdout samples, and the ensemble model merging criterion. Furthermore, it can accept external methods to perform clustering, normalisation, or embedding. To help find the optimal combination of hyperparameters for a given dataset, we implemented an optional hyperparameter search module based on sequential model-based optimisation, relying on the [tune](https://docs.ray.io/en/master/tune.html) and [scikit-optimize](https://scikit-optimize.github.io/stable/) python libraries. The optional hyperparameter tuning performs a non-random iterative grid search and selects each new set of hyperparameters based on the performance of the past iterations. The computation can be entirely distributed thanks to the ray interface (see above).
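
To illustrate the idea behind sequential model-based optimisation (independently of the DeepProg API), here is a minimal scikit-optimize sketch: each new set of parameters to evaluate is proposed from the outcomes of the previous iterations. The toy objective function below is only an illustrative assumption; in DeepProg this role is played by the tuning metric.

```python
from skopt import gp_minimize

# Toy 1D objective (illustrative only); in DeepProg the objective is the
# tuning metric, e.g. the -log10 log-rank p-value of the holdout folds
def objective(params):
    x, = params
    return (x - 2.0) ** 2

# Sequential model-based optimisation: each of the 15 evaluations is chosen
# using a surrogate model fitted on the results of the previous ones
result = gp_minimize(objective, dimensions=[(-5.0, 5.0)], n_calls=15, random_state=0)
print(result.x, result.fun)
```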

A DeepProg instance depends on many hyperparameters. The most important hyperparameters to tune are:

* The combination of `-nb_it` (number of submodels), `-split_n_fold` (how each submodel is randomly constructed) and `-seed` (random seed).
* The number of clusters `-nb_clusters`
* The clustering algorithm (implemented: `kmeans`, `mixture`, `coxPH`, `coxPHMixture`)
* The preprocessing normalization (`-normalization` option, see `Tutorial: Advanced usage of DeepProg model`)
* The embedding used (`alternative_embedding` option)
* The way of creating the new survival features (`-feature_selection_usage` option)

## A first example

A first example of tuning is available in the [example](../../../examples/example_hyperparameters_tuning.py) folder (example_hyperparameters_tuning.py). The first part of the script defines the array of hyperparameters to screen. An instance of `SimDeepTuning` is created, in which the output folder and the project name are defined.

```python
from simdeep.simdeep_tuning import SimDeepTuning

# AgglomerativeClustering is an external class that can be used as
# a clustering algorithm since it has a fit_predict method
from sklearn.cluster import AgglomerativeClustering

# Array of hyperparameters to screen
args_to_optimize = {
    'seed': [100, 200, 300, 400],
    'nb_clusters': [2, 3, 4, 5],
    'cluster_method': ['mixture', 'coxPH', 'coxPHMixture',
                       AgglomerativeClustering],
    'use_autoencoders': (True, False),
    'class_selection': ('mean', 'max'),
}

tuning = SimDeepTuning(
    args_to_optimize=args_to_optimize,
    nb_threads=nb_threads,
    survival_tsv=SURVIVAL_TSV,
    training_tsv=TRAINING_TSV,
    path_data=PATH_DATA,
    project_name=PROJECT_NAME,
    path_results=PATH_DATA,
)
```

The SimDeepTuning module requires the `ray` and `tune` Python modules.

```python
import ray

ray.init(webui_host='0.0.0.0')
```
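
When all the tuning experiments are finished, the ray resources can be released. This is standard `ray` usage rather than a DeepProg-specific step:

```python
# Stop the ray processes started by ray.init() once the tuning is done
ray.shutdown()
```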

### SimDeepTuning hyperparameters
* `num_samples` is the number of experiments
* `distribute_deepprog` is used to further distribute each DeepProg instance within the ray framework. If set to `True`, be sure to have a large number of CPUs available and/or to use a small `max_concurrent`.
* `max_concurrent` is the maximum number of experiments run in parallel.
* `iterations` is the number of iterations to run for each experiment (the results will be averaged).

DeepProg can be tuned using different objective metrics:
* `"log_test_fold_pvalue"`: uses the stacked *out-of-bag* samples (survival and labels) to compute the -log10(log-rank Cox-PH p-value)
* `"log_full_pvalue"`: minimizes the Cox-PH log-rank p-value of the model (this metric can lead to overfitting since it relies on all the samples included in the model)
* `"test_fold_cindex"`: maximizes the mean c-index of the test folds
* `"cluster_consistency"`: maximizes the adjusted Rand scores computed for all the model pairs (higher values imply more stable clusters)

```python
tuning.fit(
    # We will use the holdout samples' Cox-PH p-value as the objective
    metric='log_test_fold_pvalue',
    num_samples=35,
    # Experiments are run concurrently using ray as the dispatcher
    max_concurrent=2,
    # In addition, each DeepProg model will be distributed
    distribute_deepprog=True,
    iterations=1)

# We recommend using a large `max_concurrent` and distribute_deepprog=True
# when a large number of CPUs and a large amount of RAM are available

# Results
table = tuning.get_results_table()
print(table)
```

## Tuning using one or multiple test datasets

The computation of labels and the associated metrics from external test datasets can be included in the tuning workflow and used as objective metrics. Please refer to the [example](../../../examples/example_hyperparameters_tuning_with_dataset.py) folder (see example_hyperparameters_tuning_with_dataset.py).

Let's define two dummy test datasets:

```python
# We will use the methylation and the RNA values as test datasets
test_datasets = {
    'testdataset1': ({'METH': 'meth_dummy.tsv'}, 'survival_dummy.tsv'),
    'testdataset2': ({'RNA': 'rna_dummy.tsv'}, 'survival_dummy.tsv'),
}
```

We then include these two datasets when instantiating the `SimDeepTuning` instance:

```python
tuning = SimDeepTuning(
    args_to_optimize=args_to_optimize,
    test_datasets=test_datasets,
    survival_tsv=SURVIVAL_TSV,
    training_tsv=TRAINING_TSV,
    path_data=PATH_DATA,
    project_name=PROJECT_NAME,
    path_results=PATH_DATA,
)
```

Finally, we fit the model using an objective metric that accounts for the test datasets:

* `"log_test_pval"` maximizes the sum of the -log10(log-rank Cox-PH p-value) over the test datasets
* `"test_cindex"` maximizes the mean of the test C-indexes
* `"sum_log_pval"` maximizes the sum of the model's -log10(log-rank Cox-PH p-value) and all the test dataset p-values
* `"mix_score"`: maximizes the product of `"sum_log_pval"`, `"cluster_consistency"`, and `"test_fold_cindex"`

```python
tuning.fit(
    metric='log_test_pval',
    num_samples=10,
    distribute_deepprog=True,
    max_concurrent=2,
    # iterations is useful to take into account the variability of the DL parameter fitting
    iterations=1,
)

table = tuning.get_results_table()
tuning.save_results_table()
```

## Results

The results will be generated in the `path_results` folder, with one results folder per experiment. The report of all the experiments and metrics will be written in the result tables generated in the `path_results` folder. Once a model achieves satisfactory performance, it is possible to use it directly by loading the generated labels with the `fit_on_pretrained_label_file` API (see the section `Save / load models from precomputed sample labels`).
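
As a follow-up sketch (continuing from the tuning examples above), the results table can be used to pick the best experiment. This assumes that `get_results_table()` returns a pandas DataFrame and that the objective metric appears as a column named after the metric passed to `tuning.fit()`; both the return type and the column name are assumptions here.

```python
# Continues from the tuning example above (the `tuning` object already exists)
table = tuning.get_results_table()

# Assumption: the table has one row per experiment and a column named after the
# objective metric (here 'log_test_fold_pvalue'); higher values are better
best_runs = table.sort_values('log_test_fold_pvalue', ascending=False).head(5)
print(best_runs)

# The labels of the best run can then be reused through the
# `fit_on_pretrained_label_file` API (see the section
# `Save / load models from precomputed sample labels`)
```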

## Recommendation

* Depending on the size N of the hyperparameter grid (i.e. the number of possible combinations), it is recommended to perform at least sqrt(N) experiments; a larger number of experiments will always explore a larger part of the hyperparameter space and can improve the final performance (see the short calculation after this list).
* `seed` is definitely a hyperparameter to screen, especially for a small number of models `nb_it` (fewer than 50). It is recommended to screen at least 8-10 different seeds when using `nb_it` < 20.
* Please test your configuration using a small `num_samples` first.
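
As a rough illustration of the sqrt(N) rule of thumb applied to the hyperparameter grid of the first example above:

```python
from math import prod, sqrt

# Number of values screened for each option in the first example
n_values_per_option = {
    'seed': 4,
    'nb_clusters': 4,
    'cluster_method': 4,
    'use_autoencoders': 2,
    'class_selection': 2,
}

n_combinations = prod(n_values_per_option.values())  # 4 * 4 * 4 * 2 * 2 = 256
print(n_combinations)             # 256 possible combinations
print(int(sqrt(n_combinations)))  # 16 -> a reasonable lower bound for num_samples
```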