# Tutorial: Advanced usage of DeepProg model

## Visualisation

Once a DeepProg model is fitted, it can be interesting to obtain different visualisations of the training or test samples, based on the new survival features inferred by the autoencoders. For that purpose, we developed two methods that project the samples into a 2D space and that can be called once a `SimDeepBoosting` or a `SimDeep` instance is fitted.

```python
# boosting class instance fitted using the ensemble tutorial
boosting.plot_supervised_predicted_labels_for_test_sets()
```

The first method transforms the OMIC matrix activities into the new survival feature space inferred by the autoencoders and projects the samples into a 2D space using a PCA analysis. The figure plots a kernel density estimate for each cluster and projects the labels of the test set.

![kdplot 1](./img/stacked_TestProject_TEST_DATA_2_KM_plot_boosting_test_kde_2_cropped.png)

A second, more sophisticated method uses the features inferred by the autoencoders to compute a new set of features by constructing a supervised network targeting the inferred subtype labels. This new set of features is then projected into a 2D space using a PCA analysis. This second method might produce more informative visualisations of the different clusters since it uses a supervised algorithm.

```python
boosting.plot_supervised_kernel_for_test_sets()
```

![kdplot 2](./img/stacked_TestProject_TEST_DATA_2_KM_plot_boosting_test_kde_1_cropped.png)

Note that these visualisations are not very informative for this example dataset, since we have only a limited number of samples (40) and features. However, they can become more useful for real datasets.

## Hyperparameters

Hyperparameters can have a considerable influence on the accuracy of DeepProg models. We chose default hyperparameters that should work on a wide range of datasets. However, specific datasets might require additional optimisations. Below, we list the hyperparameters that can be tuned.

### Normalisation

DeepProg uses by default a four-step normalisation for both training and test datasets:
1. Selection of the top 100 features according to their variance
2. Rank normalisation per sample
3. Sample-sample correlation similarity transformation
4. Rank normalisation

```python
default_normalisation = {
    'NB_FEATURES_TO_KEEP': 100,
    'TRAIN_RANK_NORM': True,
    'TRAIN_CORR_REDUCTION': True,
    'TRAIN_CORR_RANK_NORM': True,
}

boosting = SimDeepBoosting(
        normalization=default_normalisation
    )
```

However, it is possible to use other normalisations, using external Python classes that have `fit` and `fit_transform` methods.

```python
from sklearn.preprocessing import RobustScaler

custom_norm = {
    'CUSTOM': RobustScaler,
}

boosting = SimDeepBoosting(
        normalization=custom_norm
    )
```

Finally, more alternative normalisations are proposed in the config file.
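
For instance, assuming the keys shown in the default dictionary above can be adjusted or disabled (this should be checked against the config file), a lighter normalisation keeping more features and skipping the correlation-based steps might look like the following sketch:

```python
# Sketch: a custom normalisation reusing the keys of the default dictionary,
# keeping more features and skipping the correlation-based steps (assumption)
light_normalisation = {
    'NB_FEATURES_TO_KEEP': 200,
    'TRAIN_RANK_NORM': True,
    'TRAIN_CORR_REDUCTION': False,
    'TRAIN_CORR_RANK_NORM': False,
}

boosting = SimDeepBoosting(
        normalization=light_normalisation
    )
```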

### Number of clusters

The parameter `nb_clusters` is used to define the number of partitions to produce.

```python
# Example
boosting = SimDeepBoosting(
    nb_clusters=3)
boosting.fit()
```

### Clustering algorithm

By default, DeepProg uses a Gaussian mixture model from the scikit-learn library to perform clustering. The hyperparameters of the model are customisable using the `mixture_params` parameter:

```python
# Default params from the config file:

MIXTURE_PARAMS = {
    'covariance_type': 'diag',
    'max_iter': 1000,
    'n_init': 100
    }

boosting = SimDeepBoosting(
    mixture_params=MIXTURE_PARAMS,
    nb_clusters=3,
    cluster_method='mixture' # Default
    )
```

In addition to the Gaussian mixture model, three alternative clustering approaches are available: a) `kmeans`, which refers to the scikit-learn KMeans class, b) `coxPH`, which fits an L1-penalised multi-dimensional Cox-PH model and then dichotomises the samples into K groups using the predicted survival times, and c) `coxPHMixture`, which fits a mixture model on the survival times predicted by the L1-penalised Cox-PH model. The L1-penalised Cox-PH model is fitted using the scikit-survival `CoxnetSurvivalAnalysis` class for Python 3, so it cannot be computed when using Python 2. Finally, external clustering class instances can be used, as long as they have a `fit_predict` method returning an array of labels and accept an `nb_clusters` parameter.
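
For example, the built-in alternatives can be selected by name through `cluster_method`; a sketch reusing the constructor arguments shown above:

```python
# Sketch: selecting the built-in alternative clustering algorithms by name
boostingKmeans = SimDeepBoosting(
    nb_clusters=3,
    cluster_method='kmeans',  # scikit-learn KMeans
    )

boostingCoxPH = SimDeepBoosting(
    nb_clusters=3,
    cluster_method='coxPH',  # L1-penalised Cox-PH dichotomisation (Python 3 / scikit-survival only)
    )
```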

```python
# External clustering class having a fit_predict method
from sklearn.cluster import AgglomerativeClustering

boostingH = SimDeepBoosting(
        nb_clusters=3,
        cluster_method=AgglomerativeClustering
    )


class DummyClustering:
    def __init__(self, nb_clusters):
        """ """
        self.nb_clusters = nb_clusters

    def fit_predict(self, M):
        """ """
        import numpy as np
        return np.random.randint(0, self.nb_clusters, M.shape[0])


boostingDummy = SimDeepBoosting(
        nb_clusters=3,
        cluster_method=DummyClustering
    )
```

### Choice of specific OMIC for clustering

Not all the OMIC types need to be used for clustering (although they all are by default). The option `clustering_omics` controls which omics are used for clustering. However, all the input omics will be used to classify new samples, using the intersecting features across all the omics.

```python
boosting = SimDeepBoosting(
    ...
    clustering_omics=['RNA', 'MIR'], # Only 'RNA' and 'MIR' will be used for clustering the samples
    ...
    )
```

### Embedding and survival features selection

After each omic matrix is normalised, DeepProg transforms each feature matrix using, by default, an autoencoder network as embedding algorithm, and then selects the transformed features linked to survival using univariate Cox-PH models. Alternatively, DeepProg can accept any external embedding algorithm having `fit` and `transform` methods, following the scikit-learn nomenclature. For instance, the `PCA` and `FastICA` classes of the scikit-learn package can be used as replacements for the autoencoder.

```python
# Example using PCA as alternative embedding.

from sklearn.decomposition import PCA


boosting = SimDeepBoosting(
        nb_clusters=3,
        alternative_embedding=PCA,
    )
```

Another example is the use of the MAUI multi-omic method instead of the autoencoder:

```python
import numpy as np
import pandas as pd

from maui import Maui  # requires the maui package (assumption: installed separately)


class MauiFitting():

    def __init__(self, **kwargs):
        """ """
        self._kwargs = kwargs
        self.model = Maui(**kwargs)

    def fit(self, matrix):
        """ """
        self.model.fit({'cat': pd.DataFrame(matrix).T})

    def transform(self, matrix):
        """ """
        res = self.model.transform({'cat': pd.DataFrame(matrix).T})

        return np.asarray(res)


boosting = SimDeepBoosting(
        nb_clusters=3,
        alternative_embedding=MauiFitting,
        ...
    )
```

After the embedding step, DeepProg computes by default the individual feature contribution toward survival using a univariate Cox-PH model (`feature_selection_usage='individual'`). Alternatively, DeepProg can select features linked to survival using an l1-penalized multivariate Cox-PH model (`feature_selection_usage={'individual', 'lasso'}`). Finally, if the option `feature_surv_analysis` is set to False, DeepProg will skip the survival feature selection step.

```python
# Example using the l1-penalized Cox-PH model for selecting new survival features.

boosting = SimDeepBoosting(
        nb_clusters=3,
        feature_selection_usage='lasso',
        # feature_surv_analysis=False # Not using the feature selection step
        ...
    )
```

### Number of models and random splitting seed

A DeepProg model is constructed using an ensemble of submodels following the [Bagging](https://en.wikipedia.org/wiki/Ensemble_learning#Bootstrap_aggregating_(bagging)) methodology. Each submodel is created from a random split of the input dataset. Three parameters control the creation of the random splits (see the sketch after this list):

* `-nb_it <int>`, which defines the number of submodels to create.
* `-split_n_fold`, which controls how the dataset will be split for each submodel. If `-split_n_fold=2`, the input dataset will be split in 2 using the `KFold` class from [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) and the training / test set size ratio will be 0.5. If `-split_n_fold=3`, the training / test set size ratio will be 3 / 2, and so on.
* The `-seed` parameter ensures that the same random splitting is obtained, for constant `split_n_fold` and `nb_it`, across different DeepProg instances. Different seed values can produce different performances since they create different training datasets; this is especially true when using a low `nb_it` (below 50). Unfortunately, using a large `nb_it` such as 100 can be very computationally intensive, especially when tuning the models with other hyperparameters. However, tuning the model with a small `nb_it` can also achieve good to optimal performances (see next section).
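
Below is a minimal sketch of passing these three parameters to the `SimDeepBoosting` constructor (same names as above, without the leading dash used in the command-line notation):

```python
# Sketch: controlling the ensemble size and the random splits
boosting = SimDeepBoosting(
    nb_clusters=3,
    nb_it=10,          # number of submodels to create
    split_n_fold=3,    # KFold splitting used to build each submodel
    seed=2020,         # fixes the random splits across runs
    )
boosting.fit()
```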

## Usage of metadata associated with patients

DeepProg can accept an additional metadata file characterising each individual sample (patient). These metadata can optionally be used as covariates when constructing the DeepProg models or when inferring the features associated with each inferred subtype. The metadata file should be a samples x features table, with the first line as a header containing the variable names and the first column containing the sample IDs. The metadata file can also be used to filter a subset of samples.

```bash
# See the example metadata table from the file: examples/data/metadata_dummy.tsv:

head examples/data/metadata_dummy.tsv

barcode sex     stage
sample_test_0   M       I
sample_test_1   M       I
sample_test_2   M       I
sample_test_3   M       I
sample_test_4   M       I
sample_test_5   M       I
```

Each column containing only numeric values will be scaled using the sklearn `RobustScaler` method. Each column containing string values will be one-hot encoded, using all the possible values of the given feature, and the resulting columns will be stacked together.
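
The transformation described above is roughly equivalent to the following sketch (illustrative only, not DeepProg's internal code):

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Illustrative sketch of the metadata encoding described above
metadata = pd.read_csv('examples/data/metadata_dummy.tsv', sep='\t', index_col=0)

# String columns: one-hot encoded using all the possible values of each feature
one_hot = pd.get_dummies(metadata.select_dtypes(exclude='number'))

# Numeric columns (none in this dummy example): scaled with RobustScaler
numeric = metadata.select_dtypes(include='number')
if numeric.shape[1]:
    numeric = pd.DataFrame(RobustScaler().fit_transform(numeric),
                           index=numeric.index, columns=numeric.columns)

# Stacked covariate matrix used alongside the omic features
covariates = pd.concat([numeric, one_hot], axis=1)
```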

The metadata file and the metadata usage should be configured when instantiating a new `SimDeepBoosting` instance.

```python
# metadata file
OPTIONAL_METADATA = 'examples/data/metadata_dummy.tsv'
# dictionary used to filter samples based on their metadata values
# Multiple fields can be used
SUBSET_TRAINING_WITH_META = {'stage': ['I', 'II', 'III']}

boosting = SimDeepBoosting(
    survival_tsv=SURVIVAL_TSV,
    training_tsv=TRAINING_TSV,
    metadata_tsv=OPTIONAL_METADATA,
    metadata_usage='all',
    subset_training_with_meta=SUBSET_TRAINING_WITH_META,
    ...
    )
```

`metadata_usage` can have different values:

* `None` or `False`: the metadata will not be used for constructing the DeepProg models or for computing the significant features.
* `"labels"`: the metadata matrix will only be used as covariates when inferring the survival models from the inferred clustering labels.
* `"new-features"`: the metadata matrix will only be used as covariates when computing the survival models used to infer new features linked to survival.
* `"test-labels"`: the metadata matrix will only be used as covariates when inferring the survival models from the labels obtained for the test datasets.
* `"all"` or `True`: use the metadata matrix for all the usages described above.
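
For instance, a sketch restricting the metadata to the survival models fitted on the clustering labels only, reusing the variables from the example above:

```python
boosting = SimDeepBoosting(
    survival_tsv=SURVIVAL_TSV,
    training_tsv=TRAINING_TSV,
    metadata_tsv=OPTIONAL_METADATA,
    metadata_usage='labels',  # covariates only for the survival models on the cluster labels
    ...
    )
```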

## Computing cluster-specific feature signatures

Once a DeepProg model is fitted, two functions can be used to infer the feature signature of each subtype (see the sketch after this list):

* `compute_feature_scores_per_cluster`: performs a Mann-Whitney test between the expression of each feature within and outside the subtype.
* `compute_survival_feature_scores_per_cluster`: computes the log-rank p-value after fitting an individual Cox-PH model for each of the significant features inferred by `compute_feature_scores_per_cluster`.
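
A minimal sketch of calling both functions on a fitted model (assuming the `boosting` instance from the previous sections):

```python
# Mann-Whitney test of each feature within vs. outside each subtype
boosting.compute_feature_scores_per_cluster()

# Log-rank p-values from per-feature Cox-PH models, restricted to the
# significant features found by compute_feature_scores_per_cluster
boosting.compute_survival_feature_scores_per_cluster()
```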

## R installation (Alternative to Python lifelines)

In its first implementation, DeepProg used the R survival toolkits to fit the survival functions (Cox-PH models) and compute the concordance indexes. These functions have been replaced with the Python toolkits lifelines and scikit-survival, for more convenience and to avoid compatibility issues. However, differences exist regarding the computation of the c-indexes between the Python and R libraries. To use the original R functions, it is necessary to install the following dependencies:

* R
* the R "survival" package
* rpy2 3.4.4 (for python2, rpy2 can be installed with `pip install rpy2==2.8.6`; for python3, `pip3 install rpy2==2.8.6`)

```R
install.packages("survival")
install.packages("glmnet")
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("survcomp")
```

Then, when instantiating a `SimDeep` or a `SimDeepBoosting` object, the option `use_r_packages` needs to be set to `True`.

```python
boosting = SimDeepBoosting(
    ...
    use_r_packages=True,
    ...
)
```

## Save / load models

### Save / load the entire model

Two mechanisms exist to save and load DeepProg models, even though they deal with very voluminous data files.

First, the models can be entirely saved and loaded using the `dill` (pickle-like) library.

```python
from simdeep.simdeep_utils import save_model
from simdeep.simdeep_utils import load_model

# Save previous boosting model
save_model(boosting, "./test_saved_model")

# Delete previous model
del boosting

# Load model
boosting = load_model("TestProject", "./test_saved_model")
boosting.predict_labels_on_full_dataset()
```

See an example of saving/loading a model in the example file: `load_and_save_models.py`.

### Save / load models from precomputed sample labels

However, this mechanism presents a major drawback: the saved models can be very large (all the hyperparameters, matrices, etc. are saved). Also, the equivalent dependencies and deep-learning libraries need to be installed on both the machine computing the models and the machine used to load them, which can lead to various errors.

A second solution is to save only the labels inferred for each submodel instance. These label files can then be loaded into a new DeepProg instance that will be used as reference for building the classifier.

```python
# Fitting a model
boosting.fit()

# Saving individual labels
boosting.save_test_models_classes(
    path_results=PATH_PRECOMPUTED_LABELS # Where to save the labels
    )

boostingNew = SimDeepBoosting(
        survival_tsv=SURVIVAL_TSV, # Same reference training set as the `boosting` model
        training_tsv=TRAINING_TSV, # Same reference training set as the `boosting` model
        path_data=PATH_DATA,
        project_name=PROJECT_NAME,
        path_results=PATH_DATA,
        distribute=False, # Option to use the ray cluster scheduler (True or False)
    )

boostingNew.fit_on_pretrained_label_file(
    labels_files_folder=PATH_PRECOMPUTED_LABELS,
    file_name_regex="*.tsv")

boostingNew.predict_labels_on_full_dataset()
```