# BioDiscML

Large-scale automatic feature selection for biomarker discovery in high-dimensional
OMICs data

## Short description
Automates the execution of many machine learning algorithms across various
optimization and evaluation procedures to identify the best model and signature.

## Description
The identification of biomarker signatures in omics molecular profiling is an
important challenge for outcome prediction in a precision medicine context, such as
patient disease susceptibility, diagnosis, prognosis and treatment response. To
identify these signatures we present BioDiscML (Biomarker Discovery by Machine
Learning), a tool that automates the analysis of complex biological datasets
using machine learning methods. From a collection of samples and their associated
characteristics, i.e. the biomarkers (e.g. gene expression, protein levels,
clinico-pathological data), the goal of BioDiscML is to produce a minimal subset
of biomarkers and a model that will efficiently predict a specified outcome. To
this purpose, BioDiscML uses a large variety of machine learning algorithms to
select the best combination of biomarkers for predicting either a categorical or a
continuous outcome from highly unbalanced datasets. Finally, BioDiscML also
retrieves correlated biomarkers not included in the final model to better
understand the signature. The software has been implemented to automate all
machine learning steps, including data pre-processing, feature selection, model
selection, and performance evaluation.

https://github.com/mickaelleclercq/BioDiscML/

See also BioDiscViz (https://gitlab.com/SBouirdene/biodiscviz.git), which
includes consensus feature search, to visualize your results.

Full manuscript: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6532608/

## Requirements
JAVA 8 (https://www.java.com/en/download/)

## Program usage
BioDiscML can be started either by creating a config file or by using the command line.

### By config file
Before executing BioDiscML, a config file must be created. Use the template to
create your own. Everything is detailed in the config.conf file. Examples are
available in the Test_datasets folder at:
https://github.com/mickaelleclercq/BioDiscML/tree/master/release/Test_datasets

#### Train a new model
```Bash
java -jar biodiscml.jar -config config.conf -train
```
config_myData.conf (text file) quick start content example (see the release/Test_datasets folder).
This configuration takes as input a file (myData.csv) and names the project myProjectName.
A sampling (by default 2/3 for training and 1/3 for testing) is performed before the
classification procedure to predict the myOutcome class. One best model will
be selected based on the Repeated Holdout MCC performance on the train set.
config.conf file example:
```
project=myProjectName
trainFile=myData.csv
sampling=true
doClassification=true
classificationClassName=myOutcome
numberOfBestModels=1
numberOfBestModelsSortingMetric=TRAIN_TEST_RH_MCC
```

#### Resume an execution
Just add -resumeTraining=true to the command:
```Bash
java -jar biodiscml.jar -config config.conf -train -resumeTraining=true
```

#### Choose best model(s)
```Bash
java -jar biodiscml.jar -config config.conf -bestmodel
```
Best model selection can be executed when training has completed, has been
stopped, or is still running. This command reads the results file. Best
models are selected based on a strategy provided in the config file. You
can also choose your own models manually, by opening the results file
in an Excel-like program and ordering models by your favorite metrics or
filters. Each model has an identifier (modelID) you can pass to
the command. Example:
```Bash
java -jar biodiscml.jar -config config.conf -bestmodel modelID_1 modelID_2
```

#### Predict new data
```Bash
java -jar biodiscml.jar -config config.conf -predict
```
Once the best model is obtained, you can predict new data or evaluate a blind test set
that you put aside before training. The file should have the same format and structure
as the training input files. This file must contain at least all features of the
selected best model signature. Features present in the newData file but absent from
the signature of the model will simply be ignored during the prediction. If a
column with the class to predict is present, BioDiscML will also return error statistics.
config.conf file example:
```
project=myProjectName
newDataFile=myNewData.csv
doClassification=true
classificationClassName=class
modelFile=myBestModel.model
```

### By command line
The same parameters as in the config file can be passed directly on the command line.
Example:
```Bash
time java -jar biodiscml.jar -train -project=myProject -excluded=excludedColumn 
-doClassification=true -classificationClassName=class -trainFile=data.csv 
-debug=true -bootstrapFolds=10 -loocv=false -cpus=10 -computeBestModel=false 
-classificationFastWay=true -ccmd=bayes.AveragedNDependenceEstimators.A1DE -F 1 -M 1.0 -W
```
Note that the option -ccmd must stay at the end of the command line when classifier
parameters follow it.

## Output files
Note: {project_name} is set in the config.conf file

- {project_name}_a.*

A csv file and a copy in arff format (weka input format) are created here.
They contain the merged data of the input files with some adaptations.

- {project_name}_b.*

A csv file and a copy in arff format (weka input format) are also created here.
They are produced after feature ranking and are already a subset of
{project_name}_a.*. Feature ranking is performed by information gain
for a categorical class; features having an infogain < 0.0001 are discarded.
For a numerical class, RELIEFF is used; only the 1000 best features, or those
having a score greater than 0.0001, are kept.
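To make the categorical-class filtering rule concrete, here is an illustrative Python sketch of the information gain criterion. This is not BioDiscML's implementation (BioDiscML relies on weka, and continuous features are discretized before scoring); the sketch assumes discrete feature values:

```python
# Information gain of one discrete feature against a categorical class:
# gain = H(class) - H(class | feature). Features scoring below 0.0001
# would be discarded at the {project_name}_b.* step.
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(feature_values, labels):
    """H(class) minus the feature-conditional entropy of the class."""
    n = len(labels)
    by_value = {}
    for v, y in zip(feature_values, labels):
        by_value.setdefault(v, []).append(y)
    conditional = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - conditional
```

A feature that perfectly separates two balanced classes scores 1.0 bit; a feature independent of the class scores 0 and would be filtered out.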

- {project_name}_c.*results.csv

Results file. Summary of all trained models with their evaluation metrics
and selected attributes. Use the bestmodel command to extract models.
The column indexes in the selected attributes column correspond to the
{project_name}_b.*csv file.
For each model, we perform various evaluations summarized in this table:

| Header | Description |
| ------------ | ------------ |
| ID | Model unique identifier. Can be passed as argument for best model selection |
| Classifier | Machine learning classifier name |
| Options | Classifier hyperparameter options |
| OptimizedValue | Optimized criterion used for the feature selection procedure |
| SearchMode | Type of feature selection procedure: <br>- Forward stepwise selection (F)<br>- Backward stepwise selection (B)<br>- Forward stepwise selection and backward stepwise elimination (FB)<br>- Backward stepwise selection and forward stepwise elimination (BF)<br>- "top k" features |
| nbrOfFeatures | Number of features in the signature |
| TRAIN_10CV_ACC | 10-fold cross validation Accuracy on train set |
| TRAIN_10CV_AUC | 10-fold cross validation Area Under The Curve on train set |
| TRAIN_10CV_AUPRC | 10-fold cross validation Area Under Precision Recall Curve on train set |
| TRAIN_10CV_SEN | 10-fold cross validation Sensitivity on train set |
| TRAIN_10CV_SPE | 10-fold cross validation Specificity on train set |
| TRAIN_10CV_MCC | 10-fold cross validation Matthews Correlation Coefficient on train set |
| TRAIN_10CV_MAE | 10-fold cross validation Mean Absolute Error on train set |
| TRAIN_10CV_BER | 10-fold cross validation Balanced Error Rate on train set |
| TRAIN_10CV_FPR | 10-fold cross validation False Positive Rate on train set |
| TRAIN_10CV_FNR | 10-fold cross validation False Negative Rate on train set |
| TRAIN_10CV_PPV | 10-fold cross validation Positive Predictive Value on train set |
| TRAIN_10CV_FDR | 10-fold cross validation False Discovery Rate on train set |
| TRAIN_10CV_Fscore | 10-fold cross validation F-score on train set |
| TRAIN_10CV_kappa | 10-fold cross validation Kappa on train set |
| TRAIN_matrix | 10-fold cross validation confusion matrix on train set |
| TRAIN_LOOCV_ACC | Leave-One-Out Cross Validation Accuracy on train set |
| TRAIN_LOOCV_AUC | Leave-One-Out Cross Validation Area Under The Curve on train set |
| TRAIN_LOOCV_AUPRC | Leave-One-Out Cross Validation Area Under Precision Recall Curve on train set |
| TRAIN_LOOCV_SEN | Leave-One-Out Cross Validation Sensitivity on train set |
| TRAIN_LOOCV_SPE | Leave-One-Out Cross Validation Specificity on train set |
| TRAIN_LOOCV_MCC | Leave-One-Out Cross Validation Matthews Correlation Coefficient on train set |
| TRAIN_LOOCV_MAE | Leave-One-Out Cross Validation Mean Absolute Error on train set |
| TRAIN_LOOCV_BER | Leave-One-Out Cross Validation Balanced Error Rate on train set |
| TRAIN_RH_ACC | Repeated holdout Accuracy on train set |
| TRAIN_RH_AUC | Repeated holdout Area Under The Curve on train set |
| TRAIN_RH_AUPRC | Repeated holdout Area Under Precision Recall Curve on train set |
| TRAIN_RH_SEN | Repeated holdout Sensitivity on train set |
| TRAIN_RH_SPE | Repeated holdout Specificity on train set |
| TRAIN_RH_MCC | Repeated holdout Matthews Correlation Coefficient on train set |
| TRAIN_RH_MAE | Repeated holdout Mean Absolute Error on train set |
| TRAIN_RH_BER | Repeated holdout Balanced Error Rate on train set |
| TRAIN_BS_ACC | Bootstrap Accuracy on train set |
| TRAIN_BS_AUC | Bootstrap Area Under The Curve on train set |
| TRAIN_BS_AUPRC | Bootstrap Area Under Precision Recall Curve on train set |
| TRAIN_BS_SEN | Bootstrap Sensitivity on train set |
| TRAIN_BS_SPE | Bootstrap Specificity on train set |
| TRAIN_BS_MCC | Bootstrap Matthews Correlation Coefficient on train set |
| TRAIN_BS_MAE | Bootstrap Mean Absolute Error on train set |
| TRAIN_BS_BER | Bootstrap Balanced Error Rate on train set |
| TRAIN_BS.632+ | Bootstrap .632+ rule |
| TEST_ACC | Evaluation Accuracy on test set |
| TEST_AUC | Evaluation Area Under The Curve on test set |
| TEST_AUPRC | Evaluation Area Under Precision Recall Curve on test set |
| TEST_SEN | Evaluation Sensitivity on test set |
| TEST_SPE | Evaluation Specificity on test set |
| TEST_MCC | Evaluation Matthews Correlation Coefficient on test set |
| TEST_MAE | Evaluation Mean Absolute Error on test set |
| TEST_BER | Evaluation Balanced Error Rate on test set |
| TRAIN_TEST_RH_ACC | Repeated holdout Accuracy on merged train and test sets |
| TRAIN_TEST_RH_AUC | Repeated holdout Area Under The Curve on merged train and test sets |
| TRAIN_TEST_RH_AUPRC | Repeated holdout Area Under Precision Recall Curve on merged train and test sets |
| TRAIN_TEST_RH_SEN | Repeated holdout Sensitivity on merged train and test sets |
| TRAIN_TEST_RH_SPE | Repeated holdout Specificity on merged train and test sets |
| TRAIN_TEST_RH_MCC | Repeated holdout Matthews Correlation Coefficient on merged train and test sets |
| TRAIN_TEST_RH_MAE | Repeated holdout Mean Absolute Error on merged train and test sets |
| TRAIN_TEST_RH_BER | Repeated holdout Balanced Error Rate on merged train and test sets |
| TRAIN_TEST_BS_ACC | Bootstrap Accuracy on merged train and test sets |
| TRAIN_TEST_BS_AUC | Bootstrap Area Under The Curve on merged train and test sets |
| TRAIN_TEST_BS_AUPRC | Bootstrap Area Under Precision Recall Curve on merged train and test sets |
| TRAIN_TEST_BS_SEN | Bootstrap Sensitivity on merged train and test sets |
| TRAIN_TEST_BS_SPE | Bootstrap Specificity on merged train and test sets |
| TRAIN_TEST_BS_MCC | Bootstrap Matthews Correlation Coefficient on merged train and test sets |
| TRAIN_TEST_BS_MAE | Bootstrap Mean Absolute Error on merged train and test sets |
| TRAIN_TEST_BS_BER | Bootstrap Balanced Error Rate on merged train and test sets |
| TRAIN_TEST_BS_BER_BS.632+ | Bootstrap .632+ rule on merged train and test sets |
| AVG_BER | Average of all calculated Balanced Error Rates |
| STD_BER | Standard deviation of the calculated Balanced Error Rates |
| AVG_MAE | Average of all calculated Mean Absolute Errors |
| STD_MAE | Standard deviation of the calculated Mean Absolute Errors |
| AVG_MCC | Average of all calculated Matthews Correlation Coefficients |
| STD_MCC | Standard deviation of the calculated Matthews Correlation Coefficients |
| AttributeList | Selected features. Use the option -bestmodel to generate a report and get the features' full names |

Note that all columns referring to a test set will be empty if no test set has been generated or provided.
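Most of the classification metrics above derive from a confusion matrix. As a reference sketch (standard formulas, not BioDiscML code), MCC and BER for the binary case can be computed as follows:

```python
# MCC and BER from a binary confusion matrix:
#   tp/tn = true positives/negatives, fp/fn = false positives/negatives.
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; defined as 0 when a marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def ber(tp, tn, fp, fn):
    """Balanced Error Rate: mean of the per-class error rates."""
    return 0.5 * (fn / (tp + fn) + fp / (tn + fp))
```

MCC ranges from -1 to 1 (1 = perfect prediction, 0 = random), which is why it is robust as a sorting metric on highly unbalanced datasets; BER likewise weighs both classes equally regardless of their sizes.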

- {project_name}_d.{model_name}_{model_hyperparameters}_{feature_search_mode}.*details.txt

Detailed information about the model and its performance, with the full signature and
correlated features.

- {project_name}_.{model_name}_{model_hyperparameters}_{feature_search_mode}.*features.csv

Features retained by the model, in csv format.
If a test set has been generated or provided, a file will be generated for:
-- the train set (*.train_features.csv)
-- both train and test sets (*all_features.csv)

- {project_name}_.{model_name}_{model_hyperparameters}_{feature_search_mode}.*corrFeatures.csv

Features retained by the model with their correlated features, in csv format.
If a test set has been generated or provided, a file will be generated for:
-- the train set (*.train_corrFeatures.csv)
-- both train and test sets (*all_corrfeatures.csv)

- {project_name}_.{model_name}_{model_hyperparameters}_{feature_search_mode}.*roc.png

Bootstrap roc curves (EXPERIMENTAL).
Must be enabled in the configuration file.
If a test set has been generated or provided, a roc curve picture will be generated
for both train and test sets.

- {project_name}_.{model_name}_{model_hyperparameters}_{feature_search_mode}.*model

Serialized model compatible with weka.