# BioDiscML

Large-scale automatic feature selection for biomarker discovery in high-dimensional OMICs data

## Short description

BioDiscML automates the execution of many machine learning algorithms across various optimization and evaluation procedures to identify the best model and signature.
## Description

The identification of biomarker signatures from omics molecular profiles is an important challenge for outcome prediction in precision medicine, such as patient disease susceptibility, diagnosis, prognosis, and treatment response. To identify these signatures we present BioDiscML (Biomarker Discovery by Machine Learning), a tool that automates the analysis of complex biological datasets using machine learning methods. From a collection of samples and their associated characteristics, i.e. the biomarkers (e.g. gene expression, protein levels, clinico-pathological data), the goal of BioDiscML is to produce a minimal subset of biomarkers and a model that efficiently predicts a specified outcome. To this end, BioDiscML uses a large variety of machine learning algorithms to select the best combination of biomarkers for predicting either a categorical or a continuous outcome from highly unbalanced datasets. Finally, BioDiscML also retrieves correlated biomarkers not included in the final model to better understand the signature. The software automates all machine learning steps, including data pre-processing, feature selection, model selection, and performance evaluation.

https://github.com/mickaelleclercq/BioDiscML/

See also BioDiscViz (https://gitlab.com/SBouirdene/biodiscviz.git), which includes a consensus feature search, to visualize your results.

Full manuscript: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6532608/
## Requirements

Java 8 (https://www.java.com/en/download/)

## Program usage

BioDiscML can be started either with a config file or directly from the command line.
### By config file

Before executing BioDiscML, a config file must be created. Use the template to create your own; every option is documented in the config.conf file. Examples are available in the Test_datasets folder at:
https://github.com/mickaelleclercq/BioDiscML/tree/master/release/Test_datasets
#### Train a new model

```Bash
java -jar biodiscml.jar -config config.conf -train
```

Quick-start example of a config_myData.conf text file (see the release/Test_datasets folder). This configuration takes a file (myData.csv) as input and names the project myProjectName. A sampling (by default 2/3 for training and 1/3 for testing) is performed before the classification procedure that predicts the myOutcome class. One best model will be selected based on the Repeated Holdout MCC on the train set.

config.conf file example:
```
project=myProjectName
trainFile=myData.csv
sampling=true
doClassification=true
classificationClassName=myOutcome
numberOfBestModels=1
numberOfBestModelsSortingMetric=TRAIN_TEST_RH_MCC
```
#### Resume an execution

Add -resumeTraining=true to the command:
```Bash
java -jar biodiscml.jar -config config.conf -train -resumeTraining=true
```
#### Choose best model(s)

```Bash
java -jar biodiscml.jar -config config.conf -bestmodel
```

Best model selection can be executed when training has completed, has been stopped, or is still running. This command reads the results file. Best models are selected based on a strategy provided in the config file. You can also choose your own models manually by opening the results file in a spreadsheet program and sorting the models by your favorite metrics or filters. Each model has an identifier (modelID) that you can pass to the command. Example:
```Bash
java -jar biodiscml.jar -config config.conf -bestmodel modelID_1 modelID_2
```
#### Predict new data

```Bash
java -jar biodiscml.jar -config config.conf -predict
```

Once the best model is obtained, you can predict new data or evaluate a blind test set you put aside before training. The file should have the same format and structure as the training input files, and must contain at least all features of the selected best model's signature. Features present in the new data file but absent from the model's signature are simply ignored during prediction. If a column with the class to predict is present, BioDiscML will also report error statistics.

config.conf file example:
```
project=myProjectName
newDataFile=myNewData.csv
doClassification=true
classificationClassName=class
modelFile=myBestModel.model
```
### By command line

The same parameters as in the config file can be passed on the command line. Example:
```Bash
time java -jar biodiscml.jar -train -project=myProject -excluded=excludedColumn \
  -doClassification=true -classificationClassName=class -trainFile=data.csv \
  -debug=true -bootstrapFolds=10 -loocv=false -cpus=10 -computeBestModel=false \
  -classificationFastWay=true -ccmd=bayes.AveragedNDependenceEstimators.A1DE -F 1 -M 1.0 -W
```

Note that the -ccmd option must stay at the end of the command line when classifier parameters follow it.
## Output files

Note: {project_name} is set in the config.conf file.

- {project_name}_a.*

A csv file and a copy in arff format (the weka input format) are created here. They contain the merged data of the input files, with some adaptations.

- {project_name}_b.*

A csv file and a copy in arff format are also created here. They are produced after feature ranking and are already a subset of {project_name}_a.*. For a categorical class, feature ranking is performed by information gain, and features with an information gain below 0.0001 are discarded. For a numerical class, RELIEFF is used, and only the 1000 best features, or those with a score greater than 0.0001, are kept.
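The information-gain filter for categorical classes can be sketched as follows. This is a minimal illustration of the standard information-gain definition and the 0.0001 discard threshold described above, not BioDiscML's actual implementation (which relies on weka):

```python
# Illustrative sketch only: standard information gain of a discrete feature
# about a categorical class, with the < 0.0001 discard threshold.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Class entropy minus the entropy remaining after splitting on the feature."""
    gain = entropy(labels)
    n = len(labels)
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

def rank_and_filter(features, labels, threshold=0.0001):
    """Keep only the features whose information gain reaches the threshold."""
    return {name: values for name, values in features.items()
            if info_gain(values, labels) >= threshold}
```

A feature that perfectly separates the classes has information gain equal to the class entropy, while an uninformative feature scores 0 and is discarded.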
- {project_name}_c.*results.csv

Results file: a summary of all trained models with their evaluation metrics and selected attributes. Use the -bestmodel command to extract models. The column indexes in the selected attributes column correspond to the {project_name}_b.*csv file. For each model, we perform various evaluations, summarized in this table:

| Header | Description |
| ------------ | ------------ |
| ID | Model unique identifier. Can be passed as an argument for best model selection |
| Classifier | Machine learning classifier name |
| Options | Classifier hyperparameter options |
| OptimizedValue | Optimized criterion used for the feature selection procedure |
| SearchMode | Type of feature selection procedure: <br>- Forward stepwise selection (F)<br>- Backward stepwise selection (B)<br>- Forward stepwise selection and backward stepwise elimination (FB)<br>- Backward stepwise selection and forward stepwise elimination (BF)<br>- "top k" features |
| nbrOfFeatures | Number of features in the signature |
| TRAIN_10CV_ACC | 10-fold cross validation Accuracy on the train set |
| TRAIN_10CV_AUC | 10-fold cross validation Area Under the Curve on the train set |
| TRAIN_10CV_AUPRC | 10-fold cross validation Area Under the Precision-Recall Curve on the train set |
| TRAIN_10CV_SEN | 10-fold cross validation Sensitivity on the train set |
| TRAIN_10CV_SPE | 10-fold cross validation Specificity on the train set |
| TRAIN_10CV_MCC | 10-fold cross validation Matthews Correlation Coefficient on the train set |
| TRAIN_10CV_MAE | 10-fold cross validation Mean Absolute Error on the train set |
| TRAIN_10CV_BER | 10-fold cross validation Balanced Error Rate on the train set |
| TRAIN_10CV_FPR | 10-fold cross validation False Positive Rate on the train set |
| TRAIN_10CV_FNR | 10-fold cross validation False Negative Rate on the train set |
| TRAIN_10CV_PPV | 10-fold cross validation Positive Predictive Value on the train set |
| TRAIN_10CV_FDR | 10-fold cross validation False Discovery Rate on the train set |
| TRAIN_10CV_Fscore | 10-fold cross validation F-score on the train set |
| TRAIN_10CV_kappa | 10-fold cross validation Kappa on the train set |
| TRAIN_matrix | 10-fold cross validation confusion matrix on the train set |
| TRAIN_LOOCV_ACC | Leave-One-Out cross validation Accuracy on the train set |
| TRAIN_LOOCV_AUC | Leave-One-Out cross validation Area Under the Curve on the train set |
| TRAIN_LOOCV_AUPRC | Leave-One-Out cross validation Area Under the Precision-Recall Curve on the train set |
| TRAIN_LOOCV_SEN | Leave-One-Out cross validation Sensitivity on the train set |
| TRAIN_LOOCV_SPE | Leave-One-Out cross validation Specificity on the train set |
| TRAIN_LOOCV_MCC | Leave-One-Out cross validation Matthews Correlation Coefficient on the train set |
| TRAIN_LOOCV_MAE | Leave-One-Out cross validation Mean Absolute Error on the train set |
| TRAIN_LOOCV_BER | Leave-One-Out cross validation Balanced Error Rate on the train set |
| TRAIN_RH_ACC | Repeated holdout Accuracy on the train set |
| TRAIN_RH_AUC | Repeated holdout Area Under the Curve on the train set |
| TRAIN_RH_AUPRC | Repeated holdout Area Under the Precision-Recall Curve on the train set |
| TRAIN_RH_SEN | Repeated holdout Sensitivity on the train set |
| TRAIN_RH_SPE | Repeated holdout Specificity on the train set |
| TRAIN_RH_MCC | Repeated holdout Matthews Correlation Coefficient on the train set |
| TRAIN_RH_MAE | Repeated holdout Mean Absolute Error on the train set |
| TRAIN_RH_BER | Repeated holdout Balanced Error Rate on the train set |
| TRAIN_BS_ACC | Bootstrap Accuracy on the train set |
| TRAIN_BS_AUC | Bootstrap Area Under the Curve on the train set |
| TRAIN_BS_AUPRC | Bootstrap Area Under the Precision-Recall Curve on the train set |
| TRAIN_BS_SEN | Bootstrap Sensitivity on the train set |
| TRAIN_BS_SPE | Bootstrap Specificity on the train set |
| TRAIN_BS_MCC | Bootstrap Matthews Correlation Coefficient on the train set |
| TRAIN_BS_MAE | Bootstrap Mean Absolute Error on the train set |
| TRAIN_BS_BER | Bootstrap Balanced Error Rate on the train set |
| TRAIN_BS.632+ | Bootstrap .632+ rule |
| TEST_ACC | Evaluation Accuracy on the test set |
| TEST_AUC | Evaluation Area Under the Curve on the test set |
| TEST_AUPRC | Evaluation Area Under the Precision-Recall Curve on the test set |
| TEST_SEN | Evaluation Sensitivity on the test set |
| TEST_SPE | Evaluation Specificity on the test set |
| TEST_MCC | Evaluation Matthews Correlation Coefficient on the test set |
| TEST_MAE | Evaluation Mean Absolute Error on the test set |
| TEST_BER | Evaluation Balanced Error Rate on the test set |
| TRAIN_TEST_RH_ACC | Repeated holdout Accuracy on the merged train and test sets |
| TRAIN_TEST_RH_AUC | Repeated holdout Area Under the Curve on the merged train and test sets |
| TRAIN_TEST_RH_AUPRC | Repeated holdout Area Under the Precision-Recall Curve on the merged train and test sets |
| TRAIN_TEST_RH_SEN | Repeated holdout Sensitivity on the merged train and test sets |
| TRAIN_TEST_RH_SPE | Repeated holdout Specificity on the merged train and test sets |
| TRAIN_TEST_RH_MCC | Repeated holdout Matthews Correlation Coefficient on the merged train and test sets |
| TRAIN_TEST_RH_MAE | Repeated holdout Mean Absolute Error on the merged train and test sets |
| TRAIN_TEST_RH_BER | Repeated holdout Balanced Error Rate on the merged train and test sets |
| TRAIN_TEST_BS_ACC | Bootstrap Accuracy on the merged train and test sets |
| TRAIN_TEST_BS_AUC | Bootstrap Area Under the Curve on the merged train and test sets |
| TRAIN_TEST_BS_AUPRC | Bootstrap Area Under the Precision-Recall Curve on the merged train and test sets |
| TRAIN_TEST_BS_SEN | Bootstrap Sensitivity on the merged train and test sets |
| TRAIN_TEST_BS_SPE | Bootstrap Specificity on the merged train and test sets |
| TRAIN_TEST_BS_MCC | Bootstrap Matthews Correlation Coefficient on the merged train and test sets |
| TRAIN_TEST_BS_MAE | Bootstrap Mean Absolute Error on the merged train and test sets |
| TRAIN_TEST_BS_BER | Bootstrap Balanced Error Rate on the merged train and test sets |
| TRAIN_TEST_BS_BER_BS.632+ | Bootstrap .632+ rule on the merged train and test sets |
| AVG_BER | Average of all calculated Balanced Error Rates |
| STD_BER | Standard deviation of the calculated Balanced Error Rates |
| AVG_MAE | Average of all calculated Mean Absolute Errors |
| STD_MAE | Standard deviation of the calculated Mean Absolute Errors |
| AVG_MCC | Average of all calculated Matthews Correlation Coefficients |
| STD_MCC | Standard deviation of the calculated Matthews Correlation Coefficients |
| AttributeList | Selected features. Use the option -bestmodel to generate a report and get the features' full names |

Note that all columns referring to a test set will be empty if no test set has been generated or provided.
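For reference, here is how the per-model classification metrics above are conventionally derived from a confusion matrix, together with a sketch of the bootstrap .632+ estimator. This illustrates the standard definitions only; it is not BioDiscML source code:

```python
# Illustrative sketch only: standard confusion-matrix metrics reported
# in the results file (binary classification case).
def metrics(tp, tn, fp, fn):
    sen = tp / (tp + fn)                     # Sensitivity (true positive rate)
    spe = tn / (tn + fp)                     # Specificity (true negative rate)
    mcc = (tp * tn - fp * fn) / (
        ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5)
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SEN": sen,
        "SPE": spe,
        "PPV": tp / (tp + fp),               # Positive predictive value
        "FDR": fp / (tp + fp),               # False discovery rate
        "BER": 1 - (sen + spe) / 2,          # Balanced error rate
        "MCC": mcc,
    }

def err_632plus(err_train, err_boot, gamma):
    """Bootstrap .632+ error estimate; gamma is the no-information error rate."""
    err_boot = min(err_boot, gamma)          # clip the bootstrap error at gamma
    r = ((err_boot - err_train) / (gamma - err_train)
         if gamma > err_train and err_boot > err_train else 0.0)
    w = 0.632 / (1 - 0.368 * r)              # weight grows with relative overfitting
    return (1 - w) * err_train + w * err_boot
```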
- {project_name}_d.{model_name}_{model_hyperparameters}_{feature_search_mode}.*details.txt

Detailed information about the model and its performance, with the full signature and correlated features.
- {project_name}_.{model_name}_{model_hyperparameters}_{feature_search_mode}.*features.csv

Features retained by the model, in csv. If a test set has been generated or provided, a file will be generated for:
  - the train set (*.train_features.csv)
  - both train and test sets (*all_features.csv)

- {project_name}_.{model_name}_{model_hyperparameters}_{feature_search_mode}.*corrFeatures.csv

Features retained by the model together with their correlated features, in csv. If a test set has been generated or provided, a file will be generated for:
  - the train set (*.train_corrFeatures.csv)
  - both train and test sets (*all_corrfeatures.csv)
- {project_name}_.{model_name}_{model_hyperparameters}_{feature_search_mode}.*roc.png

Bootstrap ROC curves (EXPERIMENTAL). Must be enabled in the configuration file. If a test set has been generated or provided, a ROC curve picture will be generated for both the train and test sets.

- {project_name}_.{model_name}_{model_hyperparameters}_{feature_search_mode}.*model

Serialized model, compatible with weka.