# BioDiscML

Large-scale automatic feature selection for biomarker discovery in high-dimensional
OMICs data

## Short description
Automates the execution of many machine learning algorithms across various
optimization and evaluation procedures to identify the best model and signature.

## Description
The identification of biomarker signatures in omics molecular profiling is an
important challenge for outcome prediction in a precision medicine context, such as
patient disease susceptibility, diagnosis, prognosis and treatment response. To
identify these signatures we present BioDiscML (Biomarker Discovery by Machine
Learning), a tool that automates the analysis of complex biological datasets
using machine learning methods. From a collection of samples and their associated
characteristics, i.e. the biomarkers (e.g. gene expression, protein levels,
clinico-pathological data), the goal of BioDiscML is to produce a minimal subset
of biomarkers and a model that efficiently predicts a specified outcome. To
this purpose, BioDiscML uses a large variety of machine learning algorithms to
select the best combination of biomarkers for predicting either a categorical or
a continuous outcome, including from highly unbalanced datasets. Finally, BioDiscML
also retrieves correlated biomarkers not included in the final model, to better
understand the signature. The software automates all machine learning steps,
including data pre-processing, feature selection, model selection, and performance
evaluation.
https://github.com/mickaelleclercq/BioDiscML/

To visualize your results, see also BioDiscViz
(https://gitlab.com/SBouirdene/biodiscviz.git), which also includes a consensus
feature search.

Full manuscript: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6532608/

## Requirements
Java 8 (https://www.java.com/en/download/)

## Program usage
BioDiscML can be run either with a config file or directly from the command line.

### By config file
Before executing BioDiscML, a config file must be created. Use the template to
create your own; everything is detailed in the config.conf file. Examples are
available in the Test_datasets folder at:
https://github.com/mickaelleclercq/BioDiscML/tree/master/release/Test_datasets

#### Train a new model
```Bash
java -jar biodiscml.jar -config config.conf -train
```
Quick-start config file example (see the release/Test_datasets folder). This
configuration takes myData.csv as input and names the project myProjectName.
Sampling (by default 2/3 of the samples for training and 1/3 for testing) is
performed before the classification procedure, which predicts the myOutcome
class. One best model will be selected, ranked by its repeated holdout MCC
(numberOfBestModelsSortingMetric=TRAIN_TEST_RH_MCC).
config.conf file example:
```
project=myProjectName
trainFile=myData.csv
sampling=true
doClassification=true
classificationClassName=myOutcome
numberOfBestModels=1
numberOfBestModelsSortingMetric=TRAIN_TEST_RH_MCC
```

#### Resume an execution
Just add -resumeTraining=true to the command:
```Bash
java -jar biodiscml.jar -config config.conf -train -resumeTraining=true
```

#### Choose best model(s)
```Bash
java -jar biodiscml.jar -config config.conf -bestmodel
```
Best model selection can be executed once training has completed, has been
stopped, or is even still running. This command reads the results file; best
models are selected according to the strategy provided in the config file. You
can also choose your own models manually, by opening the results file in a
spreadsheet (Excel-like) program and ordering the models by your favorite
metrics or filters. Each model has an identifier (modelID) that you can pass to
the command. Example:
```Bash
java -jar biodiscml.jar -config config.conf -bestmodel modelID_1 modelID_2
```
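If you prefer the command line to a spreadsheet, the snippet below is only a
minimal sketch of such a manual shortlist. The results file name, the position of
the model ID in the first column, and the naive comma parsing (which breaks on
quoted fields containing commas) are assumptions; adapt them to your project.
```Bash
# Hypothetical shortlist: print the IDs of the 5 models with the highest
# TRAIN_TEST_RH_MCC. Assumes a comma-separated results file named
# myProjectName_c.results.csv whose first column is the model ID.
results=myProjectName_c.results.csv
# Find the column number of the metric from the header line.
col=$(head -1 "$results" | tr ',' '\n' | grep -nx 'TRAIN_TEST_RH_MCC' | cut -d: -f1)
# Sort the data rows by that column (numeric, descending) and keep the top 5 IDs.
tail -n +2 "$results" | sort -t',' -k"$col","$col" -nr | cut -d',' -f1 | head -5
```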
#### Predict new data
```Bash
java -jar biodiscml.jar -config config.conf -predict
```
Once the best model has been obtained, you can predict new data or evaluate a
blind test set that you put aside yourself before training. The file should have
the same format and structure as the training input files, and it must contain at
least all the features of the selected best model's signature. Features present
in the new data file but absent from the model's signature are simply ignored
during prediction. If a column with the class to predict is present, BioDiscML
will also report error statistics.
config.conf file example:
```
project=myProjectName
newDataFile=myNewData.csv
doClassification=true
classificationClassName=class
modelFile=myBestModel.model
```
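Putting the previous sections together, a typical end-to-end run re-uses the same
config file at each step; the three commands below are simply the ones described
above, shown in sequence.
```Bash
# Train all models, extract the best one(s), then apply the selected model to
# new data (newDataFile and modelFile being set in config.conf).
java -jar biodiscml.jar -config config.conf -train
java -jar biodiscml.jar -config config.conf -bestmodel
java -jar biodiscml.jar -config config.conf -predict
```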
### By command line
The same parameters as in the config file can be passed on the command line.
Example:
```Bash
time java -jar biodiscml.jar -train -project=myProject -excluded=excludedColumn \
-doClassification=true -classificationClassName=class -trainFile=data.csv \
-debug=true -bootstrapFolds=10 -loocv=false -cpus=10 -computeBestModel=false \
-classificationFastWay=true -ccmd=bayes.AveragedNDependenceEstimators.A1DE -F 1 -M 1.0 -W
```
Note that the option -ccmd must stay at the end of the command line when
classifier parameters follow it.


## Output files
Note: {project_name} is set in the config.conf file

- {project_name}_a.*

A csv file and a copy in arff format (the weka input format) are created here.
They contain the merged data of the input files, with some adaptations.

- {project_name}_b.*

A csv file and a copy in arff format (the weka input format) are also created
here. They are produced after feature ranking and are already a subset of
{project_name}_a.*. For a categorical class, feature ranking is performed by
information gain; features having an information gain below 0.0001 are discarded.
For a numerical class, RELIEFF is used; only the best 1000 features, or those
having a score greater than 0.0001, are kept.

- {project_name}_c.*results.csv

Results file: a summary of all trained models with their evaluation metrics and
selected attributes. Use the -bestmodel command to extract models. The column
indices in the selected attributes column correspond to the columns of the
{project_name}_b.*csv file.
For each model, we perform various evaluations, summarized in this table:

| Header | Description |
| ------------ | ------------ |
| ID | Model unique identifier. Can be passed as an argument for best model selection |
| Classifier | Machine learning classifier name |
| Options | Classifier hyperparameter options |
| OptimizedValue | Optimized criterion used for the feature selection procedure |
| SearchMode | Type of feature selection procedure: <br>- Forward stepwise selection (F)<br>- Backward stepwise selection (B)<br>- Forward stepwise selection and backward stepwise elimination (FB)<br>- Backward stepwise selection and forward stepwise elimination (BF)<br>- "top k" features |
| nbrOfFeatures | Number of features in the signature |
| TRAIN_10CV_ACC | 10-fold cross-validation Accuracy on the train set |
| TRAIN_10CV_AUC | 10-fold cross-validation Area Under The Curve on the train set |
| TRAIN_10CV_AUPRC | 10-fold cross-validation Area Under the Precision Recall Curve on the train set |
| TRAIN_10CV_SEN | 10-fold cross-validation Sensitivity on the train set |
| TRAIN_10CV_SPE | 10-fold cross-validation Specificity on the train set |
| TRAIN_10CV_MCC | 10-fold cross-validation Matthews Correlation Coefficient on the train set |
| TRAIN_10CV_MAE | 10-fold cross-validation Mean Absolute Error on the train set |
| TRAIN_10CV_BER | 10-fold cross-validation Balanced Error Rate on the train set |
| TRAIN_10CV_FPR | 10-fold cross-validation False Positive Rate on the train set |
| TRAIN_10CV_FNR | 10-fold cross-validation False Negative Rate on the train set |
| TRAIN_10CV_PPV | 10-fold cross-validation Positive Predictive Value on the train set |
| TRAIN_10CV_FDR | 10-fold cross-validation False Discovery Rate on the train set |
| TRAIN_10CV_Fscore | 10-fold cross-validation F-score on the train set |
| TRAIN_10CV_kappa | 10-fold cross-validation Kappa on the train set |
| TRAIN_matrix | 10-fold cross-validation confusion matrix on the train set |
| TRAIN_LOOCV_ACC | Leave-one-out cross-validation Accuracy on the train set |
| TRAIN_LOOCV_AUC | Leave-one-out cross-validation Area Under The Curve on the train set |
| TRAIN_LOOCV_AUPRC | Leave-one-out cross-validation Area Under the Precision Recall Curve on the train set |
| TRAIN_LOOCV_SEN | Leave-one-out cross-validation Sensitivity on the train set |
| TRAIN_LOOCV_SPE | Leave-one-out cross-validation Specificity on the train set |
| TRAIN_LOOCV_MCC | Leave-one-out cross-validation Matthews Correlation Coefficient on the train set |
| TRAIN_LOOCV_MAE | Leave-one-out cross-validation Mean Absolute Error on the train set |
| TRAIN_LOOCV_BER | Leave-one-out cross-validation Balanced Error Rate on the train set |
| TRAIN_RH_ACC | Repeated holdout Accuracy on the train set |
| TRAIN_RH_AUC | Repeated holdout Area Under The Curve on the train set |
| TRAIN_RH_AUPRC | Repeated holdout Area Under the Precision Recall Curve on the train set |
| TRAIN_RH_SEN | Repeated holdout Sensitivity on the train set |
| TRAIN_RH_SPE | Repeated holdout Specificity on the train set |
| TRAIN_RH_MCC | Repeated holdout Matthews Correlation Coefficient on the train set |
| TRAIN_RH_MAE | Repeated holdout Mean Absolute Error on the train set |
| TRAIN_RH_BER | Repeated holdout Balanced Error Rate on the train set |
| TRAIN_BS_ACC | Bootstrap Accuracy on the train set |
| TRAIN_BS_AUC | Bootstrap Area Under The Curve on the train set |
| TRAIN_BS_AUPRC | Bootstrap Area Under the Precision Recall Curve on the train set |
| TRAIN_BS_SEN | Bootstrap Sensitivity on the train set |
| TRAIN_BS_SPE | Bootstrap Specificity on the train set |
| TRAIN_BS_MCC | Bootstrap Matthews Correlation Coefficient on the train set |
| TRAIN_BS_MAE | Bootstrap Mean Absolute Error on the train set |
| TRAIN_BS_BER | Bootstrap Balanced Error Rate on the train set |
| TRAIN_BS.632+ | Bootstrap .632+ rule |
| TEST_ACC | Evaluation Accuracy on the test set |
| TEST_AUC | Evaluation Area Under The Curve on the test set |
| TEST_AUPRC | Evaluation Area Under the Precision Recall Curve on the test set |
| TEST_SEN | Evaluation Sensitivity on the test set |
| TEST_SPE | Evaluation Specificity on the test set |
| TEST_MCC | Evaluation Matthews Correlation Coefficient on the test set |
| TEST_MAE | Evaluation Mean Absolute Error on the test set |
| TEST_BER | Evaluation Balanced Error Rate on the test set |
| TRAIN_TEST_RH_ACC | Repeated holdout Accuracy on the merged train and test sets |
| TRAIN_TEST_RH_AUC | Repeated holdout Area Under The Curve on the merged train and test sets |
| TRAIN_TEST_RH_AUPRC | Repeated holdout Area Under the Precision Recall Curve on the merged train and test sets |
| TRAIN_TEST_RH_SEN | Repeated holdout Sensitivity on the merged train and test sets |
| TRAIN_TEST_RH_SPE | Repeated holdout Specificity on the merged train and test sets |
| TRAIN_TEST_RH_MCC | Repeated holdout Matthews Correlation Coefficient on the merged train and test sets |
| TRAIN_TEST_RH_MAE | Repeated holdout Mean Absolute Error on the merged train and test sets |
| TRAIN_TEST_RH_BER | Repeated holdout Balanced Error Rate on the merged train and test sets |
| TRAIN_TEST_BS_ACC | Bootstrap Accuracy on the merged train and test sets |
| TRAIN_TEST_BS_AUC | Bootstrap Area Under The Curve on the merged train and test sets |
| TRAIN_TEST_BS_AUPRC | Bootstrap Area Under the Precision Recall Curve on the merged train and test sets |
| TRAIN_TEST_BS_SEN | Bootstrap Sensitivity on the merged train and test sets |
| TRAIN_TEST_BS_SPE | Bootstrap Specificity on the merged train and test sets |
| TRAIN_TEST_BS_MCC | Bootstrap Matthews Correlation Coefficient on the merged train and test sets |
| TRAIN_TEST_BS_MAE | Bootstrap Mean Absolute Error on the merged train and test sets |
| TRAIN_TEST_BS_BER | Bootstrap Balanced Error Rate on the merged train and test sets |
| TRAIN_TEST_BS_BER_BS.632+ | Bootstrap .632+ rule on the merged train and test sets |
| AVG_BER | Average of all calculated Balanced Error Rates |
| STD_BER | Standard deviation of the calculated Balanced Error Rates |
| AVG_MAE | Average of all calculated Mean Absolute Errors |
| STD_MAE | Standard deviation of the calculated Mean Absolute Errors |
| AVG_MCC | Average of all calculated Matthews Correlation Coefficients |
| STD_MCC | Standard deviation of the calculated Matthews Correlation Coefficients |
| AttributeList | Selected features. Use the -bestmodel option to generate a report and get the features' full names |

Note that all columns referring to a test set will be empty if no test set has
been generated or provided.
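The -bestmodel report is the reliable way to obtain feature names, but for a quick
look-up the sketch below maps index numbers copied from an AttributeList entry back
to the header of the {project_name}_b csv file. The exact file name, the 1-based
column numbering and the comma-separated index list are assumptions; adjust them to
your data.
```Bash
# Hypothetical look-up: print the column names of myProjectName_b.csv (assumed
# name) for a comma-separated list of indices taken from an AttributeList entry,
# assuming 1-based column numbering.
indices="3,7,12"
head -1 myProjectName_b.csv | awk -F',' -v idx="$indices" '
  { n = split(idx, cols, ",")
    # For each requested index, print the index and the matching header field.
    for (i = 1; i <= n; i++) { c = cols[i]; printf "%s\t%s\n", c, $c } }'
```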
- {project_name}_d.{model_name}_{model_hyperparameters}_{feature_search_mode}.*details.txt

Detailed information about the model and its performance, with the full signature
and correlated features.


- {project_name}_.{model_name}_{model_hyperparameters}_{feature_search_mode}.*features.csv

Features retained by the model, in csv format.
If a test set has been generated or provided, a file will be generated for:
  - the train set (*.train_features.csv)
  - both the train and test sets (*all_features.csv)


- {project_name}_.{model_name}_{model_hyperparameters}_{feature_search_mode}.*corrFeatures.csv

Features retained by the model, together with their correlated features, in csv
format.
If a test set has been generated or provided, a file will be generated for:
  - the train set (*.train_corrFeatures.csv)
  - both the train and test sets (*all_corrfeatures.csv)

- {project_name}_.{model_name}_{model_hyperparameters}_{feature_search_mode}.*roc.png

Bootstrap ROC curves (EXPERIMENTAL). Must be enabled in the configuration file.
If a test set has been generated or provided, a ROC curve image will be generated
for both the train and test sets.


- {project_name}_.{model_name}_{model_hyperparameters}_{feature_search_mode}.*model

Serialized model, compatible with weka.