# BioDiscML

Large-scale automatic feature selection for biomarker discovery in high-dimensional
OMICs data

## Short description
Automates the execution of many machine learning algorithms across various
optimization and evaluation procedures to identify the best model and signature.

## Description
The identification of biomarker signatures in omics molecular profiling is an
important challenge for outcome prediction in a precision medicine context, such as
patient disease susceptibility, diagnosis, prognosis and treatment response. To
identify these signatures we present BioDiscML (Biomarker Discovery by Machine
Learning), a tool that automates the analysis of complex biological datasets
using machine learning methods. From a collection of samples and their associated
characteristics, i.e. the biomarkers (e.g. gene expression, protein levels,
clinico-pathological data), the goal of BioDiscML is to produce a minimal subset
of biomarkers and a model that efficiently predicts a specified outcome. To
this purpose, BioDiscML uses a large variety of machine learning algorithms to
select the best combination of biomarkers for predicting either a categorical or
a continuous outcome, including from highly unbalanced datasets. Finally, BioDiscML
also retrieves correlated biomarkers not included in the final model, to better
understand the signature. The software automates all machine learning steps,
including data pre-processing, feature selection, model selection, and performance
evaluation.
https://github.com/mickaelleclercq/BioDiscML/

To visualize your results, see also BioDiscViz
(https://gitlab.com/SBouirdene/biodiscviz.git), which also includes a consensus
feature search.

Full manuscript: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6532608/

## Requirements
Java 8 (https://www.java.com/en/download/)

## Program usage
BioDiscML can be run either with a config file or directly from the command line.

### By config file
Before executing BioDiscML, a config file must be created. Use the template to
create your own; everything is detailed in the config.conf file. Examples are
available in the Test_datasets folder at:
https://github.com/mickaelleclercq/BioDiscML/tree/master/release/Test_datasets

#### Train a new model
```Bash
java -jar biodiscml.jar -config config.conf -train
```
Quick-start config file example (see the release/Test_datasets folder). This
configuration takes myData.csv as input and names the project myProjectName.
Sampling (by default 2/3 of the samples for training and 1/3 for testing) is
performed before the classification procedure, which predicts the myOutcome
class. One best model will be selected, ranked by its repeated holdout MCC
(numberOfBestModelsSortingMetric=TRAIN_TEST_RH_MCC).
config.conf file example:
```
project=myProjectName
trainFile=myData.csv
sampling=true
doClassification=true
classificationClassName=myOutcome
numberOfBestModels=1
numberOfBestModelsSortingMetric=TRAIN_TEST_RH_MCC
```

#### Resume an execution
Just add -resumeTraining=true to the command:
```Bash
java -jar biodiscml.jar -config config.conf -train -resumeTraining=true
```

#### Choose best model(s)
```Bash
java -jar biodiscml.jar -config config.conf -bestmodel
```
Best model selection can be executed once training has completed, has been
stopped, or is even still running. This command reads the results file; best
models are selected according to the strategy provided in the config file. You
can also choose your own models manually, by opening the results file in a
spreadsheet (Excel-like) program and ordering the models by your favorite
metrics or filters. Each model has an identifier (modelID) that you can pass to
the command. Example:
```Bash
java -jar biodiscml.jar -config config.conf -bestmodel modelID_1 modelID_2
```
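If you prefer the command line to a spreadsheet, the snippet below is only a
minimal sketch of such a manual shortlist. The results file name, the position of
the model ID in the first column, and the naive comma parsing (which breaks on
quoted fields containing commas) are assumptions; adapt them to your project.
```Bash
# Hypothetical shortlist: print the IDs of the 5 models with the highest
# TRAIN_TEST_RH_MCC. Assumes a comma-separated results file named
# myProjectName_c.results.csv whose first column is the model ID.
results=myProjectName_c.results.csv
# Find the column number of the metric from the header line.
col=$(head -1 "$results" | tr ',' '\n' | grep -nx 'TRAIN_TEST_RH_MCC' | cut -d: -f1)
# Sort the data rows by that column (numeric, descending) and keep the top 5 IDs.
tail -n +2 "$results" | sort -t',' -k"$col","$col" -nr | cut -d',' -f1 | head -5
```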
#### Predict new data
```Bash
java -jar biodiscml.jar -config config.conf -predict
```
Once the best model has been obtained, you can predict new data or evaluate a
blind test set that you put aside yourself before training. The file should have
the same format and structure as the training input files, and it must contain at
least all the features of the selected best model's signature. Features present
in the new data file but absent from the model's signature are simply ignored
during prediction. If a column with the class to predict is present, BioDiscML
will also report error statistics.
config.conf file example:
```
project=myProjectName
newDataFile=myNewData.csv
doClassification=true
classificationClassName=class
modelFile=myBestModel.model
```
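Putting the previous sections together, a typical end-to-end run re-uses the same
config file at each step; the three commands below are simply the ones described
above, shown in sequence.
```Bash
# Train all models, extract the best one(s), then apply the selected model to
# new data (newDataFile and modelFile being set in config.conf).
java -jar biodiscml.jar -config config.conf -train
java -jar biodiscml.jar -config config.conf -bestmodel
java -jar biodiscml.jar -config config.conf -predict
```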
### By command line
The same parameters as in the config file can be passed on the command line.
Example:
```Bash
time java -jar biodiscml.jar -train -project=myProject -excluded=excludedColumn \
-doClassification=true -classificationClassName=class -trainFile=data.csv \
-debug=true -bootstrapFolds=10 -loocv=false -cpus=10 -computeBestModel=false \
-classificationFastWay=true -ccmd=bayes.AveragedNDependenceEstimators.A1DE -F 1 -M 1.0 -W
```
Note that the option -ccmd must stay at the end of the command line when
classifier parameters follow it.


## Output files
Note: {project_name} is set in the config.conf file

- {project_name}_a.*

A csv file and a copy in arff format (the weka input format) are created here.
They contain the merged data of the input files, with some adaptations.

- {project_name}_b.*

A csv file and a copy in arff format (the weka input format) are also created
here. They are produced after feature ranking and are already a subset of
{project_name}_a.*. For a categorical class, feature ranking is performed by
information gain; features having an information gain below 0.0001 are discarded.
For a numerical class, RELIEFF is used; only the best 1000 features, or those
having a score greater than 0.0001, are kept.

- {project_name}_c.*results.csv

Results file: a summary of all trained models with their evaluation metrics and
selected attributes. Use the -bestmodel command to extract models. The column
indices in the selected attributes column correspond to the columns of the
{project_name}_b.*csv file.
For each model, we perform various evaluations, summarized in this table:

| Header | Description |
| ------------ | ------------ |
| ID | Model unique identifier. Can be passed as an argument for best model selection |
| Classifier | Machine learning classifier name |
| Options | Classifier hyperparameter options |
| OptimizedValue | Optimized criterion used for the feature selection procedure |
| SearchMode | Type of feature selection procedure: <br>- Forward stepwise selection (F)<br>- Backward stepwise selection (B)<br>- Forward stepwise selection and backward stepwise elimination (FB)<br>- Backward stepwise selection and forward stepwise elimination (BF)<br>- "top k" features |
| nbrOfFeatures | Number of features in the signature |
| TRAIN_10CV_ACC | 10-fold cross-validation Accuracy on the train set |
| TRAIN_10CV_AUC | 10-fold cross-validation Area Under The Curve on the train set |
| TRAIN_10CV_AUPRC | 10-fold cross-validation Area Under the Precision Recall Curve on the train set |
| TRAIN_10CV_SEN | 10-fold cross-validation Sensitivity on the train set |
| TRAIN_10CV_SPE | 10-fold cross-validation Specificity on the train set |
| TRAIN_10CV_MCC | 10-fold cross-validation Matthews Correlation Coefficient on the train set |
| TRAIN_10CV_MAE | 10-fold cross-validation Mean Absolute Error on the train set |
| TRAIN_10CV_BER | 10-fold cross-validation Balanced Error Rate on the train set |
| TRAIN_10CV_FPR | 10-fold cross-validation False Positive Rate on the train set |
| TRAIN_10CV_FNR | 10-fold cross-validation False Negative Rate on the train set |
| TRAIN_10CV_PPV | 10-fold cross-validation Positive Predictive Value on the train set |
| TRAIN_10CV_FDR | 10-fold cross-validation False Discovery Rate on the train set |
| TRAIN_10CV_Fscore | 10-fold cross-validation F-score on the train set |
| TRAIN_10CV_kappa | 10-fold cross-validation Kappa on the train set |
| TRAIN_matrix | 10-fold cross-validation confusion matrix on the train set |
| TRAIN_LOOCV_ACC | Leave-one-out cross-validation Accuracy on the train set |
| TRAIN_LOOCV_AUC | Leave-one-out cross-validation Area Under The Curve on the train set |
| TRAIN_LOOCV_AUPRC | Leave-one-out cross-validation Area Under the Precision Recall Curve on the train set |
| TRAIN_LOOCV_SEN | Leave-one-out cross-validation Sensitivity on the train set |
| TRAIN_LOOCV_SPE | Leave-one-out cross-validation Specificity on the train set |
| TRAIN_LOOCV_MCC | Leave-one-out cross-validation Matthews Correlation Coefficient on the train set |
| TRAIN_LOOCV_MAE | Leave-one-out cross-validation Mean Absolute Error on the train set |
| TRAIN_LOOCV_BER | Leave-one-out cross-validation Balanced Error Rate on the train set |
| TRAIN_RH_ACC | Repeated holdout Accuracy on the train set |
| TRAIN_RH_AUC | Repeated holdout Area Under The Curve on the train set |
| TRAIN_RH_AUPRC | Repeated holdout Area Under the Precision Recall Curve on the train set |
| TRAIN_RH_SEN | Repeated holdout Sensitivity on the train set |
| TRAIN_RH_SPE | Repeated holdout Specificity on the train set |
| TRAIN_RH_MCC | Repeated holdout Matthews Correlation Coefficient on the train set |
| TRAIN_RH_MAE | Repeated holdout Mean Absolute Error on the train set |
| TRAIN_RH_BER | Repeated holdout Balanced Error Rate on the train set |
| TRAIN_BS_ACC | Bootstrap Accuracy on the train set |
| TRAIN_BS_AUC | Bootstrap Area Under The Curve on the train set |
| TRAIN_BS_AUPRC | Bootstrap Area Under the Precision Recall Curve on the train set |
| TRAIN_BS_SEN | Bootstrap Sensitivity on the train set |
| TRAIN_BS_SPE | Bootstrap Specificity on the train set |
| TRAIN_BS_MCC | Bootstrap Matthews Correlation Coefficient on the train set |
| TRAIN_BS_MAE | Bootstrap Mean Absolute Error on the train set |
| TRAIN_BS_BER | Bootstrap Balanced Error Rate on the train set |
| TRAIN_BS.632+ | Bootstrap .632+ rule |
| TEST_ACC | Evaluation Accuracy on the test set |
| TEST_AUC | Evaluation Area Under The Curve on the test set |
| TEST_AUPRC | Evaluation Area Under the Precision Recall Curve on the test set |
| TEST_SEN | Evaluation Sensitivity on the test set |
| TEST_SPE | Evaluation Specificity on the test set |
| TEST_MCC | Evaluation Matthews Correlation Coefficient on the test set |
| TEST_MAE | Evaluation Mean Absolute Error on the test set |
| TEST_BER | Evaluation Balanced Error Rate on the test set |
| TRAIN_TEST_RH_ACC | Repeated holdout Accuracy on the merged train and test sets |
| TRAIN_TEST_RH_AUC | Repeated holdout Area Under The Curve on the merged train and test sets |
| TRAIN_TEST_RH_AUPRC | Repeated holdout Area Under the Precision Recall Curve on the merged train and test sets |
| TRAIN_TEST_RH_SEN | Repeated holdout Sensitivity on the merged train and test sets |
| TRAIN_TEST_RH_SPE | Repeated holdout Specificity on the merged train and test sets |
| TRAIN_TEST_RH_MCC | Repeated holdout Matthews Correlation Coefficient on the merged train and test sets |
| TRAIN_TEST_RH_MAE | Repeated holdout Mean Absolute Error on the merged train and test sets |
| TRAIN_TEST_RH_BER | Repeated holdout Balanced Error Rate on the merged train and test sets |
| TRAIN_TEST_BS_ACC | Bootstrap Accuracy on the merged train and test sets |
| TRAIN_TEST_BS_AUC | Bootstrap Area Under The Curve on the merged train and test sets |
| TRAIN_TEST_BS_AUPRC | Bootstrap Area Under the Precision Recall Curve on the merged train and test sets |
| TRAIN_TEST_BS_SEN | Bootstrap Sensitivity on the merged train and test sets |
| TRAIN_TEST_BS_SPE | Bootstrap Specificity on the merged train and test sets |
| TRAIN_TEST_BS_MCC | Bootstrap Matthews Correlation Coefficient on the merged train and test sets |
| TRAIN_TEST_BS_MAE | Bootstrap Mean Absolute Error on the merged train and test sets |
| TRAIN_TEST_BS_BER | Bootstrap Balanced Error Rate on the merged train and test sets |
| TRAIN_TEST_BS_BER_BS.632+ | Bootstrap .632+ rule on the merged train and test sets |
| AVG_BER | Average of all calculated Balanced Error Rates |
| STD_BER | Standard deviation of the calculated Balanced Error Rates |
| AVG_MAE | Average of all calculated Mean Absolute Errors |
| STD_MAE | Standard deviation of the calculated Mean Absolute Errors |
| AVG_MCC | Average of all calculated Matthews Correlation Coefficients |
| STD_MCC | Standard deviation of the calculated Matthews Correlation Coefficients |
| AttributeList | Selected features. Use the -bestmodel option to generate a report and get the features' full names |

Note that all columns referring to a test set will be empty if no test set has
been generated or provided.
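The -bestmodel report is the reliable way to obtain feature names, but for a quick
look-up the sketch below maps index numbers copied from an AttributeList entry back
to the header of the {project_name}_b csv file. The exact file name, the 1-based
column numbering and the comma-separated index list are assumptions; adjust them to
your data.
```Bash
# Hypothetical look-up: print the column names of myProjectName_b.csv (assumed
# name) for a comma-separated list of indices taken from an AttributeList entry,
# assuming 1-based column numbering.
indices="3,7,12"
head -1 myProjectName_b.csv | awk -F',' -v idx="$indices" '
  { n = split(idx, cols, ",")
    # For each requested index, print the index and the matching header field.
    for (i = 1; i <= n; i++) { c = cols[i]; printf "%s\t%s\n", c, $c } }'
```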
- {project_name}_d.{model_name}_{model_hyperparameters}_{feature_search_mode}.*details.txt

Detailed information about the model and its performance, with the full signature
and correlated features.


- {project_name}_.{model_name}_{model_hyperparameters}_{feature_search_mode}.*features.csv

Features retained by the model, in csv format.
If a test set has been generated or provided, a file will be generated for:
  - the train set (*.train_features.csv)
  - both the train and test sets (*all_features.csv)


- {project_name}_.{model_name}_{model_hyperparameters}_{feature_search_mode}.*corrFeatures.csv

Features retained by the model, together with their correlated features, in csv
format.
If a test set has been generated or provided, a file will be generated for:
  - the train set (*.train_corrFeatures.csv)
  - both the train and test sets (*all_corrfeatures.csv)

- {project_name}_.{model_name}_{model_hyperparameters}_{feature_search_mode}.*roc.png

Bootstrap ROC curves (EXPERIMENTAL). Must be enabled in the configuration file.
If a test set has been generated or provided, a ROC curve image will be generated
for both the train and test sets.


- {project_name}_.{model_name}_{model_hyperparameters}_{feature_search_mode}.*model

Serialized model, compatible with weka.