# IntegratedLearner - Integrated machine learning for multi-omics prediction and classification

The repository houses the **`IntegratedLearner`** R package for multi-omics prediction and classification. Both binary and continuous outcomes are supported.

## Dependencies

`IntegratedLearner` requires the following `R` package: `devtools` (for installation only). Please install it before installing `IntegratedLearner`, which can be done as follows (execute from within a fresh R session):

```r
install.packages("devtools")
library(devtools)
```

## Installation

Once the dependencies are installed, `IntegratedLearner` can be installed from GitHub and loaded using the following commands:

```r
devtools::install_github("himelmallick/IntegratedLearner")
library(IntegratedLearner)
```

### Run IntegratedLearner in a container

IntegratedLearner can be run in a containerized environment using either Docker or Podman. This significantly simplifies installation by ensuring that all the packages needed to run the provided vignette are already installed.

Refer to the installation instructions for your operating system for [Docker](https://docs.docker.com/engine/install/) or [Podman](https://podman.io/docs/installation). Then, in the terminal, run:

```bash
# Pull the container image from the registry
docker pull ghcr.io/himelmallick/integratedlearner:master

# Start the container named IntegratedLearner on port 8787
docker run -p 8787:8787 --name IntegratedLearner ghcr.io/himelmallick/integratedlearner:master
```

In the browser, navigate to `localhost:8787` and log in with the username `rstudio` and the password that was displayed in the terminal.

In the R console, run `setwd("/opt/pkg")`. You can now open any file in the IntegratedLearner repository.

Podman is compatible with Docker commands, so the `docker` command can be replaced with `podman`.

**NOTE**: if running rootless Podman, the correct username *might be* `root` instead of `rstudio`.

#### Map local directory to container directory

If you would like to make changes to the code, you need to *map* a local directory to a directory inside the container.
Otherwise, the modifications will be discarded when the container is removed. To do so, specify a volume option:

```bash
docker run -p 8787:8787 -v .:/IntegratedLearner --name IntegratedLearner ghcr.io/himelmallick/integratedlearner:master
```

In this command, we *map* the current directory (for example, the IntegratedLearner repository) to the `/IntegratedLearner`
directory inside the container. After logging in to RStudio Server, run `setwd("/IntegratedLearner")` in the console and modify the files.
The modifications made *inside* the container will be persistently saved in the current directory of the host system.

**NOTE**: if you are using SELinux (often enabled by default on Fedora) and you receive *Permission denied* errors when
accessing files inside the container, add the `:Z` flag to the volume option: `.:/IntegratedLearner:Z`.

## Features
* Supports early, late, and intermediate fusion with one line of code
* Dozens of algorithms: Random Forest, LASSO, Elastic Net, SVM, BART, and more
* Integrates with [SuperLearner](https://cran.r-project.org/web/packages/SuperLearner/index.html) to support even more options to quickly add custom algorithms to the ensemble
* Visualization using built-in plotting
* Hyperparameter tuning
* Screening algorithms
* Options to add new algorithms or change the default parameters for existing ones
* Nested cross-validation to estimate the performance of the integrated machine learner
* Multicore and multinode parallelization for scalability (**Not yet available**)

## Quickstart Guide

The package vignette demonstrates how to use the **IntegratedLearner** workflow to perform a multi-omics prediction and classification task. This vignette can be viewed online [here](http://htmlpreview.github.io/?https://github.com/himelmallick/IntegratedLearner/blob/master/vignettes/IntegratedLearner.html).

## Background

**`IntegratedLearner`** provides an integrated machine learning framework to 1) consolidate predictions by borrowing information across several longitudinal and cross-sectional omics data layers, 2) decipher the mechanistic role of individual omics features that can potentially lead to new sets of testable hypotheses, and 3) quantify uncertainty of the integration process. Three types of integration paradigms are supported: early, late, and intermediate. The software includes multiple ML models based on the [SuperLearner R package](https://cran.r-project.org/web/packages/SuperLearner/index.html) as well as several data exploration capabilities and visualization modules in a unified estimation framework.

At its core, the **`IntegratedLearner`** late fusion algorithm proceeds by 1) fitting a machine learning algorithm (```base_learner```) per layer to predict the outcome and 2) combining the layer-wise cross-validated predictions using a meta model (```meta_learner```) to generate final predictions based on all available data points. As a default choice, we recommend [Bayesian additive regression trees (BART)](https://arxiv.org/abs/0806.3286) as the base learner (```base_learner = 'SL.BART'```) and non-negative least squares/rank loss minimization as the meta model algorithm (```meta_learner = 'SL.nnls.auc'```). ```'SL.nnls.auc'``` fits non-negative least squares (for a continuous outcome) or rank loss minimization (for a binary outcome) on the layer-wise cross-validated predictions to generate the final predictions and quantify per-layer contributions.

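The two-step late fusion procedure described above can be illustrated with a minimal base R sketch. This is not the package's implementation: it assumes two simulated layers, uses `lm` as a stand-in base learner instead of BART, and solves the non-negative least squares step with box-constrained `optim`. All object names (`layers`, `cv_pred`, `nnls_loss`) are hypothetical.

```r
# Minimal late fusion sketch (illustrative only, not the package internals).
set.seed(1)
n <- 100

# Two simulated omics layers (hypothetical data); samples are in rows here.
layers <- list(
  layer1 = data.frame(a = rnorm(n), b = rnorm(n)),
  layer2 = data.frame(c = rnorm(n), d = rnorm(n))
)
y <- 2 * layers$layer1$a - layers$layer2$c + rnorm(n, sd = 0.5)

# Step 1: per-layer cross-validated predictions from a base learner.
# lm() stands in for the recommended 'SL.BART' base learner (assumption).
n_folds <- 5
folds <- sample(rep(seq_len(n_folds), length.out = n))
cv_pred <- sapply(layers, function(x) {
  pred <- numeric(n)
  for (k in seq_len(n_folds)) {
    test <- folds == k
    train <- data.frame(y = y[!test], x[!test, , drop = FALSE])
    fit <- lm(y ~ ., data = train)
    pred[test] <- predict(fit, newdata = x[test, , drop = FALSE])
  }
  pred
})

# Step 2: meta model. Non-negative least squares on the cross-validated
# predictions, solved with optim() under a lower bound of zero (assumption;
# the package's 'SL.nnls.auc' learner is not reproduced here).
nnls_loss <- function(w) sum((y - cv_pred %*% w)^2)
w <- optim(rep(0.5, ncol(cv_pred)), nnls_loss,
           method = "L-BFGS-B", lower = 0)$par
names(w) <- colnames(cv_pred)

# Normalized weights give a rough view of per-layer contributions.
layer_weights <- w / sum(w)
final_pred <- as.vector(cv_pred %*% w)
print(round(layer_weights, 2))
```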
In addition, >50 ML algorithms are supported. Note that all learners must be named with the prefix `SL.` followed by the name of the learner or the associated package (e.g., `SL.randomForest`, `SL.BART`, `SL.glmnet`, etc.). Please check out the [SuperLearner user manual](https://cran.r-project.org/web/packages/SuperLearner/vignettes/Guide-to-SuperLearner.html) for all available options.

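As an illustration of the `SL.` naming convention, the sketch below defines a custom learner following the wrapper structure used by `SuperLearner`: a function of `Y`, `X`, `newX`, `family`, and `obsWeights` that returns a list with `pred` and `fit`, plus a matching `predict` method. The name `SL.myglm` is hypothetical. Whether a given custom wrapper can be passed as `base_learner = 'SL.myglm'` should be checked against the package vignette; here the wrapper is only called directly.

```r
# Hypothetical custom learner following the SuperLearner wrapper convention.
SL.myglm <- function(Y, X, newX, family, obsWeights, ...) {
  X <- as.data.frame(X)
  newX <- as.data.frame(newX)
  # Fit a weighted GLM of the outcome on all columns of X.
  fit_glm <- glm(Y ~ ., data = X, family = family, weights = obsWeights)
  pred <- predict(fit_glm, newdata = newX, type = "response")
  fit <- list(object = fit_glm)
  class(fit) <- "SL.myglm"
  list(pred = pred, fit = fit)
}

# Matching predict method so the fitted object can score new data.
predict.SL.myglm <- function(object, newdata, ...) {
  predict(object$object, newdata = as.data.frame(newdata), type = "response")
}

# Direct call on simulated data (SuperLearner itself is not needed for this).
set.seed(1)
X <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
Y <- 1 + 2 * X$x1 + rnorm(50, sd = 0.1)
out <- SL.myglm(Y, X, newX = X, family = gaussian(), obsWeights = rep(1, 50))
head(out$pred)
```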
## Basic Usage

```
IntegratedLearner(feature_table, sample_metadata, feature_metadata, ...)
```

### Arguments

* ```feature_table```: Data frame representing concatenated multi-omics features with features in rows (```rownames```) and samples in columns (```colnames```).
* ```sample_metadata```: Data frame of sample-specific metadata. Must have a column named ```subjectID``` describing per-subject unique identifiers. For longitudinal designs, this variable is expected to have non-unique values. Additionally, a column named ```Y``` must be present, which is the outcome of interest (can be binary or continuous). Row names of ```sample_metadata``` must match the column names of ```feature_table```.
* ```feature_metadata```: Data frame containing feature-specific metadata. Must have a column named ```featureID``` describing per-feature unique identifiers. Additionally, if multiple omics layers are present, a column named ```featureType``` should describe the corresponding source layer (e.g. metagenomics, metabolomics, etc.). Row names must match those of ```feature_table```.
* ```feature_table_valid```: Optional feature table from validation set. Must have the exact same structure as `feature_table`.
* ```sample_metadata_valid```: Optional sample-specific metadata table from independent validation set. Must have the exact same structure as `sample_metadata`.
* ```family```: A character string representing one of the built-in families. Currently, ```gaussian()``` and ```binomial()``` are supported.
* ```folds```: Integer. Number of folds for cross-validation. Default is 5.
* ```base_learner```: Character string representing the name of the ```SL``` base-learner in stacked generalization and, optionally, for the joint learner (see example). Check out the [SL user manual](https://cran.r-project.org/web/packages/SuperLearner/vignettes/Guide-to-SuperLearner.html) for all available options. Default is ```'SL.BART'```.
* ```meta_learner```: Character string representing the name of the ```SL``` meta-learner in stacked generalization (see example). Check out the [SL user manual](https://cran.r-project.org/web/packages/SuperLearner/vignettes/Guide-to-SuperLearner.html) for all available options. Default is ```'SL.nnls.auc'```.
* ```run_concat```: Logical value representing whether a joint (concatenated) model should also be run (see tutorial). Default is TRUE.
* ```run_stacked```: Logical value representing whether a stacked model should also be run (see tutorial). Default is TRUE.
* ```print_learner```: Logical value representing whether a summary of fit should be printed. Default is TRUE.
* ```verbose```: Logical value for printing progress during the computation (helpful for debugging). Default is FALSE.
* ```...```: Additional arguments for `SL` tuning parameters.

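The input structure described in the arguments above can be sketched with a small simulated example. All data, layer names, and object names below are made up for illustration. The final call is left commented out because it requires the package to be installed; it is shown only to indicate how the pieces fit together.

```r
# Toy inputs matching the documented structure (simulated, illustrative only).
set.seed(42)
n_samples <- 20
feature_ids <- c(paste0("taxon", 1:5), paste0("metab", 1:5))
sample_ids <- paste0("sample", seq_len(n_samples))

# Features in rows, samples in columns.
feature_table <- as.data.frame(
  matrix(rnorm(length(feature_ids) * n_samples),
         nrow = length(feature_ids),
         dimnames = list(feature_ids, sample_ids))
)

# One row per sample; row names match the column names of feature_table.
sample_metadata <- data.frame(
  subjectID = paste0("subject", seq_len(n_samples)),
  Y = rbinom(n_samples, 1, 0.5),
  row.names = sample_ids
)

# One row per feature; row names match the row names of feature_table.
feature_metadata <- data.frame(
  featureID = feature_ids,
  featureType = rep(c("metagenomics", "metabolomics"), each = 5),
  row.names = feature_ids
)

# Sanity checks on the alignment required by the arguments above.
stopifnot(
  identical(colnames(feature_table), rownames(sample_metadata)),
  identical(rownames(feature_table), rownames(feature_metadata)),
  all(c("subjectID", "Y") %in% colnames(sample_metadata)),
  all(c("featureID", "featureType") %in% colnames(feature_metadata))
)

# With the package installed, a call could then look like this (not run here):
# fit <- IntegratedLearner(feature_table, sample_metadata, feature_metadata,
#                          folds = 5, base_learner = "SL.BART",
#                          meta_learner = "SL.nnls.auc", family = binomial())
```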
#### The IntegratedLearner workflow

![IntegratedLearner Workflow](https://github.com/himelmallick/IntegratedLearner/blob/master/man/figures/IntegratedLearner_Workflow.png)

#### Value

* ```SL_fits```: A list of ```SL``` prediction results from all individual base learners, the meta learner, and optionally the joint (concatenation) learner.
* ```model_fits```: A list of ```base_learner``` objects extracted from ```SL_fits``` for all individual base learners, the meta learner, and optionally the joint (concatenation) learner.
* ```X_train_layers```: Input feature matrices for individual layers for training data.
* ```Y_train```: Input response vector for training data.
* ```yhat.train```: Predictions for training data from all individual base learners, the meta learner, and optionally the joint (concatenation) learner.
* ```X_test_layers```: Input feature matrices for individual layers for test data. Available if ```feature_table_valid``` is provided.
* ```Y_test```: Input response vector for test data.
* ```weights```: Estimated layer weights in the meta model. Available if ```run_stacked=TRUE``` and ```meta_learner='SL.nnls.auc'```.
* ```AUC.train```/```R2.train```: AUC/R2 metrics calculated on training data using ```yhat.train``` and ```Y_train```.
* ```AUC.test```/```R2.test```: AUC/R2 metrics calculated on test data using ```yhat.test``` and ```Y_test```.
* ```...```: Additional elements containing information about the inputs.

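For orientation, AUC and R2 can be computed from an outcome vector and a prediction vector as in the base R sketch below. The exact definitions used inside the package are not stated here, so the rank-based AUC and the `1 - SSE/SST` form of R2 are assumptions, and `auc_rank` and `r_squared` are hypothetical helper names.

```r
# Rank-based (Mann-Whitney) AUC for a binary outcome coded 0/1.
auc_rank <- function(y, yhat) {
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(rank(yhat)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Coefficient of determination for a continuous outcome.
r_squared <- function(y, yhat) {
  1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
}

# Small worked examples on made-up vectors.
auc_example <- auc_rank(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))  # 0.75
r2_example <- r_squared(c(1, 2, 3), c(1, 2, 3))                 # 1
print(c(AUC = auc_example, R2 = r2_example))
```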
## Citation

If you use `IntegratedLearner` in your work, please cite the following:

Mallick H et al. (2024). [An Integrated Bayesian Framework for Multi-omics Prediction and Classification](https://onlinelibrary.wiley.com/doi/10.1002/sim.9953). *Statistics in Medicine* 43(5):983–1002.

## Issues

We are happy to troubleshoot any issues with the package. Please contact the maintainer via email or [open an issue](https://github.com/himelmallick/IntegratedLearner/issues) in the GitHub repository.

## Future Release

We are currently in the process of submitting
**`IntegratedLearner`** to [Bioconductor](https://www.bioconductor.org/). Please keep an eye out for a future release of **`IntegratedLearner`** as an R/Bioconductor package; this repository remains the development version of the package.