|
a/README.md |
|
b/README.md |
# MOVE (Multi-Omics Variational autoEncoder)

[](https://badge.fury.io/py/move-dl)
[](https://move-dl.readthedocs.io/?badge=latest)

The code in this repository can be used to run our Multi-Omics Variational
autoEncoder (MOVE) framework for integration of omics and clinical variables
spanning both categorical and continuous data. Our approach includes training
ensemble VAE models and using *in silico* perturbation experiments to identify
cross-omics associations. The manuscript has been published in Nature
Biotechnology:

> Allesøe, R.L., Lundgaard, A.T., Hernández Medina, R. *et al*. Discovery of
> drug–omics associations in type 2 diabetes with generative deep-learning
> models. *Nat Biotechnol* (2023). https://doi.org/10.1038/s41587-022-01520-x

We developed the method based on a Type 2 Diabetes cohort from the IMI DIRECT
project containing 789 newly diagnosed T2D patients. The cohort and data
creation are described in
[Koivula et al.](https://dx.doi.org/10.1007%2Fs00125-019-4906-1) and
[Wesolowska-Andersen et al.](https://doi.org/10.1016/j.xcrm.2021.100477). For
the analysis, we included the following data:

Multi-omics data sets:
```
Genomics
Transcriptomics
Proteomics
Metabolomics
Metagenomics
```

Other data sets:
```
Clinical data (blood measurements, imaging data, ...)
Questionnaire data (diet, etc.)
Accelerometer data
Medication data
```

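The data types listed above mix categorical and continuous variables, which have to be turned into numbers before a neural network can use them. The snippet below is a minimal sketch of one common way to do this, assuming one-hot encoding for categorical values and z-score standardization for continuous values. The helper names `one_hot` and `z_score` are hypothetical and are not part of the MOVE API; refer to the MOVE documentation for the encoding that the package actually applies.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def one_hot(values: Sequence[str]) -> List[List[int]]:
    """One-hot encode categorical values (hypothetical helper, not MOVE's API)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return [
        [1 if index[value] == i else 0 for i in range(len(categories))]
        for value in values
    ]


def z_score(values: Sequence[float]) -> List[float]:
    """Standardize continuous values to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    print(one_hot(["yes", "no", "yes"]))
    print(z_score([1.0, 2.0, 3.0]))
```

Columns of the one-hot matrix follow the alphabetical order of the categories, so the same input always yields the same layout.
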
# Installation

## Installing the MOVE package

MOVE is written in Python and can be installed using `pip`:

```bash
pip install move-dl
```

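After installing, it can be useful to confirm that the package is visible to the Python interpreter you intend to use. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and the only assumption is that the distribution is named `move-dl`, as in the `pip` command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "move-dl") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(f"move-dl {found}" if found else "move-dl is not installed")
```
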
## Requirements

MOVE should run in any environment where Python is available. The variational
autoencoder architecture is implemented in PyTorch.

The VAEs can be trained on CPUs alone or with GPU acceleration, so powerful
GPUs are not required. For instance, the tutorial data set, consisting of
simulated drug, metabolomics, and proteomics data for 500 individuals, runs
fine on a standard MacBook.

> Note: The pip installation of `move-dl` does not set up your local GPU automatically.

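Because the GPU is not configured automatically, you may want to check whether PyTorch can see one before training. The sketch below uses `torch.cuda.is_available()` and falls back to the CPU when PyTorch is missing or no CUDA device is found. `pick_device` is a hypothetical helper name, and how MOVE itself selects a device is not shown here.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and sees a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate; report CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"PyTorch would train on: {pick_device()}")
```
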
# The MOVE pipeline

MOVE has five or six steps:

```
01. Encode the data into a format that can be read by MOVE
02. Find the right architecture of the network, focusing on reconstruction accuracy
03. Find the right architecture of the network, focusing on stability of the model
04. Use the model determined in steps 02-03 to create and analyze the latent space
05. Identify associations between categorical and continuous datasets
    05a. Using an ensemble of VAEs with the t-test approach
    05b. Using an ensemble of VAEs with the Bayesian decision theory approach
06. If both 05a and 05b were run, select the overlap between them
```

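Step 06 amounts to intersecting the two result lists. The snippet below is a minimal sketch of that idea, assuming each association is represented as a pair of feature names. The `Association` type, the `overlap` helper, and the example values are all hypothetical and do not reflect MOVE's actual output format.

```python
from typing import Iterable, Set, Tuple

# (perturbed categorical feature, associated continuous feature); hypothetical layout
Association = Tuple[str, str]


def overlap(
    ttest_hits: Iterable[Association], bayes_hits: Iterable[Association]
) -> Set[Association]:
    """Keep only the associations reported by both approaches."""
    return set(ttest_hits) & set(bayes_hits)


if __name__ == "__main__":
    # Made-up associations, for illustration only.
    ttest_hits = [
        ("drug_A", "metabolite_1"),
        ("drug_A", "protein_7"),
        ("drug_B", "protein_2"),
    ]
    bayes_hits = [
        ("drug_A", "protein_7"),
        ("drug_B", "protein_2"),
        ("drug_C", "metabolite_4"),
    ]
    for drug, feature in sorted(overlap(ttest_hits, bayes_hits)):
        print(drug, feature)
```

Taking the intersection is the conservative choice: an association is kept only when both approaches support it.
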
## How to run MOVE

Please refer to our [**documentation**](https://move-dl.readthedocs.io/) for
examples and [tutorials](https://move-dl.readthedocs.io/tutorial/index.html)
on how to run MOVE.

Additionally, you can copy
[this notebook](https://colab.research.google.com/drive/1RFWNsuGymCmppPsElBvDuA9zRbGskKmi?usp=sharing)
and follow its instructions to get familiar with our pipeline.

# Data sets

## DIRECT data set

The data used in the notebooks are not available for testing due to the informed
consent given by study participants, the various national ethical approvals for
the study, and the European General Data Protection Regulation (GDPR).
Therefore, individual-level clinical and omics data cannot be transferred from
the centralized IMI-DIRECT repository. Requests for access to IMI-DIRECT
summary statistics data, including those presented here, can be made to
DIRECTdataaccess@Dundee.ac.uk. Requesters will be informed on how summary-level
data can be accessed via the DIRECT secure analysis platform following
submission of an appropriate application. The IMI-DIRECT data access policy is
available [here](https://directdiabetes.org).

## Simulated and publicly available data sets

We have therefore provided two data sets to test the workflow: a simulated
data set and a publicly available maize rhizosphere microbiome data set.

# Citation

To cite MOVE, use the following information:

Allesøe, R.L., Lundgaard, A.T., Hernández Medina, R. *et al*. Discovery of
drug–omics associations in type 2 diabetes with generative deep-learning models.
*Nat Biotechnol* (2023). https://doi.org/10.1038/s41587-022-01520-x