# MOVE (Multi-Omics Variational autoEncoder)

[](https://badge.fury.io/py/move-dl)
[](https://move-dl.readthedocs.io/?badge=latest)

The code in this repository can be used to run our Multi-Omics Variational
autoEncoder (MOVE) framework for integration of omics and clinical variables
spanning both categorical and continuous data. Our approach includes training
ensemble VAE models and using *in silico* perturbation experiments to identify
cross-omics associations. The manuscript has been published in Nature
Biotechnology:

> Allesøe, R.L., Lundgaard, A.T., Hernández Medina, R. *et al*. Discovery of
> drug–omics associations in type 2 diabetes with generative deep-learning
> models. *Nat Biotechnol* (2023). https://doi.org/10.1038/s41587-022-01520-x

We developed the method based on a type 2 diabetes (T2D) cohort from the IMI DIRECT
project containing 789 newly diagnosed T2D patients. The cohort and data
creation are described in
[Koivula et al.](https://dx.doi.org/10.1007%2Fs00125-019-4906-1) and
[Wesolowska-Andersen et al.](https://doi.org/10.1016/j.xcrm.2021.100477). For
the analysis we included the following data:

Multi-omics data sets:
```
Genomics
Transcriptomics
Proteomics
Metabolomics
Metagenomics
```

Other data sets:
```
Clinical data (blood measurements, imaging data, ...)
Questionnaire data (diet, etc.)
Accelerometer data
Medication data
```

# Installation

## Installing MOVE package

MOVE is written in Python and can be installed using `pip`:

```bash
pip install move-dl
```
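
To confirm that the installation succeeded, you can ask Python for the installed version of the `move-dl` distribution. The snippet below is a minimal sketch that relies only on the standard library; the helper name `installed_version` is hypothetical and not part of MOVE.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "move-dl" is the distribution name used in the pip command above.
found = installed_version("move-dl")
print(f"move-dl version: {found}" if found else "move-dl is not installed")
```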

## Requirements

MOVE should run in any environment where Python is available. The variational
autoencoder architecture is implemented in PyTorch.

The VAEs can be trained on CPUs alone or with GPU acceleration, so powerful
GPUs are not required.
For instance, the tutorial data set, consisting of simulated drug, metabolomics,
and proteomics data for 500 individuals, runs fine on a standard MacBook.

> Note: The pip installation of `move-dl` does not set up your local GPU automatically.
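
Because GPU setup is left to you, it can help to check whether PyTorch sees a CUDA device before training. The following is a minimal sketch; the helper name `pick_device` is hypothetical, and it only reports what PyTorch can see. How MOVE itself is told which device to use is not covered here, so please refer to the documentation for that.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and sees a CUDA GPU, else "cpu"."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed in this environment.
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Device visible to PyTorch: {pick_device()}")
```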

# The MOVE pipeline

MOVE has five or six steps, depending on whether both association methods are run:

```
01. Encode the data into a format that can be read by MOVE
02. Find the right architecture of the network focusing on reconstruction accuracy
03. Find the right architecture of the network focusing on stability of the model
04. Use the model, determined from steps 02-03, to create and analyze the latent space
05. Identify associations between categorical and continuous datasets
    05a. Using an ensemble of VAEs with the t-test approach
    05b. Using an ensemble of VAEs with the Bayesian decision theory approach
06. If both 05a and 05b were run, select the overlap between them
```
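
Step 06 amounts to a set intersection of the associations reported by the two methods. The sketch below illustrates the idea under stated assumptions: each association is represented as a hypothetical `(perturbed feature, target feature)` pair, and the feature names are made up. It does not reflect the actual output format of MOVE.

```python
from typing import Iterable, List, Tuple

Association = Tuple[str, str]  # (perturbed feature, associated feature); assumed layout


def overlap(
    ttest_hits: Iterable[Association], bayes_hits: Iterable[Association]
) -> List[Association]:
    """Keep only the associations reported by both approaches."""
    return sorted(set(ttest_hits) & set(bayes_hits))


# Made-up example results for steps 05a and 05b.
ttest_hits = [
    ("drug_A", "protein_1"),
    ("drug_A", "metabolite_7"),
    ("drug_B", "protein_3"),
]
bayes_hits = [
    ("drug_A", "protein_1"),
    ("drug_B", "protein_3"),
    ("drug_C", "protein_9"),
]

shared = overlap(ttest_hits, bayes_hits)
print(shared)  # [('drug_A', 'protein_1'), ('drug_B', 'protein_3')]
```

Requiring agreement between both approaches trades some sensitivity for a more conservative list of associations.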

## How to run MOVE

Please refer to our [**documentation**](https://move-dl.readthedocs.io/) for
examples and [tutorials](https://move-dl.readthedocs.io/tutorial/index.html)
on how to run MOVE.

Additionally, you can copy
[this notebook](https://colab.research.google.com/drive/1RFWNsuGymCmppPsElBvDuA9zRbGskKmi?usp=sharing)
and follow its instructions to get familiar with our pipeline.

# Data sets

## DIRECT data set

The data used in the notebooks are not available for testing due to the informed
consent given by study participants, the various national ethical approvals for
the study, and the European General Data Protection Regulation (GDPR).
Therefore, individual-level clinical and omics data cannot be transferred from
the centralized IMI-DIRECT repository. Requests for access to summary-level
IMI-DIRECT data, including those presented here, can be made to
DIRECTdataaccess@Dundee.ac.uk. Requesters will be informed on how summary-level
data can be accessed via the DIRECT secure analysis platform following
submission of an appropriate application. The IMI-DIRECT data access policy is
available [here](https://directdiabetes.org).

## Simulated and publicly available data sets

We have therefore provided two data sets to test the workflow: a simulated
data set and a publicly available maize rhizosphere microbiome data set.

# Citation

To cite MOVE, use the following information:

Allesøe, R.L., Lundgaard, A.T., Hernández Medina, R. *et al*. Discovery of
drug–omics associations in type 2 diabetes with generative deep-learning models.
*Nat Biotechnol* (2023). https://doi.org/10.1038/s41587-022-01520-x