About the method
================

MOVE is based on the VAE (variational autoencoder) model, a deep learning model
that transforms high-dimensional data into a lower-dimensional space, the
so-called latent representation. The autoencoder is made up of two neural
networks: an encoder, which compresses the input variables, and a decoder,
which tries to reconstruct the original input from the compressed
representation. In doing so, the model learns the structure of, and the
associations between, the input variables.

In `our publication`_, we used this type of model to integrate different data
modalities, including genomics, transcriptomics, proteomics, metabolomics,
microbiomes, medication data, diet questionnaires, and clinical measurements.
Once we obtained a trained model, we exploited the decoder network to identify
cross-omics associations.

Our approach consists of performing *in silico* perturbations of the original
data and using either univariate statistical methods or Bayesian decision
theory to identify significant differences between the reconstructions with
and without perturbation. Thus, we are able to detect associations between the
input variables.

.. _`our publication`: https://www.nature.com/articles/s41587-022-01520-x

.. image:: method/fig1.svg

VAE design
----------

The VAE was designed to support a variable number of fully connected hidden
layers in both the encoder and the decoder. Each hidden layer is followed by
batch normalization, dropout, and a leaky rectified linear unit (leaky ReLU).

To integrate different modalities, each dataset is reshaped and concatenated
into a single input matrix. Moreover, the reconstruction error is calculated on
a per-dataset basis: binary cross-entropy for binary and categorical datasets
and mean squared error for continuous datasets. Each error :math:`E_i` is then
multiplied by a given weight :math:`W_i`, and the weighted errors are summed to
form the loss function:

:math:`L = \sum_i W_i E_i + W_\textnormal{KL} D_\textnormal{KL}`

Note that the :math:`D_\textnormal{KL}` term (Kullback–Leibler divergence)
penalizes deviations of the latent representation from the standard normal
distribution. It is also subject to a weight :math:`W_\textnormal{KL}`, which
is gradually warmed up as the model is trained.
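
To make the loss concrete, here is a minimal sketch in Python with NumPy. The
function name, the split into exactly one categorical and one continuous
dataset, and the weight values are all hypothetical; the actual implementation
in the MOVE package differs.

```python
import numpy as np

def move_loss(recon_cat, target_cat, recon_con, target_con,
              mu, logvar, w_cat=1.0, w_con=1.0, w_kl=0.01):
    """Hypothetical sketch of the MOVE loss: a weighted sum of per-dataset
    reconstruction errors plus a weighted KL term."""
    eps = 1e-9
    # Binary cross-entropy for the categorical dataset (one-hot targets)
    bce = -np.mean(target_cat * np.log(recon_cat + eps)
                   + (1 - target_cat) * np.log(1 - recon_cat + eps))
    # Mean squared error for the continuous dataset
    mse = np.mean((recon_con - target_con) ** 2)
    # KL divergence between q(z|x) = N(mu, sigma^2) and N(0, 1)
    kld = -0.5 * np.mean(1 + logvar - mu ** 2 - np.exp(logvar))
    # Weighted sum; in training, w_kl would be warmed up over epochs
    return w_cat * bce + w_con * mse + w_kl * kld
```

With a perfect reconstruction and a latent distribution equal to the standard
normal, every term vanishes; any deviation of the latent mean or variance from
N(0, 1) adds to the loss through the KL term.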

Extracting associations
-----------------------

After determining the right set of hyperparameters, associations are extracted
by perturbing the original input data and passing it through an ensemble of
trained models. The reason for using an ensemble is that VAE models are
stochastic, so we need to ensure that the results we obtain are not a product
of chance.

We perturb categorical data by changing its value from one category to another
(e.g., drug status changed from "not received" to "received"). Then, we compare
the reconstructions generated from the original data and from the perturbed
data. To quantify this change, we propose two approaches: a *t*\ -test and
Bayes factors. Both are described below.

MOVE *t*\ -test
^^^^^^^^^^^^^^^

#. Perturb a variable in one dataset.
#. Repeat 10 times for 4 different latent space sizes:

    #. Train a VAE model with the original data.
    #. Obtain a reconstruction of the original data (baseline reconstruction).
    #. Obtain 10 additional reconstructions of the original data and calculate
       their difference from the first (baseline differences).
    #. Obtain a reconstruction of the perturbed data (perturbed reconstruction)
       and subtract it from the baseline reconstruction (perturbed difference).
    #. Compute a p-value between the baseline and perturbed differences with a
       *t*\ -test.

#. Correct the p-values using the Bonferroni method.
#. Select features that are significant (corrected p-value lower than 0.05).
#. Select the significant features that overlap in at least half of the refits
   and in 3 out of 4 architectures. These features are associated with the
   perturbed variable.
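
The overlap selection in the last step can be sketched as follows, assuming
the per-refit significance calls have been collected into a boolean array
(the array shape and function name are hypothetical):

```python
import numpy as np

def select_associated(significant, min_refit_frac=0.5, min_archs=3):
    """Select features that replicate across refits and architectures.

    `significant` is a hypothetical boolean array of shape
    (n_architectures, n_refits, n_features), True where a feature passed
    the Bonferroni-corrected t-test in that refit.
    """
    # Fraction of refits in which each feature was significant, per architecture
    refit_frac = significant.mean(axis=1)                   # (n_archs, n_features)
    # Architectures where the feature was significant in >= half of the refits
    arch_hits = (refit_frac >= min_refit_frac).sum(axis=0)  # (n_features,)
    # Keep features that replicate in at least 3 of the 4 architectures
    return np.flatnonzero(arch_hits >= min_archs)
```

Requiring replication across both refits and architectures is what guards
against associations that appear by chance in a single stochastic fit.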

MOVE Bayes
^^^^^^^^^^

#. Perturb a variable in one dataset.
#. Repeat 30 times:

    #. Train a VAE model with the original data.
    #. Obtain a reconstruction of the original data (baseline reconstruction).
    #. Obtain a reconstruction of the perturbed data (perturbed
       reconstruction).
    #. Record the difference between the baseline and perturbed
       reconstructions.

#. Compute the probability of the difference being greater than 0.
#. Compute the Bayes factor from this probability:
   :math:`K = \log p - \log (1 - p)`.
#. Sort the features by Bayes factor, from highest to lowest.
#. Compute the false discovery rate (FDR) as the cumulative evidence.
#. Select the features whose FDR is below the desired threshold (e.g., 0.05).
   These features are associated with the perturbed variable.
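
The last four steps can be sketched as follows. The array shape and function
name are hypothetical, and for simplicity this version scores only consistent
positive shifts; it is an illustration, not the MOVE implementation.

```python
import numpy as np

def bayes_select(diffs, fdr_threshold=0.05):
    """Hypothetical sketch of the Bayes-factor selection.

    `diffs` is an array of shape (n_refits, n_features) holding the
    differences between baseline and perturbed reconstructions recorded
    across the ensemble of refits.
    """
    eps = 1e-8
    # Probability that the difference is greater than 0, per feature
    p = np.clip((diffs > 0).mean(axis=0), eps, 1 - eps)
    # Log Bayes factor: K = log p - log(1 - p)
    k = np.log(p) - np.log(1 - p)
    # Sort features by Bayes factor, from highest to lowest
    order = np.argsort(-k)
    # FDR as the cumulative mean of (1 - p) down the sorted list
    fdr = np.cumsum(1 - p[order]) / np.arange(1, p.size + 1)
    # Keep features whose cumulative FDR stays below the threshold
    return order[fdr < fdr_threshold]
```

A feature whose reconstruction shifts in the same direction in nearly all
refits yields a probability near 1 and a large Bayes factor, so it appears
early in the sorted list with a low cumulative FDR.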