Diff of /docs/source/method.rst [000000] .. [c23b31]

--- a
+++ b/docs/source/method.rst
@@ -0,0 +1,98 @@
+About the method
+================
+
+MOVE is based on the VAE (variational autoencoder) model, a deep learning model
+that transforms high-dimensional data into a lower-dimensional space (the
+so-called latent representation). The autoencoder is made up of two neural
+networks: an encoder, which compresses the input variables into the latent
+representation; and a decoder, which tries to reconstruct the original input
+from that compressed representation. In doing so, the model learns the
+structure of and associations between the input variables.
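+
+As a rough illustration of this encode-sample-decode flow, a minimal VAE
+forward pass could look like the sketch below. This is not the MOVE
+implementation; the class name and layer sizes are placeholders.
+
+.. code-block:: python
+
+    import torch
+    from torch import nn
+
+    class TinyVAE(nn.Module):
+        """Minimal VAE: encoder compresses the input, decoder rebuilds it."""
+
+        def __init__(self, n_features: int, n_latent: int) -> None:
+            super().__init__()
+            # The encoder outputs both the mean and log-variance of the latent space
+            self.encoder = nn.Linear(n_features, n_latent * 2)
+            self.decoder = nn.Linear(n_latent, n_features)
+
+        def forward(self, x: torch.Tensor):
+            mu, logvar = self.encoder(x).chunk(2, dim=-1)
+            # Reparameterization trick: sample a latent representation z
+            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
+            return self.decoder(z), mu, logvar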
+
+In `our publication`_, we used this type of model to integrate different data
+modalities, including genomics, transcriptomics, proteomics, metabolomics,
+microbiomes, medication data, diet questionnaires, and clinical measurements.
+Once we obtained a trained model, we exploited the decoder network to identify
+cross-omics associations.
+
+Our approach consists of performing *in silico* perturbations of the original
+data and using either univariate statistical methods or Bayesian decision
+theory to identify significant differences between the reconstructions
+obtained with and without the perturbation. In this way, we are able to detect
+associations between the input variables.
+
+.. _`our publication`: https://www.nature.com/articles/s41587-022-01520-x
+
+.. image:: method/fig1.svg
+
+VAE design
+-----------
+
+The VAE was designed to support a configurable number of fully connected
+hidden layers in both the encoder and the decoder. Each hidden layer is
+followed by batch normalization, dropout, and a leaky rectified linear unit
+(leaky ReLU) activation.
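+
+As a sketch of one such hidden block (an approximation for illustration, not
+the exact MOVE code; the layer sizes and dropout rate below are arbitrary):
+
+.. code-block:: python
+
+    from torch import nn
+
+    def hidden_block(n_in, n_out, dropout=0.1):
+        """Fully connected layer followed by batch norm, dropout, and leaky ReLU."""
+        return nn.Sequential(
+            nn.Linear(n_in, n_out),
+            nn.BatchNorm1d(n_out),
+            nn.Dropout(dropout),
+            nn.LeakyReLU(),
+        )
+
+    # Example: an encoder stacking two hidden blocks before the latent layer
+    encoder = nn.Sequential(hidden_block(1000, 500), hidden_block(500, 100))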
+
+To integrate different modalities, each dataset is reshaped and the datasets
+are concatenated into a single input matrix. Moreover, the reconstruction
+error is calculated per dataset: binary cross-entropy for binary and
+categorical datasets, and mean squared error for continuous datasets. Each
+error :math:`E_i` is then multiplied by a given weight :math:`W_i`, and the
+weighted errors are summed to form the loss function:
+
+:math:`L = \sum_i W_i E_i + W_\textnormal{KL} D_\textnormal{KL}`
+
+Note that the :math:`D_\textnormal{KL}` term (Kullback–Leibler divergence)
+penalizes deviation of the latent representation from the standard normal
+distribution. It is also subject to a weight :math:`W_\textnormal{KL}`, which
+is gradually warmed up (increased) as the model is trained.
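+
+For illustration, this weighted loss could be computed roughly as below,
+assuming one categorical/binary dataset and one continuous dataset (the
+weights and the function name are placeholders, not MOVE's defaults):
+
+.. code-block:: python
+
+    import torch
+    import torch.nn.functional as F
+
+    def weighted_loss(recon_cat, target_cat, recon_con, target_con,
+                      mu, logvar, w_cat=1.0, w_con=1.0, w_kl=0.01):
+        """Sum of weighted per-dataset errors plus the weighted KL term."""
+        e_cat = F.binary_cross_entropy_with_logits(recon_cat, target_cat)
+        e_con = F.mse_loss(recon_con, target_con)
+        # KL divergence between N(mu, sigma^2) and the standard normal prior
+        d_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
+        return w_cat * e_cat + w_con * e_con + w_kl * d_kl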
+
+Extracting associations
+-----------------------
+
+After determining the right set of hyperparameters, associations are extracted
+by perturbing the original input data and passing it through an ensemble of
+trained models. The reason behind using an ensemble is that VAE models are
+stochastic, so we need to ensure that the results we obtain are not a product
+of chance.
+
+We perturb categorical data by changing its value from one category to
+another (e.g., drug status changed from "not received" to "received"). Then,
+we compare the reconstruction generated from the original data against the
+reconstruction generated from the perturbed data. For this comparison, we
+proposed two approaches: one based on the *t*\ -test and one based on Bayes
+factors. Both are described below.
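+
+The perturbation itself can be sketched as follows, assuming one-hot encoded
+categorical data (the array names and shapes are hypothetical):
+
+.. code-block:: python
+
+    import numpy as np
+
+    # Hypothetical one-hot dataset: samples x features x categories
+    cat_data = np.zeros((100, 1, 2))
+    cat_data[:, 0, 0] = 1  # every sample starts as "not received"
+
+    # Perturb: flip the drug-status feature to "received" for all samples
+    perturbed = cat_data.copy()
+    perturbed[:, 0, :] = [0, 1]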
+
+MOVE *t*\ -test
+^^^^^^^^^^^^^^^
+
+#. Perturb a variable in one dataset.
+#. Repeat 10 times for 4 different latent space sizes:
+
+    #. Train VAE model with original data.
+    #. Obtain reconstruction of original data (baseline reconstruction).
+    #. Obtain 10 additional reconstructions of original data and calculate
+       difference from the first (baseline difference).
+    #. Obtain reconstruction of perturbed data (perturbed reconstruction) and
+       subtract from baseline reconstruction (perturbed difference).
+    #. Compute a p-value between the baseline and perturbed differences with
+       a *t*\ -test.
+
+#. Correct p-values using Bonferroni method.
+#. Select features that are significant (p-value lower than 0.05).
+#. Select significant features that overlap in at least half of the refits and
+   3 out of 4 architectures. These features are associated with the
+   perturbed variable.
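+
+A minimal sketch of the p-value computation, Bonferroni correction, and
+significance filter above, for a single refit and architecture (the array
+shapes are made up; this is not MOVE's API):
+
+.. code-block:: python
+
+    import numpy as np
+    from scipy.stats import ttest_ind
+
+    # Hypothetical difference arrays with shape (draws, features)
+    baseline_diff = np.random.randn(10, 50)
+    perturbed_diff = np.random.randn(10, 50)
+
+    # One p-value per feature, then Bonferroni correction
+    pvalues = ttest_ind(baseline_diff, perturbed_diff, axis=0).pvalue
+    corrected = np.minimum(pvalues * pvalues.size, 1.0)
+    significant = corrected < 0.05  # features linked to the perturbation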
+
+MOVE Bayes
+^^^^^^^^^^
+
+#. Perturb a variable in one dataset.
+#. Repeat 30 times:
+
+    #. Train VAE model with original data.
+    #. Obtain reconstruction of original data (baseline reconstruction).
+    #. Obtain reconstruction of perturbed data (perturbed reconstruction).
+    #. Record difference between baseline and perturbed reconstruction.
+
+#. Compute, for each feature, the probability of the difference being greater
+   than 0 across the refits.
+#. Compute Bayes factor from probability: :math:`K = \log p - \log (1 - p)`.
+#. Sort probabilities by Bayes factor, from highest to lowest.
+#. Compute false discovery rate (FDR) as cumulative evidence.
+#. Select features whose FDR is below the desired threshold (e.g., 0.05).
+   These features are associated with the perturbed variable.
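+
+A rough sketch of the Bayes factor ranking and FDR-based selection above (the
+probability array here is random filler, and the cumulative FDR shown is one
+plausible implementation, not necessarily MOVE's exact computation):
+
+.. code-block:: python
+
+    import numpy as np
+
+    # Hypothetical per-feature probability that the perturbation had an effect
+    prob = np.random.uniform(0.5, 1.0, size=50)
+
+    bayes_k = np.log(prob) - np.log(1 - prob)  # K = log p - log(1 - p)
+    order = np.argsort(bayes_k)[::-1]          # highest to lowest evidence
+
+    # Cumulative FDR: running mean of (1 - p) over the ranked features
+    fdr = np.cumsum(1 - prob[order]) / np.arange(1, prob.size + 1)
+
+    # Report features whose cumulative FDR stays below the threshold
+    associated = order[fdr < 0.05]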
\ No newline at end of file