Data: Tabular Time Series Specialty: Endocrinology Laboratory: Blood Tests EHR: Demographics Diagnoses Medications Omics: Genomics Multi-omics Transcriptomics Wearable: Activity Clinical Purpose: Treatment Response Assessment Task: Biomarker Discovery

Switch to side-by-side view

--- a
+++ b/docs/source/tutorial/introduction.rst
@@ -0,0 +1,210 @@
+Introduction to MOVE
+====================
+
+This guide can help you get started with MOVE.
+
+About the MOVE pipeline
+-----------------------
+
+MOVE has the following four steps (also called tasks):
+
+1. **Encoding data**: taking input data and converting it into a format MOVE
+   can read.
+2. **Tuning the hyperparameters of the VAE model**: training multiple models to
+   find the set of hyperparameters that produces the best reconstructions or
+   most stable latent space.
+3. **Analyzing the latent space**: training a model and inspecting the latent
+   representation it creates.
+4. **Identifying associations**: using an ensemble of VAE to find associations
+   between the input datasets.
+
+Simulated dataset
+-----------------
+
+For this short tutorial, we provide `simulated dataset`_ (available from our
+GitHub repository). This dataset consists of pretend proteomics and
+metagenomcis measurements for 500 fictitious individuals. We also report
+whether these individuals have taken or not 20 imaginary drugs.
+
+All values were randomly generated, but we have added 200
+associations between different pairs of drugs and omics features. Let us find
+these associations with MOVE!
+
+.. _simulated dataset: https://download-directory.github.io/?url=https%3A%2F%2Fgithub.com%2FRasmussenLab%2FMOVE%2Ftree%2Fmain%2Ftutorial
+
+Workspace structure
+-------------------
+
+First, we take a look at how to organize our data and configuration:::
+
+    tutorial/
+    │
+    ├── data/
+    │   ├── changes.small.txt              <- Ground-truth associations (200 links)
+    │   ├── random.small.drugs.tsv         <- Drug dataset (20 drugs)
+    │   ├── random.small.ids.tsv           <- Sample IDs (500 samples)
+    │   ├── random.small.proteomics.tsv    <- Proteomics dataset (200 proteins)
+    │   └── random.small.metagenomics.tsv  <- Metagenomics dataset (1000 taxa)
+    │
+    └── config/                            <- Stores user configuration files
+        ├── data/
+        │   └── random_small.yaml          <- Configuration to read in the necessary
+        │                                     data files.
+        ├── experiment/                    <- Configuration for experiments (e.g.,
+        │   └── random_small__tune.yaml       for tuning hyperparameters).
+        │
+        └── task/                          <- Configuration for tasks: such as
+            |                                 latent space or identify associations
+            │                                 using the t-test or Bayesian approach
+            ├── random_small__id_assoc_bayes.yaml
+            ├── random_small__id_assoc_ttest.yaml
+            └── random_small__latent.yaml
+
+The data directory
+^^^^^^^^^^^^^^^^^^
+
+All "raw" data files should be placed inside the same directory. These files
+are TSVs (tab-separated value tables) containing discrete values (e.g., for
+binary or categorical datasets) or continuous values.
+
+Additionally, make sure each sample has an assigned ID and provide an ID
+table containing a list of all valid IDs (must appear in every dataset).
+
+The ``config`` directory
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Configuration is composed and managed by `Hydra`_.
+
+User-defined configuration must be stored in a ``config`` folder. This folder
+can contain a ``data`` and ``task`` folder to store the configuration files for
+your dataset and tasks.
+
+.. _`Hydra`: https://hydra.cc/
+
+Data configuration
+""""""""""""""""""
+
+Let us take a look at the configuration for our dataset. It is a YAML file,
+specifying: the directories to look for raw data and store intermediary and
+final output files, and the list of categorical and continuous datasets we
+have.
+
+.. literalinclude:: /../../tutorial/config/data/random_small.yaml
+    :language: yaml
+
+Note that we do not recommend changing the ``defaults`` field, otherwise the
+configuration file will not be properly recognized by MOVE.
+
+Task configuration
+""""""""""""""""""
+
+Similarly, the ``task`` folder contains YAML files to configure the tasks of
+MOVE. In this tutorial, we provided two examples for running the method to
+identify associations using our t-test and Bayesian approach, and an example to
+perform latent space analysis.
+
+For example, for the t-test approach (``random_small__id_assoc_ttest.yaml``),
+we define the following values: batch size, number of refits, name of dataset to
+perturb, target perturb value, configuration for VAE model, and configuration
+for training loop.
+
+.. literalinclude:: /../../tutorial/config/task/random_small__id_assoc_ttest.yaml
+    :language: yaml
+
+Note that the ``random_small__id_assoc_bayes.yaml`` looks pretty similar, but
+declares a different ``defaults``. This tells MOVE which algorithm to use!
+
+Running MOVE
+------------
+
+Encoding data
+^^^^^^^^^^^^^
+
+Make sure you are on the parent directory of the ``config`` folder (in this
+example, it is the ``tutorial`` folder), and proceed to run:
+
+.. code-block:: bash
+
+    >>> cd tutorial
+    >>> move-dl data=random_small task=encode_data
+
+|:arrow_up:| This command will encode the datasets. The ``random.small.drugs``
+dataset (defined in ``config/data/random_small.yaml``) will be one-hot encoded,
+whereas the other two omics datasets will be standardized. Encoded data will
+be placed in the intermediary folder defined in the
+:ref:`data config<Data configuration>`.
+
+|:loud_sound:| Every ``move-dl`` command will generate a ``logs`` folder to
+store log files timestamping the program's current doings.
+
+Tuning the model's hyperparameters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Once the data has been encoded, we can proceed with the first step of our
+pipeline: tuning the hyperparameters of our deep learning model. This process
+can be time-consuming, because several models will be trained and tested. For
+this short tutorial, you may choose to skip it and proceed to
+:ref:`analyze the latent space<Analyzing the latent space>`.
+
+Analyzing the latent space
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Next, we will train a variational autoencoder and analyze how good it is at
+reconstructing our input data and generating an informative latent space. Run:
+
+.. code-block:: bash
+
+    >>> move-dl data=random_small task=random_small__latent
+
+|:arrow_up:| This command will create a ``latent_space`` directory in the
+results folder defined in the :ref:`data config<Data configuration>`. This
+folder will contain the following plots:
+
+* **Loss curve** shows the overall loss, KLD term, binary cross-entropy term,
+  and sum of squared errors term over number of training epochs.
+* **Reconstructions metrics boxplot** shows a score (accuracy or cosine
+  similarity for categorical and continuous datasets, respectively) per
+  reconstructed dataset.
+* **Latent space scatterplot** shows a reduced representation of the latent
+  space. To generate this visualization, the latent space is reduced to two
+  dimensions using TSNE (or another user-defined algorithm, e.g., UMAP).
+* **Feature importance swarmplot** displays the impact perturbing a feature has
+  on the latent space.
+
+Additionally, TSV files corresponding to each plot will be generated. These can
+be used, for example, to re-create the plots manually or with different
+styling.
+
+Identifying associations
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Next step is to find associations between the drugs taken by each individual
+and the omics features. Run:
+
+.. code-block:: bash
+
+    >>> move-dl data=random_small task=random_small__id_assoc_ttest
+
+|:arrow_up:| This command will create a ``results_sig_assoc.tsv`` file, listing
+each pair of associated features and the corresponding median p-value for such
+association. There should be ~120 associations found. Due to the nature of the
+method, this number may slightly fluctuate.
+
+|:warning:| Note that the value after ``task=`` matches the name of our
+configuration file. We can create multiple configuration files (for example,
+changing hyperparameters like learning rate) and call them by their name here.
+
+|:stopwatch:| This command takes approximately 45 min to run on a work laptop
+(Intel Core i7-10610U @ 1.80 GHz, 32 GB RAM). You can track the progress by
+checking the corresponding log file created in the ``logs`` folder.
+
+If you want to try the Bayesian approach instead, run:
+
+.. code-block:: bash
+
+    >>> move-dl data=random_small task=random_small__id_assoc_bayes
+
+Again, it should generate similar results with over 120 associations known.
+
+Take a look at the ``changes.small.txt`` file and compare your results against
+it. Did MOVE find any false positives?