# Tutorial

## Random Small

We have provided a tutorial. In this first tutorial, we inspect datasets
reporting whether 500 fictitious individuals have taken one of 20 imaginary
drugs. We have included a pair of simulated omics datasets, with measurements
for each sample (individual). All these measurements were generated randomly,
but we have added 200 associations between different pairs of drugs and omics
features. Let us find them with MOVE!

### Workspace structure

First, we take a look at how to organize our data and configuration:

```
tutorial/
│
├── data/
│   ├── changes.small.txt             <- Ground-truth associations (200 links)
│   ├── random.small.drugs.tsv        <- Drug dataset (20 drugs)
│   ├── random.small.ids.tsv          <- Sample IDs (500 samples)
│   ├── random.small.proteomics.tsv   <- Proteomics dataset (200 proteins)
│   └── random.small.metagenomics.tsv <- Metagenomics dataset (1000 taxa)
│
└── config/                           <- Stores user configuration files
    ├── data/
    │   └── random_small.yaml         <- Configuration to read in the
    │                                    necessary data files
    ├── experiment/                   <- Configuration for experiments (e.g.,
    │   └── random_small__tune.yaml      tuning hyperparameters)
    │
    └── task/                         <- Configuration for tasks, such as
        │                                latent space analysis or identifying
        │                                associations using the t-test or
        │                                Bayesian approach
        ├── random_small__id_assoc_bayes.yaml
        ├── random_small__id_assoc_ttest.yaml
        └── random_small__latent.yaml
```

#### The data folder

All "raw" data files should be placed inside the same directory. These files
are TSVs (tab-separated values tables) containing discrete values (e.g., for
binary or categorical datasets) or continuous values.

Additionally, make sure each sample has an assigned ID, and provide an ID
table containing the list of all valid IDs (these IDs must appear in every
dataset).

#### The `config` folder

User-defined configuration must be stored in a `config` folder.
This folder
can contain `data` and `task` subfolders, which store the configuration for a
specific dataset or task.

Let us take a look at the configuration for our dataset. It is a YAML file
specifying: a default layout\*, the directories to look for raw data and store
intermediary and final output files, and the list of categorical and continuous
datasets we have.

```yaml
# DO NOT EDIT

defaults:
  - base_data

# FEEL FREE TO EDIT BELOW

raw_data_path: data/ # where raw data is stored
interim_data_path: interim_data/ # where intermediate files will be stored
results_path: results/ # where result files will be placed

sample_names: random.small.ids # names/IDs of each sample, must appear in the
                               # other datasets

categorical_inputs: # a list of categorical datasets and their weights
  - name: random.small.drugs

continuous_inputs: # a list of continuous-valued datasets and their weights
  - name: random.small.proteomics
  - name: random.small.metagenomics
```

<span style="font-size: 0.8em">\* We do not recommend changing `defaults`
unless you are sure of what you are doing.</span>

Similarly, the `task` folder contains YAML files to configure the tasks of
MOVE. In this tutorial, we provide two examples of running the method to
identify associations (using our t-test and Bayesian approaches) and an
example of performing latent space analysis.

For example, for the t-test approach (`random_small__id_assoc_ttest.yaml`), we
define the following values: batch size, number of refits, name of the dataset
to perturb, target perturbation value, configuration for the VAE model, and
configuration for the training loop.

```yaml
defaults:
  - identify_associations_ttest

batch_size: 10 # number of samples per batch in training loop

num_refits: 10 # number of times to refit (retrain) the model

target_dataset: random.small.drugs # dataset to perturb
target_value: 1 # value to change to
save_refits: True # whether to save refits to the interim folder

model: # model configuration
  num_hidden: # list of units in each hidden layer of the VAE encoder/decoder
    - 1000

training_loop: # training loop configuration
  lr: 1e-4 # learning rate
  num_epochs: 40 # number of epochs
```

Note that `random_small__id_assoc_bayes.yaml` looks fairly similar, but
declares a different `defaults` entry. This tells MOVE which algorithm to use!

### Running MOVE

#### Encoding data

Make sure you are in the parent directory of the `config` folder (in this
example, it is the `tutorial` folder), and proceed to run:

```bash
>>> cd tutorial
>>> move-dl data=random_small task=encode_data
```

:arrow_up: This command will encode the datasets. The `random.small.drugs`
dataset (defined in `config/data/random_small.yaml`) will be one-hot encoded,
whereas the other two omics datasets will be standardized.

#### Analyzing the latent space

Next, we will train a variational autoencoder and analyze how good it is at
reconstructing our input data and generating an informative latent space. Run:

```bash
>>> move-dl data=random_small task=random_small__latent
```

:arrow_up: This command will create four types of plots in the
`results/latent_space` folder:

- Loss curve shows the overall loss and each of its three components (the
  Kullback-Leibler divergence (KLD) term, the binary cross-entropy term, and
  the sum of squared errors term) over the training epochs.
- Reconstruction metrics boxplot shows a score (accuracy or cosine similarity
  for categorical and continuous datasets, respectively) per reconstructed
  dataset.
- Latent space scatterplot shows a reduced representation of the latent space.
  To generate this visualization, the latent space is reduced to two dimensions
  using t-SNE (or another user-defined algorithm, e.g., UMAP).
- Feature importance swarmplot displays the impact that perturbing a feature
  has on the latent space.

Additionally, TSV files corresponding to each plot will be generated. These can
be used, for example, to re-create the plots manually.

#### Identifying associations

The next step is to find associations between the drugs taken by each
individual and the omics features. Run:

```bash
>>> move-dl data=random_small task=random_small__id_assoc_ttest
```

:arrow_up: This command will create a `results_sig_assoc.tsv` file in
`results/identify_associations`, listing each pair of associated features and
the corresponding median p-value for such an association. There should be
~120 associations found.

:warning: Note that the value after `task=` matches the name of our
configuration file. We can create multiple configuration files (for example,
changing hyperparameters like the learning rate) and call them by name here.

If you want to run the Bayesian approach instead, run:

```bash
>>> move-dl data=random_small task=random_small__id_assoc_bayes
```

Again, it should generate similar results, with over 100 associations found.

Take a look at the `changes.small.txt` file and compare your results against
it. Did MOVE find any false positives?

#### Tuning the model's hyperparameters

Additionally, we can improve the reconstructions generated by MOVE by running
a grid search over a set of hyperparameters.

We define the hyperparameters we want to sweep in an `experiment` config file,
such as:

```yaml
# @package _global_

# Define the default configuration for the data and task (model and training)

defaults:
  - override /data: random_small
  - override /task: tune_model

# Configure which hyperparameters to vary
# This will run and log the metrics of 12 models (combinations of 3
# hyperparameters with 2-3 levels each: 2 * 2 * 3)

# Any field defined in the task configuration can be configured below.

hydra:
  mode: MULTIRUN
  sweeper:
    params:
      task.batch_size: 10, 50
      task.model.num_hidden: "[500],[1000]"
      task.training_loop.num_epochs: 40, 60, 100
```

The above configuration file will generate 12 combinations of batch size,
hidden layer size, and number of training epochs. Each model will then be
trained with one of these combinations, and the reconstruction metrics will be
recorded in a TSV file.
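As a quick sanity check on the sweep size: Hydra's basic sweeper launches one
run per element of the Cartesian product of the comma-separated levels listed
under `params`. The grid above can be enumerated in plain Python (this snippet
is illustrative only and is not part of MOVE):

```python
from itertools import product

# Hyperparameter levels copied from the sweep config above
batch_sizes = [10, 50]            # task.batch_size: 10, 50
hidden_layers = [[500], [1000]]   # task.model.num_hidden: "[500],[1000]"
epoch_counts = [40, 60, 100]      # task.training_loop.num_epochs: 40, 60, 100

# One run per element of the Cartesian product of all levels
grid = list(product(batch_sizes, hidden_layers, epoch_counts))
print(len(grid))  # 2 * 2 * 3 = 12 runs

for batch_size, num_hidden, num_epochs in grid:
    print(f"batch_size={batch_size} num_hidden={num_hidden} num_epochs={num_epochs}")
```

Each combination trains a separate model, so the number of runs grows
multiplicatively as hyperparameters or levels are added to the sweep.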