# Tutorial

## Random Small

In this first tutorial, we inspect datasets reporting whether 500 fictitious
individuals have taken one of 20 imaginary drugs. We have also included a pair
of simulated omics datasets, with measurements for each sample (individual).
All these measurements were generated randomly, but we have added 200
associations between different pairs of drugs and omics features. Let us find
them with MOVE!

### Workspace structure

First, we take a look at how to organize our data and configuration:

```
tutorial/
│
├── data/
│   ├── changes.small.txt              <- Ground-truth associations (200 links)
│   ├── random.small.drugs.tsv         <- Drug dataset (20 drugs)
│   ├── random.small.ids.tsv           <- Sample IDs (500 samples)
│   ├── random.small.proteomics.tsv    <- Proteomics dataset (200 proteins)
│   └── random.small.metagenomics.tsv  <- Metagenomics dataset (1000 taxa)
│
└── config/                            <- Stores user configuration files
    ├── data/
    │   └── random_small.yaml          <- Configuration to read in the necessary
    │                                     data files.
    ├── experiment/                    <- Configuration for experiments (e.g.,
    │   └── random_small__tune.yaml       for tuning hyperparameters).
    │
    └── task/                          <- Configuration for tasks, such as
        │                                 latent space analysis or identifying
        │                                 associations using the t-test or
        │                                 Bayesian approach.
        ├── random_small__id_assoc_bayes.yaml
        ├── random_small__id_assoc_ttest.yaml
        └── random_small__latent.yaml
```

#### The `data` folder

All "raw" data files should be placed inside the same directory. These files
are TSVs (tab-separated value tables) containing discrete values (e.g., for
binary or categorical datasets) or continuous values.

Additionally, make sure each sample has an assigned ID, and provide an ID
table listing all valid IDs (each ID must appear in every dataset).
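
Before running MOVE, it can help to verify that every valid ID is present in
each dataset. The sketch below is not part of MOVE. It assumes each table has a
header row and stores the sample ID in its first column, and it uses small
in-memory tables (with hypothetical column names) instead of the tutorial
files:

```python
import csv
import io


def read_ids(handle):
    """Return the sample IDs found in the first column of a TSV table."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # assumption: the first row is a header
    return [row[0] for row in reader if row]


def missing_ids(valid_ids, dataset_ids):
    """Return the valid IDs that are absent from a dataset, in order."""
    present = set(dataset_ids)
    return [sample_id for sample_id in valid_ids if sample_id not in present]


# Toy tables standing in for the ID table and one dataset (hypothetical layout)
ids_table = io.StringIO("ID\nS1\nS2\nS3\n")
drugs_table = io.StringIO("ID\tdrug_1\tdrug_2\nS1\t0\t1\nS3\t1\t0\n")

valid = read_ids(ids_table)
absent = missing_ids(valid, read_ids(drugs_table))
print(absent)  # ['S2']
```

To check real files, replace the `io.StringIO` objects with file handles, e.g.,
`open("data/random.small.ids.tsv", newline="")`, and adjust the header handling
to match the actual layout.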

#### The `config` folder

User-defined configuration must be stored in a `config` folder. This folder
can contain `data`, `task`, and `experiment` folders to store the configuration
for a specific dataset, task, or experiment.

Let us take a look at the configuration for our dataset. It is a YAML file
specifying a default layout\*, the directories in which to look for raw data
and store intermediate and final output files, and the list of categorical and
continuous datasets we have.

```yaml
# DO NOT EDIT

defaults:
  - base_data

# FEEL FREE TO EDIT BELOW

raw_data_path: data/              # where raw data is stored
interim_data_path: interim_data/  # where intermediate files will be stored
results_path: results/            # where result files will be placed

sample_names: random.small.ids  # names/IDs of each sample, must appear in the
                                # other datasets

categorical_inputs:  # a list of categorical datasets and their weights
  - name: random.small.drugs

continuous_inputs:   # a list of continuous-valued datasets and their weights
  - name: random.small.proteomics
  - name: random.small.metagenomics
```

<span style="font-size: 0.8em">\* We do not recommend changing `defaults`
unless you are sure of what you are doing.</span>
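
The dataset names in this configuration correspond to the TSV files in the
`data` folder shown in the workspace tree. As a rough illustration of that
naming convention (this is our own sketch, not how MOVE resolves paths
internally, and `expected_file` is a hypothetical helper), we can build the
expected file names in Python:

```python
from pathlib import PurePosixPath

# Values copied from config/data/random_small.yaml
raw_data_path = "data/"
dataset_names = [
    "random.small.ids",
    "random.small.drugs",
    "random.small.proteomics",
    "random.small.metagenomics",
]


def expected_file(raw_dir, name, suffix=".tsv"):
    """Build the path of the file we expect for a dataset name.

    The suffix is appended rather than set with `with_suffix`, because the
    dataset names themselves contain dots.
    """
    return PurePosixPath(raw_dir) / f"{name}{suffix}"


for name in dataset_names:
    print(expected_file(raw_data_path, name))
```

Each printed path should match one of the files listed under `data/` in the
workspace tree.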

Similarly, the `task` folder contains YAML files to configure the tasks of
MOVE. In this tutorial, we provide two examples for running the method to
identify associations, using our t-test and Bayesian approaches, and an example
to perform latent space analysis.

For example, for the t-test approach (`random_small__id_assoc_ttest.yaml`), we
define the following values: batch size, number of refits, name of the dataset
to perturb, target perturbation value, whether to save refits, configuration
for the VAE model, and configuration for the training loop.

```yaml
defaults:
  - identify_associations_ttest

batch_size: 10  # number of samples per batch in training loop

num_refits: 10  # number of times to refit (retrain) model

target_dataset: random.small.drugs  # dataset to perturb
target_value: 1                     # value to change to
save_refits: True                   # whether to save refits to interim folder

model:         # model configuration
  num_hidden:  # list of units in each hidden layer of the VAE encoder/decoder
    - 1000

training_loop:    # training loop configuration
  lr: 1e-4        # learning rate
  num_epochs: 40  # number of epochs
```

Note that `random_small__id_assoc_bayes.yaml` looks very similar, but declares
a different `defaults` entry. This tells MOVE which algorithm to use!

### Running MOVE

#### Encoding data

Make sure you are in the parent directory of the `config` folder (in this
example, it is the `tutorial` folder), and proceed to run:

```bash
cd tutorial
move-dl data=random_small task=encode_data
```

:arrow_up: This command will encode the datasets. The `random.small.drugs`
dataset (defined in `config/data/random_small.yaml`) will be one-hot encoded,
whereas the two omics datasets will be standardized.
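
To make the two encodings concrete, here is a minimal sketch in plain Python.
It illustrates the general techniques only. MOVE's own implementation may
differ (for instance, in how it handles missing values or any additional
transformations), and the toy values below are made up:

```python
from statistics import mean, pstdev


def one_hot(values, categories):
    """Map each discrete value to a 0/1 vector containing a single 1."""
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return encoded


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(value - mu) / sigma for value in values]


# Toy data: whether four individuals took a drug, and one protein measurement
drug_taken = [0, 1, 1, 0]
protein = [2.0, 4.0, 4.0, 6.0]

print(one_hot(drug_taken, categories=[0, 1]))  # [[1, 0], [0, 1], [0, 1], [1, 0]]
print(standardize(protein))
```

Note that `standardize` as written would fail on a feature with zero variance,
so constant features would need special handling.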

#### Analyzing the latent space

Next, we will train a variational autoencoder and analyze how good it is at
reconstructing our input data and generating an informative latent space. Run:

```bash
move-dl data=random_small task=random_small__latent
```

:arrow_up: This command will create four types of plot in the
`results/latent_space` folder:

- Loss curve shows the overall loss and each of its three components:
  Kullback-Leibler divergence (KLD) term, binary cross-entropy term,
  and sum of squared errors term over the number of training epochs.
- Reconstruction metrics boxplot shows a score (accuracy or cosine similarity
  for categorical and continuous datasets, respectively) per reconstructed
  dataset.
- Latent space scatterplot shows a reduced representation of the latent space.
  To generate this visualization, the latent space is reduced to two dimensions
  using t-SNE (or another user-defined algorithm, e.g., UMAP).
- Feature importance swarmplot displays the impact perturbing a feature has on
  the latent space.
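
The two reconstruction scores mentioned above can be sketched as follows. This
is a plain-Python illustration of accuracy and cosine similarity on made-up
values. How MOVE aggregates these scores per sample or per dataset is not shown
here:

```python
import math


def accuracy(true_labels, predicted_labels):
    """Fraction of categorical values that were reconstructed exactly."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


def cosine_similarity(x, y):
    """Cosine of the angle between two continuous-valued vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Toy categorical example: three of four values are reconstructed correctly
original_drugs = [0, 1, 1, 0]
reconstructed_drugs = [0, 1, 0, 0]
print(accuracy(original_drugs, reconstructed_drugs))  # 0.75

# Toy continuous example: the reconstruction is a scaled copy of the input
original_protein = [1.0, 2.0, 3.0]
reconstructed_protein = [2.0, 4.0, 6.0]
print(cosine_similarity(original_protein, reconstructed_protein))
```

Cosine similarity only compares direction, so the scaled reconstruction in the
second example still scores (approximately) 1.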

Additionally, TSV files corresponding to each plot will be generated. These can
be used, for example, to re-create the plots manually.

#### Identifying associations

The next step is to find associations between the drugs taken by each
individual and the omics features. Run:

```bash
move-dl data=random_small task=random_small__id_assoc_ttest
```

:arrow_up: This command will create a `results_sig_assoc.tsv` file in
`results/identify_associations`, listing each pair of associated features and
the corresponding median p-value for such association. There should be ~120
associations found.

:warning: Note that the value after `task=` matches the name of our
configuration file. We can create multiple configuration files (for example,
changing hyperparameters like learning rate) and call them by their name here.

If you want to run the Bayesian approach instead, run:

```bash
move-dl data=random_small task=random_small__id_assoc_bayes
```

Again, it should generate similar results, with over 100 of the known
associations found.

Take a look at the `changes.small.txt` file and compare your results against
it. Did MOVE find any false positives?
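
One way to make this comparison systematic is to treat both files as sets of
(drug, omics feature) pairs. The sketch below assumes the pairs have already
been read from `results_sig_assoc.tsv` and `changes.small.txt`. We do not rely
on the exact column layout of either file, and the feature names used here are
hypothetical:

```python
def compare_associations(found, truth):
    """Summarize how a set of found pairs overlaps with the ground truth."""
    found, truth = set(found), set(truth)
    true_positives = found & truth
    false_positives = found - truth
    missed = truth - found
    precision = len(true_positives) / len(found) if found else 0.0
    recall = len(true_positives) / len(truth) if truth else 0.0
    return {
        "true_positives": sorted(true_positives),
        "false_positives": sorted(false_positives),
        "missed": sorted(missed),
        "precision": precision,
        "recall": recall,
    }


# Hypothetical pairs standing in for the ground truth and MOVE's output
truth = [("drug_1", "protein_7"), ("drug_2", "taxon_3"), ("drug_5", "protein_1")]
found = [("drug_1", "protein_7"), ("drug_5", "protein_1"), ("drug_9", "taxon_8")]

summary = compare_associations(found, truth)
print(summary["false_positives"])  # [('drug_9', 'taxon_8')]
print(summary["missed"])           # [('drug_2', 'taxon_3')]
```

Pairs in `false_positives` were reported but are not in the ground truth, while
pairs in `missed` are true associations that were not recovered.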

#### Tuning the model's hyperparameters

Additionally, we can improve the reconstructions generated by MOVE by running
a grid search over a set of hyperparameters.

We define the hyperparameters we want to sweep in an `experiment` config file,
such as:

```yaml
# @package _global_

# Define the default configuration for the data and task (model and training)

defaults:
  - override /data: random_small
  - override /task: tune_model

# Configure which hyperparameters to vary
# This will run and log the metrics of 12 models (combination of 3 hyperparams
# with 2-3 levels: 2 * 2 * 3)

# Any field defined in the task configuration can be configured below.

hydra:
  mode: MULTIRUN
  sweeper:
    params:
      task.batch_size: 10, 50
      task.model.num_hidden: "[500],[1000]"
      task.training_loop.num_epochs: 40, 60, 100
```

The above configuration file will generate every combination of batch size,
hidden layer size, and number of training epochs. A model will then be trained
with each of these combinations, and the reconstruction metrics will be
recorded in a TSV file.
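
To see where the 12 models come from, we can enumerate the same grid in plain
Python. This sketch only mimics the Cartesian product that a grid sweep
performs. It does not call Hydra or MOVE:

```python
from itertools import product

# Levels copied from the sweeper parameters in the experiment config
sweep = {
    "task.batch_size": [10, 50],
    "task.model.num_hidden": [[500], [1000]],
    "task.training_loop.num_epochs": [40, 60, 100],
}

# One dictionary per run, pairing every parameter with one of its levels
keys = list(sweep)
runs = [dict(zip(keys, combo)) for combo in product(*sweep.values())]

print(len(runs))  # 12
for run in runs:
    print(run)
```

Adding a further level to any parameter multiplies the number of runs, so the
grid grows quickly as more hyperparameters are swept.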