# Tutorial

## Random Small

In this first tutorial, we inspect datasets reporting whether 500 fictitious
individuals have taken one of 20 imaginary drugs. We have also included a pair
of simulated omics datasets, with measurements for each sample (individual).
All these measurements were generated randomly, but we have added 200
associations between different pairs of drugs and omics features. Let us find
them with MOVE!

### Workspace structure

First, we take a look at how to organize our data and configuration:

```
tutorial/
│
├── data/
│   ├── changes.small.txt             <- Ground-truth associations (200 links)
│   ├── random.small.drugs.tsv        <- Drug dataset (20 drugs)
│   ├── random.small.ids.tsv          <- Sample IDs (500 samples)
│   ├── random.small.proteomics.tsv   <- Proteomics dataset (200 proteins)
│   └── random.small.metagenomics.tsv <- Metagenomics dataset (1000 taxa)
│
└── config/                           <- Stores user configuration files
    ├── data/
    │   └── random_small.yaml         <- Configuration to read in the
    │                                    necessary data files
    ├── experiment/                   <- Configuration for experiments (e.g.,
    │   └── random_small__tune.yaml      tuning hyperparameters)
    │
    └── task/                         <- Configuration for tasks, such as
        │                                analyzing the latent space or
        │                                identifying associations using
        │                                the t-test or Bayesian approach
        ├── random_small__id_assoc_bayes.yaml
        ├── random_small__id_assoc_ttest.yaml
        └── random_small__latent.yaml
```

#### The data folder
All "raw" data files should be placed inside the same directory. These files |
|
|
45 |
are TSVs (tab-separated value tables) containing discrete values (e.g., for |
|
|
46 |
binary or categorical datasets) or continuous values. |
|
|
47 |
|
|
|
48 |
Additionally, make sure each sample has an assigned ID and we provide an ID |
|
|
49 |
table containing a list of all valid IDs (must appear in every dataset). |
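
For instance, the first rows of a continuous dataset could look like the
sketch below (the values are invented for illustration, columns are
tab-separated in the real files, and we assume here that the first column
holds the sample ID and each remaining column one feature):

```
ID          protein_1   protein_2   protein_3
sample_1    0.42        -1.30       0.77
sample_2    -0.11       0.25        -0.98
```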

#### The `config` folder

User-defined configuration must be stored in a `config` folder. This folder
can contain `data`, `experiment`, and `task` subfolders to store the
configuration for a specific dataset, experiment, or task.

Let us take a look at the configuration for our dataset. It is a YAML file
specifying: a default layout\*, the directories in which to look for raw data
and store intermediary and final output files, and the list of categorical and
continuous datasets we have.

```yaml
# DO NOT EDIT

defaults:
  - base_data

# FEEL FREE TO EDIT BELOW

raw_data_path: data/              # where raw data is stored
interim_data_path: interim_data/  # where intermediate files will be stored
results_path: results/            # where result files will be placed

sample_names: random.small.ids    # names/IDs of each sample; must appear in
                                  # the other datasets

categorical_inputs:  # a list of categorical datasets and their weights
  - name: random.small.drugs

continuous_inputs:   # a list of continuous-valued datasets and their weights
  - name: random.small.proteomics
  - name: random.small.metagenomics
```

<span style="font-size: 0.8em">\* We do not recommend changing `defaults`
unless you are sure of what you are doing.</span>

Similarly, the `task` folder contains YAML files to configure the tasks of
MOVE. In this tutorial, we provide two examples of running the method to
identify associations (using our t-test and Bayesian approaches), and an
example of performing latent space analysis.

For example, for the t-test approach (`random_small__id_assoc_ttest.yaml`), we
define the following values: batch size, number of refits, name of the dataset
to perturb, target perturbation value, configuration for the VAE model, and
configuration for the training loop.

```yaml
defaults:
  - identify_associations_ttest

batch_size: 10       # number of samples per batch in training loop

num_refits: 10       # number of times to refit (retrain) the model

target_dataset: random.small.drugs  # dataset to perturb
target_value: 1      # value to change to
save_refits: True    # whether to save refits to the interim folder

model:               # model configuration
  num_hidden:        # list of units in each hidden layer of the VAE encoder/decoder
    - 1000

training_loop:       # training loop configuration
  lr: 1e-4           # learning rate
  num_epochs: 40     # number of epochs
```

Note that `random_small__id_assoc_bayes.yaml` looks pretty similar, but
declares a different `defaults`. This tells MOVE which algorithm to use!
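
For reference, its `defaults` list would point to the Bayesian task instead.
The exact name below is our assumption, mirroring the t-test config; check the
file itself:

```yaml
defaults:
  - identify_associations_bayes  # selects the Bayesian algorithm
```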

### Running MOVE

#### Encoding data

Make sure you are in the parent directory of the `config` folder (in this
example, it is the `tutorial` folder), and proceed to run:

```bash
>>> cd tutorial
>>> move-dl data=random_small task=encode_data
```

:arrow_up: This command will encode the datasets. The `random.small.drugs`
dataset (defined in `config/data/random_small.yaml`) will be one-hot encoded,
whereas the other two omics datasets will be standardized.
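
As a rough illustration of what these two transforms do (this is not MOVE's
internal code, just a minimal pandas/NumPy sketch):

```python
import numpy as np
import pandas as pd

# One-hot encoding: each category becomes its own binary column.
drugs = pd.Series(["drug_a", "drug_b", "drug_a"])
one_hot = pd.get_dummies(drugs)  # columns: drug_a, drug_b

# Standardization: center each continuous feature and scale to unit variance.
proteomics = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
standardized = (proteomics - proteomics.mean(axis=0)) / proteomics.std(axis=0)
```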

#### Analyzing the latent space

Next, we will train a variational autoencoder and analyze how good it is at
reconstructing our input data and generating an informative latent space. Run:

```bash
>>> move-dl data=random_small task=random_small__latent
```

:arrow_up: This command will create four types of plot in the
`results/latent_space` folder:

- Loss curve shows the overall loss and each of its three components (the
  Kullback-Leibler divergence (KLD) term, the binary cross-entropy term, and
  the sum of squared errors term) over the training epochs.
- Reconstruction metrics boxplot shows a score (accuracy or cosine similarity
  for categorical and continuous datasets, respectively) per reconstructed
  dataset.
- Latent space scatterplot shows a reduced representation of the latent space.
  To generate this visualization, the latent space is reduced to two
  dimensions using t-SNE (or another user-defined algorithm, e.g., UMAP).
- Feature importance swarmplot displays the impact perturbing a feature has on
  the latent space.

Additionally, TSV files corresponding to each plot will be generated. These
can be used, for example, to re-create the plots manually, as sketched below.
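
For instance, a re-plot of the loss curve might look like this (the file and
column names here are assumptions; check the actual TSVs written to
`results/latent_space` for the names MOVE uses):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical output file; one column per loss term is assumed.
losses = pd.read_csv("results/latent_space/loss_curve.tsv", sep="\t")

for column in losses.columns:
    plt.plot(losses.index, losses[column], label=column)
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.savefig("loss_curve.png")
```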

#### Identifying associations

The next step is to find associations between the drugs taken by each
individual and the omics features. Run:

```bash
>>> move-dl data=random_small task=random_small__id_assoc_ttest
```

:arrow_up: This command will create a `results_sig_assoc.tsv` file in
`results/identify_associations`, listing each pair of associated features and
the corresponding median p-value for the association. There should be ~120
associations found.

:warning: Note that the value after `task=` matches the name of our
configuration file. We can create multiple configuration files (for example,
changing hyperparameters like the learning rate) and call them by their name
here.

If you want to run the Bayesian approach instead, run:

```bash
>>> move-dl data=random_small task=random_small__id_assoc_bayes
```

Again, it should generate similar results, with over 100 associations found.

Take a look at the `changes.small.txt` file and compare your results against
it. Did MOVE find any false positives?
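
One way to check is to compare the two files programmatically. A minimal
sketch (the column names below are assumptions; inspect both files for the
actual headers first):

```python
import pandas as pd

found = pd.read_csv("results/identify_associations/results_sig_assoc.tsv", sep="\t")
truth = pd.read_csv("data/changes.small.txt", sep="\t")

# Build (drug, feature) pairs from each table; column names are hypothetical.
found_pairs = set(zip(found["drug"], found["feature"]))
true_pairs = set(zip(truth["drug"], truth["feature"]))

print("true positives:", len(found_pairs & true_pairs))
print("false positives:", len(found_pairs - true_pairs))
print("missed:", len(true_pairs - found_pairs))
```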

#### Tuning the model's hyperparameters

Additionally, we can improve the reconstructions generated by MOVE by running
a grid search over a set of hyperparameters.

We define the hyperparameters we want to sweep in an `experiment` config file,
such as:

```yaml
# @package _global_

# Define the default configuration for the data and task (model and training)

defaults:
  - override /data: random_small
  - override /task: tune_model

# Configure which hyperparameters to vary.
# This will run and log the metrics of 12 models (a combination of 3
# hyperparameters with 2-3 levels each: 2 * 2 * 3 = 12).

# Any field defined in the task configuration can be configured below.

hydra:
  mode: MULTIRUN
  sweeper:
    params:
      task.batch_size: 10, 50
      task.model.num_hidden: "[500],[1000]"
      task.training_loop.num_epochs: 40, 60, 100
```

The above configuration file will generate different combinations of batch
size, hidden layer size, and training epochs. Then each model will run with
one of these combinations, and the reconstruction metrics will be recorded in
a TSV file.
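
Assuming the file is saved as `config/experiment/random_small__tune.yaml` (as
in the workspace tree above), the sweep would then be launched by selecting
the experiment configuration:

```bash
>>> move-dl experiment=random_small__tune
```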