# MOSA - Multi-omic Synthetic Augmentation

This repository presents a bespoke Variational Autoencoder (VAE) that integrates all molecular and phenotypic data sets available for cancer cell lines.

![MOSA Overview](./figure/MOSA_Overview.png)

## Installation

### Instructions
1. Clone this repository
2. Create a Python 3.10 environment, e.g. `conda create -n mosa python=3.10`
3. Activate the environment: `conda activate mosa`
4. Run `pip install -r requirements.txt`
5. Install shap from `https://github.com/ZhaoxiangSimonCai/shap`, which is customised to support the data format in MOSA.
6. Run `pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu118`
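
Put together, a typical installation might look like the sketch below (the `git+` line is just one possible way to install the customised shap fork; installing it from a local clone works as well):

```bash
conda create -n mosa python=3.10
conda activate mosa
pip install -r requirements.txt

# customised shap fork (assumed to be pip-installable directly from GitHub)
pip install git+https://github.com/ZhaoxiangSimonCai/shap

# PyTorch wheels built against CUDA 11.8
pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu118
```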
### Typical installation time
The installation time largely depends on internet speed, as the packages need to be downloaded. Typically, the installation should take less than 10 minutes.

## Demo

### Instructions
1. Download the data files from the figshare repository (see links in the manuscript)
2. Configure the paths of the data files in `reports/vae/files/hyperparameters.json`
3. Run MOSA with `python PhenPred/vae/Main.py`
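
Once the data files are downloaded and their paths are set in `reports/vae/files/hyperparameters.json`, the demo run reduces to:

```bash
conda activate mosa
python PhenPred/vae/Main.py
```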
### Expected output
The expected output, including the latent space matrix and reconstructed data matrices, can be downloaded from the figshare repository as described in the paper.
### Expected runtime
As a deep learning-based method, the runtime of MOSA depends on whether a GPU is available for training. MOSA took 52 minutes to train and generate the results using a V100 GPU on the DepMap dataset.
## Instructions for using MOSA with custom data
Although MOSA is specifically designed for analysing the DepMap dataset, the model can be adapted to any multi-omic dataset. To use MOSA with custom datasets:
1. Prepare the custom dataset following the formats of the DepMap data, which can be downloaded from the figshare repositories as described in the manuscript.
2. Configure the paths of the data files in `reports/vae/files/hyperparameters.json`. At least two omic datasets are required.
3. Run MOSA with `python PhenPred/vae/Main.py`
4. If certain benchmark analyses cannot be run properly, set `skip_benchmarks=true` in `hyperparameters.json` so that MOSA only saves the output data, which includes the integrated latent space matrix and the reconstructed data for each omic.
5. To further customise data pre-processing, create your own dataset class following the style of `PhenPred/vae/DatasetDepMap23Q2.py` and use it in `Main.py`.
## Reproduction instructions
### To reproduce the benchmark results
1. Download the data from [figshare](https://doi.org/10.6084/m9.figshare.24562765)
2. Place the downloaded files in `reports/vae/files/`
3. In `Main.py`, configure MOSA to run from the pre-computed data: `hyperparameters = Hypers.read_hyperparameters(timestamp="20231023_092657")`
### To reproduce from scratch
1. Directly run MOSA with the default configurations as described above.
## Instructions for integrating disentanglement learning into MOSA
To incorporate disentanglement learning, two additional terms are included in the loss function, following the Disentangled Inferred Prior Variational Autoencoder (DIP-VAE) approach, as described by [Kumar et al. (2018)](https://arxiv.org/abs/1711.00848):

![DIP-VAE loss term](./figure/dipvae_lossterm.png)

To use this, update the `hyperparameters.json` file by specifying `dip_vae_type` as either `"i"` or `"ii"` (type ii is recommended), and define the parameters `lambda_d` and `lambda_od` as float values, which control the diagonal and off-diagonal regularisation, respectively.
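
For reference, both terms penalise deviations of the latent covariance from the identity matrix; in the notation of Kumar et al., they take roughly the form

$$
\lambda_{od}\sum_{i\neq j}\big[\operatorname{Cov}(z)\big]_{ij}^{2}\;+\;\lambda_{d}\sum_{i}\big(\big[\operatorname{Cov}(z)\big]_{ii}-1\big)^{2},
$$

where type `"i"` computes the covariance over the posterior means only and type `"ii"` also includes the expected posterior covariance (see the paper for the exact definitions).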
## Pre-trained models
The pre-trained models can be downloaded from the Hugging Face model hub: [MOSA](https://huggingface.co/QuantitativeBiology/MOSA_pretrained)
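
One possible way to fetch a local copy, assuming the `huggingface_hub` command-line tool is installed, is:

```bash
pip install -U "huggingface_hub[cli]"
# download the pre-trained MOSA weights into ./MOSA_pretrained
huggingface-cli download QuantitativeBiology/MOSA_pretrained --local-dir ./MOSA_pretrained
```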
## Citation
Cai, Z. et al., Synthetic multi-omics augmentation of cancer cell lines using unsupervised deep learning, 2023