# MOSA - Multi-omic Synthetic Augmentation

This repository presents a bespoke Variational Autoencoder (VAE) that integrates all molecular and phenotypic data sets available for cancer cell lines.

![MOSA Overview](./figure/MOSA_Overview.png)

## Installation
### Instructions
1. Clone this repository
2. Create a Python 3.10 environment, e.g. `conda create -n mosa python=3.10`
3. Activate the environment: `conda activate mosa`
4. Run `pip install -r requirements.txt`
5. Install shap from `https://github.com/ZhaoxiangSimonCai/shap`, which is customised to support the data format used by MOSA.
6. Run `pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu118`

### Typical installation time
Installation time depends largely on internet speed, as the packages need to be downloaded and installed over the internet. It should typically take less than 10 minutes.

## Demo
### Instructions
1. Download the data files from the figshare repository (see links in the manuscript)
2. Configure the paths of the data files in `reports/vae/files/hyperparameters.json` (a quick sanity check of these paths is sketched below)
3. Run MOSA with `python PhenPred/vae/Main.py`
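
Before launching `Main.py`, it can help to confirm that the configured data files are reachable. Below is a minimal sketch under the assumption that the file paths appear as string values in `hyperparameters.json`; adapt it to the keys actually used in your configuration.

```python
# Sanity-check the data file paths configured in hyperparameters.json before
# running MOSA. No specific key names are assumed: any string value ending in
# a common table extension is treated as a file path and checked on disk.
import json
from pathlib import Path

with open("reports/vae/files/hyperparameters.json") as fh:
    hyperparameters = json.load(fh)

for key, value in hyperparameters.items():
    if isinstance(value, str) and value.endswith((".csv", ".csv.gz", ".parquet")):
        status = "found" if Path(value).exists() else "MISSING"
        print(f"{key}: {value} ({status})")
```
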
### Expected output
The expected output, including the latent space matrix and the reconstructed data matrices, can be downloaded from the figshare repository as described in the paper.
### Expected runtime
As a deep learning-based method, MOSA's runtime depends on whether a GPU is available for training. On the DepMap dataset, MOSA took 52 minutes to train and generate the results using a V100 GPU.
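
Before a long run, it is worth confirming that the CUDA build of PyTorch installed above can actually see a GPU; a quick check:

```python
# Report whether PyTorch can use a GPU before starting training;
# without one, training falls back to CPU and takes considerably longer.
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```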

## Instructions for using MOSA with custom data
Although MOSA is specifically designed for analysing the DepMap dataset, the model can be adapted to any multi-omic dataset. To use MOSA with custom datasets:
1. Prepare the custom dataset following the formats of the DepMap data, which can be downloaded from the figshare repositories as described in the manuscript.
2. Configure the paths of the data files in `reports/vae/files/hyperparameters.json`. At least two omic datasets are required.
3. Run MOSA with `python PhenPred/vae/Main.py`
4. If certain benchmark analyses cannot be run properly, set `skip_benchmarks` to `true` in `hyperparameters.json` so that MOSA skips them and only saves the output data, which includes the integrated latent space matrix and the reconstructed data for each omic.
5. To further customise data pre-processing, create your own dataset class following the style of `PhenPred/vae/DatasetDepMap23Q2.py`, and then use the custom dataset class in `Main.py` (a minimal sketch of such a class is shown below).
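
As a starting point, a custom dataset class could look roughly like the sketch below. This is only an illustration: the base class, constructor arguments and item format are assumptions, and the authoritative reference is `PhenPred/vae/DatasetDepMap23Q2.py`.

```python
# Illustrative sketch of a custom multi-omic dataset in the spirit of
# PhenPred/vae/DatasetDepMap23Q2.py. The interface shown here (base class,
# constructor, item format) is an assumption; mirror the real class when
# adapting MOSA to your own data.
import pandas as pd
import torch
from torch.utils.data import Dataset


class DatasetCustom(Dataset):
    def __init__(self, omics_files: dict[str, str]):
        # omics_files maps an omic name (e.g. "transcriptomics") to a CSV file
        # with samples as rows and features as columns.
        views = {name: pd.read_csv(path, index_col=0) for name, path in omics_files.items()}
        # Align the views on the samples shared by every omic.
        samples = sorted(set.intersection(*(set(v.index) for v in views.values())))
        self.samples = samples
        self.views = {name: v.loc[samples].astype("float32") for name, v in views.items()}

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # One tensor per omic for the same sample.
        return [torch.from_numpy(v.iloc[idx].values) for v in self.views.values()]
```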

## Reproduction instructions
### To reproduce the benchmark results
1. Download the data from [figshare](https://doi.org/10.6084/m9.figshare.24562765)
2. Place the downloaded files in `reports/vae/files/`
3. In `Main.py`, configure MOSA to run from the pre-computed data: `hyperparameters = Hypers.read_hyperparameters(timestamp="20231023_092657")` (see the snippet below).
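
In practice this means loading the hyperparameters with the published timestamp before the model is built. The import path below is an assumption; check `Main.py` for the exact module layout.

```python
# Load the pre-computed run shipped with the figshare files instead of
# retraining from scratch. The timestamp identifies the published MOSA run.
from PhenPred.vae import Hypers

hyperparameters = Hypers.read_hyperparameters(timestamp="20231023_092657")
```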

### To reproduce from scratch
1. Directly run MOSA with the default configurations as described above.

## Instructions for Integrating Disentanglement Learning into MOSA
To incorporate disentanglement learning, two additional terms are included in the loss function, following the Disentangled Inferred Prior Variational Autoencoder (DIP-VAE) approach described by [Kumar et al. (2018)](https://arxiv.org/abs/1711.00848):

![DIP-VAE loss term](./figure/dipvae_lossterm.png)

To use this, update the `hyperparameters.json` file by specifying `dip_vae_type` as either `"i"` or `"ii"` (type ii is recommended), and define the parameters `lambda_d` and `lambda_od` as float values, which control the diagonal and off-diagonal regularisation, respectively.
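
For reference, the regulariser these parameters control has the following general form in Kumar et al. (2018), where Cov(z) is the covariance of the aggregate approximate posterior; this is a transcription for convenience, and the figure above shows the exact terms used by MOSA:

```math
\mathcal{L}_{\mathrm{DIP}} \;=\; \lambda_{od} \sum_{i \neq j} \big[\mathrm{Cov}(z)\big]_{ij}^{2} \;+\; \lambda_{d} \sum_{i} \Big( \big[\mathrm{Cov}(z)\big]_{ii} - 1 \Big)^{2}
```

In type `"i"` the covariance is computed over the posterior means, whereas type `"ii"` uses the full posterior covariance.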

## Pre-trained models
The pre-trained models can be downloaded from the Hugging Face model hub: [MOSA](https://huggingface.co/QuantitativeBiology/MOSA_pretrained)
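
One way to fetch these weights programmatically is via the `huggingface_hub` client (an optional dependency, not part of MOSA's requirements):

```python
# Download the pre-trained MOSA weights from the Hugging Face Hub.
# Requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="QuantitativeBiology/MOSA_pretrained")
print(f"Pre-trained model files downloaded to: {local_dir}")
```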

## Citation
Cai, Z. et al., Synthetic multi-omics augmentation of cancer cell lines using unsupervised deep learning, 2023