# moETM: Learning single-cell multi-omic signature of gene regulatory programs by deep embedding topic model

moETM integrates multi-omic data across different experiments or studies with interpretable latent embeddings. For more information, please refer to our [publication](https://www.cell.com/cell-reports-methods/pdf/S2667-2375(23)00207-2.pdf) and [protocol](https://www.sciencedirect.com/science/article/pii/S2666166724002314).

If you have any questions or suggestions, please contact mz335@cornell.edu, zhanghao.duke@gmail.com, yueli@cs.mcgill.ca, or few2001@med.cornell.edu.

## Contents

- [Model Overview](#model-overview)
- [Installation](#installation)
- [Datasets](#datasets)
- [Usage](#usage)
    - [Data Preparation](#data-preparation)
    - [Integration](#integration)
    - [Imputation](#imputation)
    - [Inclusion of prior pathway knowledge](#inclusion-of-prior-pathway-knowledge)
- [Downstream analysis](#downstream-analysis)

## Model Overview

moETM generalizes the widely used variational autoencoder (VAE) to model multi-modal data. Specifically, the encoder in moETM is a fully-connected neural network (NN) that infers the topic proportions of a cell from its multi-omic normalized count vectors. The decoder in moETM is a linear multi-modal embedding topic model (ETM) that reconstructs the normalized count vectors from the latent topic proportions. The parameters of moETM are learned by maximizing the evidence lower bound (ELBO) of the marginal data likelihood under the framework of amortized variational inference.

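
For orientation, the objective described above has the usual VAE shape. The notation here is an assumption made for illustration and is not taken from the moETM code: $x^{(m)}$ is the normalized count vector of modality $m$ for one cell, $\delta$ is the Gaussian latent variable produced by the encoder, and $\theta = \mathrm{softmax}(\delta)$ is the topic proportion.

$$
\mathrm{ELBO} = \mathbb{E}_{q(\delta \mid x)}\left[\sum_{m} \log p\left(x^{(m)} \mid \theta\right)\right] - \mathrm{KL}\left(q(\delta \mid x) \,\|\, p(\delta)\right), \qquad \theta = \mathrm{softmax}(\delta)
$$

The first term rewards accurate reconstruction of every modality from the shared topic proportion, and the second keeps the approximate posterior close to the prior. The exact likelihood and prior used by moETM are given in the publication.
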
![model](./model.png?raw=true "Title")

a) The moETM model. The left panel represents the different modalities. The middle panel represents the fully-connected neural network encoder. Column normalization is applied to each modality before its values are passed to the encoder. The right panel is the linear decoder.

b) The moETM clustering process. The topic proportions $\theta$ inferred by the trained encoder can be clustered and visualized in 2D.

c) Cross-modality imputation. A missing modality can be imputed using the topic embedding and topic matrix from the trained reference decoder.

d) Downstream qualitative analysis. Transcriptomic, epigenomic, and protein signatures from the trained feature-by-topic matrix and enriched cell types from the trained cell-by-topic matrix can be used for enrichment analysis, embedding visualization, and exploring topic-directed regulatory networks.

## Installation

Clone the repository:
```
git clone https://github.com/manqizhou/moETM.git
```
Create folders to store results:
```
mkdir data
mkdir result_fig
mkdir Result
mkdir Trained_model
```
moETM requires several dependencies:

* [python==3.9.5](https://www.python.org)
* [PyTorch==1.11.0+cu102](https://pytorch.org/)
* [scanpy==1.9.1](https://scanpy.readthedocs.io/en/stable/)
* [anndata==0.8.0](https://anndata.readthedocs.io/en/latest/)

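
One way to confirm that the environment matches the versions listed above is a small check such as the following. This is a minimal sketch, not part of the moETM repository: the helper name `check_versions` is hypothetical, and it assumes the packages are installed under the PyPI distribution names `torch`, `scanpy`, and `anndata`.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed PyPI distribution names; "torch" is the package that provides PyTorch.
REQUIRED = {"torch": "1.11.0", "scanpy": "1.9.1", "anndata": "0.8.0"}


def check_versions(required=REQUIRED):
    """Return {package: (installed_version_or_None, expected_prefix, ok)}."""
    report = {}
    for name, expected in required.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Prefix match so that local build tags such as "+cu102" are accepted.
        ok = installed is not None and installed.startswith(expected)
        report[name] = (installed, expected, ok)
    return report


if __name__ == "__main__":
    for name, (installed, expected, ok) in check_versions().items():
        status = "ok" if ok else "check"
        print(f"{name}: installed={installed}, expected={expected} [{status}]")
```

A package reported as `check` is either missing or installed at a different version. Whether other versions also work is not stated here.
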
## Datasets

Seven public datasets were used in this study for performance evaluation and model comparison, all from publicly available repositories. Four datasets (BMMC1, MSLAC, MKC, MBC) consist of gene expression and chromatin accessibility information, and three datasets (BMMC2, HWBC, HBIC) include gene expression and surface protein information. Notably, the HBIC dataset was measured from both COVID patients and healthy individuals.

* [BMMC1 and BMMC2](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE194122): The Bone Marrow Mononuclear Cell datasets from the 2021 NeurIPS challenge.
* [MSLAC](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE140203): The Mouse Skin Late Anagen Cell dataset.
* [MKC](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE117089): The Mouse Kidney Cell dataset.
* [MBC](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE140203): The Mouse Brain Cell dataset.
* [HWBC](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE164378): The Human White Blood Cell dataset.
* [HBIC](https://www.nature.com/articles/s41591-021-01329-2#data-availability): The Human Blood Immune Cell dataset.

## Usage

### Data preparation

moETM requires cell-by-feature matrices as input, where the features can be genes, proteins, or peaks. The input data is in `AnnData` format and is loaded and preprocessed by the `load_*_dataset()` and `prepare_*_dataset()` functions in the `dataloader.py` script. Before being passed to the model, all matrices are column-normalized by dividing each column by its sum.

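
The column normalization described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `dataloader.py`: the function name `column_normalize` and the `eps` guard are hypothetical, and the input is assumed to be a dense array (an `AnnData` matrix stored in sparse form would need a sparse-aware variant).

```python
import numpy as np


def column_normalize(counts, eps=1e-12):
    """Divide every column of a cell-by-feature matrix by its column sum.

    Columns whose sum is zero are left as all zeros instead of producing NaNs.
    """
    counts = np.asarray(counts, dtype=np.float64)
    col_sums = counts.sum(axis=0, keepdims=True)
    # Replace zero sums by 1 so that empty columns stay zero after division.
    safe_sums = np.where(col_sums > eps, col_sums, 1.0)
    return counts / safe_sums


# Toy matrix: 3 cells (rows) x 3 features (columns); the last feature is never observed.
toy = np.array([[1, 0, 0],
                [3, 2, 0],
                [0, 2, 0]])
normalized = column_normalize(toy)
print(normalized)
```

After normalization, every non-empty column sums to 1, so features with very different total counts are placed on a comparable scale.
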
### Integration

Run the main script `main_integration_*.py` after editing the data paths to match your setup.

* For the gene + protein case, please refer to `main_integration_rna_protein.py` for details.
* For the gene + peak case, please refer to `main_integration_rna_atac.py` for details.

### Imputation

Please refer to `main_cross_prediction_rna_atac.py` and `main_cross_prediction_rna_protein.py` for details. The two scripts share the same training procedure and differ only in the data preparation part.

### Inclusion of prior pathway knowledge

moETM can use prior pathway knowledge by adding a pathway-by-gene matrix to the encoder. We downloaded pathways from [MSigDB](https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp) and selected the C7 (immunologic signature) gene sets. We kept pathways that contain more than 5 and fewer than 100 genes.

Please refer to `main_integration_rna_atac_use_pathway.py` for details.

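
The size filter and the pathway-by-gene matrix can be sketched as below. This is a hypothetical illustration rather than the code in `main_integration_rna_atac_use_pathway.py`: the function names, the toy gene sets, and the choice to count genes as listed in each pathway (before intersecting with the dataset's genes) are all assumptions.

```python
import numpy as np


def filter_pathways(pathways, min_genes=5, max_genes=100):
    """Keep pathways with more than min_genes and fewer than max_genes genes."""
    return {
        name: set(members)
        for name, members in pathways.items()
        if min_genes < len(set(members)) < max_genes
    }


def pathway_by_gene_matrix(pathways, genes):
    """Build a binary matrix: entry (p, g) is 1 if gene g belongs to pathway p."""
    gene_index = {gene: j for j, gene in enumerate(genes)}
    names = sorted(pathways)
    matrix = np.zeros((len(names), len(genes)), dtype=np.float32)
    for i, name in enumerate(names):
        for gene in pathways[name]:
            j = gene_index.get(gene)
            if j is not None:  # ignore pathway genes absent from the dataset
                matrix[i, j] = 1.0
    return names, matrix


# Toy gene sets with made-up names (not real MSigDB entries).
toy_pathways = {
    "SET_SMALL": [f"G{i}" for i in range(5)],    # exactly 5 genes: dropped
    "SET_OK": [f"G{i}" for i in range(6)],       # 6 genes: kept
    "SET_LARGE": [f"G{i}" for i in range(100)],  # exactly 100 genes: dropped
}
dataset_genes = [f"G{i}" for i in range(8)]

kept = filter_pathways(toy_pathways)
names, matrix = pathway_by_gene_matrix(kept, dataset_genes)
print(names, matrix.shape)
```

Both bounds are strict, matching "more than 5 and fewer than 100", so pathways with exactly 5 or exactly 100 genes are excluded.
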
## Downstream analysis

Scripts for downstream analysis and plotting are included in the [`downstream_analysis`](/downstream_analysis) folder.