scDEC / Git / Diff of /README.md

Models:

MarcoTheBlack/

scDEC

Downloads: 1

Diff of /README.md [9e0229] .. [5885fa]

Switch to unified view


# scDEC

[![DOI](https://zenodo.org/badge/286327774.svg)](https://zenodo.org/badge/latestdoi/286327774)

![model](https://github.com/kimmo1019/scDEC/blob/master/model.png?raw=true)

scDEC is a computational tool for single cell ATAC-seq data analysis with deep generative neural networks. scDEC enables simultaneously learning the deep embedding and clustering of the cells in an unsupervised manner. scDEC is also applicable to multi-modal single cell data. We tested it on the PBMC paired data (scRNA-seq and scATAC-seq) from 10x Genomics (see Tutorials).

## Recent News

An modified version of scDEC won the first place in [NeurIPS 2021 Multimodal Single-Cell Data Integration competition](https://openproblems.bio/neurips_2021/) two Joint Embedding tasks.

## Requirements
- TensorFlow==1.13.1
- Scikit-learn==0.19.0
- Python==2.7

## Installation
Download scDEC by
```shell
git clone https://github.com/kimmo1019/scDEC
```
Installation has been tested in a Linux platform with Python2.7. GPU is recommended for accelerating the training process.

## Instructions

This section provides instructions on how to run scDEC with scATAC-seq datasets. One can also refer to [Codeocean platform](https://codeocean.com/capsule/0746056) and click `Reproducible Run` on the right. The embedding and clustering results of several datasets will be shown on the right panel.

### Data preparation

Several scATAC-seq datasets have been prepared as the input of scDEC model. These datasets can be downloaded from the [zenode repository](https://zenodo.org/record/3984189#.XzDpJRNKhTY). Uncompress the `datasets.tar.gz` in `datasets` folder then each dataset will have its own subfolder. Each dataset will contain two major files, which denote raw read count matrix (`sc_mat.txt`) and cell label (`label.txt`), respectively. The first column of `sc_mat.txt` represents the peaks information.

### Model training

scDEC is an unsupervised learning model for analyzing scATAC-seq data. One can run 

```python
python main_clustering.py --data [dataset] --K [nb_of_clusters] --dx [x_dim] --dy [y_dim] --train [is_train]
[dataset]  -  the name of the dataset (e.g.,Splenocyte)
[nb_of_clusters]  -  the number of clusters (e.g., 6)
[x_dim]  -  the dimension of Gaussian distribution
[y_dim]  -  the dimension of PCA (defalt: 20)
[is_train] - indicate training from scratch or using pretrained model

```
For an example, one can run `CUDA_VISIBLE_DEVICES=0 python main_clustering.py  --data Splenocyte --K 12 --dx 8 --dy 20` to cluster the scATAC-seq data with pretrained model. Note that the dimension of the embedding should be `K+x_dim`

Or one can run `CUDA_VISIBLE_DEVICES=0 python main_clustering.py  --data Splenocyte --K 12 --dx 8 --dy 20 --train True` to train the model from scratch.

### Model evaluation

If the pretrained model was used, the clustering results in the last step will be directly saved in `results/[dataset]/data_pre.npz` where `dataset` is the name of the scATAC-seq dataset. Note that `data_pre.npz` or `data_at_xxx.npz` contains the predictions from the H network. The first part denotes the embeddings and the second part denotes the inferred one-hot label where one can use `np.argmax` function to get the cluster label.

Then one can run `python eval.py --data [dataset]` to analyze the clustering results. 
For an example, one can run `python eval.py --data Splenocyte`

The t-SNE visualization plot of latent features (`scDEC_embedding.png`), latent feature matrix (`scDEC_embedding.csv`), inferred cluster label (`scDEC_cluster.txt`) will be saved in the `results/[dataset]` folder.


If scDEC model was trained from scratch, the results will be marked by a unique timestamp YYYYMMDD_HHMMSS. This timestamp records the exact time when you run the script. The outputs from the training includes:

 1) `log` files and predicted assignmemnts `data_at_xxx.npz` (xxx denotes different epoch) can be found at folder `results/[dataset]/YYYYMMDD_HHMMSS_x_dim=8_y_dim=20_alpha=10.0_beta=10.0_ratio=0.2`.
 
 2) Model weights will be saved at folder `checkpoint/YYYYMMDD_HHMMSS_x_dim=8_y_dim=20_alpha=10.0_beta=10.0_ratio=0.2`. 
 
 3) The training loss curves were recorded at folder `graph/YYYYMMDD_HHMMSS_x_dim=8_y_dim=20_alpha=10.0_beta=10.0_ratio=0.2`, which can be visualized using TensorBoard.

 Next, one can run 
 
```python
python eval.py --data [dataset] --timestamp [timestamp] --epoch [epoch] --train [is_train]
[dataset]  -  the name of the dataset (e.g.,Splenocyte)
[timestamp]  -  the timestamp of the experiment you ran
[epoch]  -  specify to use the results of which epoch (it can be ignored)
[is_train] - indicate training from scratch 
```

E.g., `python eval.py --data All_blood --timestamp 20200910_143208 --train True`

The t-SNE visualization plot of latent features (`scDEC_embedding.png`), latent feature matrix (`scDEC_embedding.csv`), inferred cluster label (`scDEC_cluster.txt`) will be saved in the same `results` folder as 1).


### Analyzing scATAC dataset without label

One can also use scDEC to analyze custome scATAC-seq dataset, especially the label is unknown. First, the users should prepare raw read count matrix (`sc_mat.txt`) under the folder `datasets/[NAME]`. `[NAME]` denotes the dataset name. 

Second, one can run the following command:

```python
python main_clustering.py --data [dataset] --K [nb_of_clusters] --dx [x_dim] --dy [y_dim] --train [is_train] --no_label
[dataset]  -  the name of the dataset (e.g.,Mydataset)
[nb_of_clusters]  -  the number of clusters (e.g., 6)
[x_dim]  -  the dimension of latent space (continous part)
[y_dim]  -  the dimension of PCA (defalt: 20)
[is_train] - indicate training from scratch 
```

For an example, one can run `CUDA_VISIBLE_DEVICES=0 python main_clustering.py  --data Mydataset --K 10 --dx 5 --dy 20 --train True --no_label` to clustering custom dataset.

Then one can run `python eval.py --data Mydataset --timestamp YYYYMMDD_HHMMSS --epoch epoch --no_label`. Nota time the timestamp `YYYYMMDD_HHMMSS` (for training) and epoch/batch index `epoch` (the last training epoch/batch index is recommended) should be provided. The clustering results (cluster assignments) will be saved in the `results/Mydataset/YYYYMMDD_HHMMSS_xxx` folder.



## Tutorial

[Tutorial Splenocyte](https://github.com/kimmo1019/scDEC/wiki/Splenocyte) Run scDEC on Splenocyte dataset (3166 cells)

[Tutorial Full mouse atlas](https://github.com/kimmo1019/scDEC/wiki/Full-Mouse-atlas) Run scDEC on full Mouse atlas dataset (81173 cells)

[Tutorial PBMC10k paired data ](https://github.com/kimmo1019/scDEC/wiki/PBMC10k) Run scDEC on PBMC data, which contains around 10k cells with both scRNA-seq and scATAC-seq (labels were manually annotated from 10x Genomic R&D group)
 
## Contact

Also Feel free to open an issue in Github or contact `liuqiao@stanford.edu` if you have any problem in running scDEC.

## License

This project is licensed under the MIT License - see the LICENSE.md file for details

	a/README.md		b/README.md
1	# scDEC	1	# scDEC
2		2
3	[![DOI](https://zenodo.org/badge/286327774.svg)](https://zenodo.org/badge/latestdoi/286327774)	3	[![DOI](https://zenodo.org/badge/286327774.svg)](https://zenodo.org/badge/latestdoi/286327774)
4		4
5	![model](https://github.com/kimmo1019/scDEC/blob/master/model.png)	5	![model](https://github.com/kimmo1019/scDEC/blob/master/model.png?raw=true)
6		6
7	scDEC is a computational tool for single cell ATAC-seq data analysis with deep generative neural networks. scDEC enables simultaneously learning the deep embedding and clustering of the cells in an unsupervised manner. scDEC is also applicable to multi-modal single cell data. We tested it on the PBMC paired data (scRNA-seq and scATAC-seq) from 10x Genomics (see Tutorials).	7	scDEC is a computational tool for single cell ATAC-seq data analysis with deep generative neural networks. scDEC enables simultaneously learning the deep embedding and clustering of the cells in an unsupervised manner. scDEC is also applicable to multi-modal single cell data. We tested it on the PBMC paired data (scRNA-seq and scATAC-seq) from 10x Genomics (see Tutorials).
8		8
9	## Recent News	9	## Recent News
10		10
11	An modified version of scDEC won the first place in [NeurIPS 2021 Multimodal Single-Cell Data Integration competition](https://openproblems.bio/neurips_2021/) two Joint Embedding tasks.	11	An modified version of scDEC won the first place in [NeurIPS 2021 Multimodal Single-Cell Data Integration competition](https://openproblems.bio/neurips_2021/) two Joint Embedding tasks.
12		12
13	## Requirements	13	## Requirements
14	- TensorFlow==1.13.1	14	- TensorFlow==1.13.1
15	- Scikit-learn==0.19.0	15	- Scikit-learn==0.19.0
16	- Python==2.7	16	- Python==2.7
17		17
18	## Installation	18	## Installation
19	Download scDEC by	19	Download scDEC by
20	```shell	20	```shell
21	git clone https://github.com/kimmo1019/scDEC	21	git clone https://github.com/kimmo1019/scDEC
22	```	22	```
23	Installation has been tested in a Linux platform with Python2.7. GPU is recommended for accelerating the training process.	23	Installation has been tested in a Linux platform with Python2.7. GPU is recommended for accelerating the training process.
24		24
25	## Instructions	25	## Instructions
26		26
27	This section provides instructions on how to run scDEC with scATAC-seq datasets. One can also refer to [Codeocean platform](https://codeocean.com/capsule/0746056) and click `Reproducible Run` on the right. The embedding and clustering results of several datasets will be shown on the right panel.	27	This section provides instructions on how to run scDEC with scATAC-seq datasets. One can also refer to [Codeocean platform](https://codeocean.com/capsule/0746056) and click `Reproducible Run` on the right. The embedding and clustering results of several datasets will be shown on the right panel.
28		28
29	### Data preparation	29	### Data preparation
30		30
31	Several scATAC-seq datasets have been prepared as the input of scDEC model. These datasets can be downloaded from the [zenode repository](https://zenodo.org/record/3984189#.XzDpJRNKhTY). Uncompress the `datasets.tar.gz` in `datasets` folder then each dataset will have its own subfolder. Each dataset will contain two major files, which denote raw read count matrix (`sc_mat.txt`) and cell label (`label.txt`), respectively. The first column of `sc_mat.txt` represents the peaks information.	31	Several scATAC-seq datasets have been prepared as the input of scDEC model. These datasets can be downloaded from the [zenode repository](https://zenodo.org/record/3984189#.XzDpJRNKhTY). Uncompress the `datasets.tar.gz` in `datasets` folder then each dataset will have its own subfolder. Each dataset will contain two major files, which denote raw read count matrix (`sc_mat.txt`) and cell label (`label.txt`), respectively. The first column of `sc_mat.txt` represents the peaks information.
32		32
33	### Model training	33	### Model training
34		34
35	scDEC is an unsupervised learning model for analyzing scATAC-seq data. One can run	35	scDEC is an unsupervised learning model for analyzing scATAC-seq data. One can run
36		36
37	```python	37	```python
38	python main_clustering.py --data [dataset] --K [nb_of_clusters] --dx [x_dim] --dy [y_dim] --train [is_train]	38	python main_clustering.py --data [dataset] --K [nb_of_clusters] --dx [x_dim] --dy [y_dim] --train [is_train]
39	[dataset] - the name of the dataset (e.g.,Splenocyte)	39	[dataset] - the name of the dataset (e.g.,Splenocyte)
40	[nb_of_clusters] - the number of clusters (e.g., 6)	40	[nb_of_clusters] - the number of clusters (e.g., 6)
41	[x_dim] - the dimension of Gaussian distribution	41	[x_dim] - the dimension of Gaussian distribution
42	[y_dim] - the dimension of PCA (defalt: 20)	42	[y_dim] - the dimension of PCA (defalt: 20)
43	[is_train] - indicate training from scratch or using pretrained model	43	[is_train] - indicate training from scratch or using pretrained model
44		44
45	```	45	```
46	For an example, one can run `CUDA_VISIBLE_DEVICES=0 python main_clustering.py --data Splenocyte --K 12 --dx 8 --dy 20` to cluster the scATAC-seq data with pretrained model. Note that the dimension of the embedding should be `K+x_dim`	46	For an example, one can run `CUDA_VISIBLE_DEVICES=0 python main_clustering.py --data Splenocyte --K 12 --dx 8 --dy 20` to cluster the scATAC-seq data with pretrained model. Note that the dimension of the embedding should be `K+x_dim`
47		47
48	Or one can run `CUDA_VISIBLE_DEVICES=0 python main_clustering.py --data Splenocyte --K 12 --dx 8 --dy 20 --train True` to train the model from scratch.	48	Or one can run `CUDA_VISIBLE_DEVICES=0 python main_clustering.py --data Splenocyte --K 12 --dx 8 --dy 20 --train True` to train the model from scratch.
49		49
50	### Model evaluation	50	### Model evaluation
51		51
52	If the pretrained model was used, the clustering results in the last step will be directly saved in `results/[dataset]/data_pre.npz` where `dataset` is the name of the scATAC-seq dataset. Note that `data_pre.npz` or `data_at_xxx.npz` contains the predictions from the H network. The first part denotes the embeddings and the second part denotes the inferred one-hot label where one can use `np.argmax` function to get the cluster label.	52	If the pretrained model was used, the clustering results in the last step will be directly saved in `results/[dataset]/data_pre.npz` where `dataset` is the name of the scATAC-seq dataset. Note that `data_pre.npz` or `data_at_xxx.npz` contains the predictions from the H network. The first part denotes the embeddings and the second part denotes the inferred one-hot label where one can use `np.argmax` function to get the cluster label.
53		53
54	Then one can run `python eval.py --data [dataset]` to analyze the clustering results.	54	Then one can run `python eval.py --data [dataset]` to analyze the clustering results.
55	For an example, one can run `python eval.py --data Splenocyte`	55	For an example, one can run `python eval.py --data Splenocyte`
56		56
57	The t-SNE visualization plot of latent features (`scDEC_embedding.png`), latent feature matrix (`scDEC_embedding.csv`), inferred cluster label (`scDEC_cluster.txt`) will be saved in the `results/[dataset]` folder.	57	The t-SNE visualization plot of latent features (`scDEC_embedding.png`), latent feature matrix (`scDEC_embedding.csv`), inferred cluster label (`scDEC_cluster.txt`) will be saved in the `results/[dataset]` folder.
58		58
59		59
60	If scDEC model was trained from scratch, the results will be marked by a unique timestamp YYYYMMDD_HHMMSS. This timestamp records the exact time when you run the script. The outputs from the training includes:	60	If scDEC model was trained from scratch, the results will be marked by a unique timestamp YYYYMMDD_HHMMSS. This timestamp records the exact time when you run the script. The outputs from the training includes:
61		61
62	1) `log` files and predicted assignmemnts `data_at_xxx.npz` (xxx denotes different epoch) can be found at folder `results/[dataset]/YYYYMMDD_HHMMSS_x_dim=8_y_dim=20_alpha=10.0_beta=10.0_ratio=0.2`.	62	1) `log` files and predicted assignmemnts `data_at_xxx.npz` (xxx denotes different epoch) can be found at folder `results/[dataset]/YYYYMMDD_HHMMSS_x_dim=8_y_dim=20_alpha=10.0_beta=10.0_ratio=0.2`.
63		63
64	2) Model weights will be saved at folder `checkpoint/YYYYMMDD_HHMMSS_x_dim=8_y_dim=20_alpha=10.0_beta=10.0_ratio=0.2`.	64	2) Model weights will be saved at folder `checkpoint/YYYYMMDD_HHMMSS_x_dim=8_y_dim=20_alpha=10.0_beta=10.0_ratio=0.2`.
65		65
66	3) The training loss curves were recorded at folder `graph/YYYYMMDD_HHMMSS_x_dim=8_y_dim=20_alpha=10.0_beta=10.0_ratio=0.2`, which can be visualized using TensorBoard.	66	3) The training loss curves were recorded at folder `graph/YYYYMMDD_HHMMSS_x_dim=8_y_dim=20_alpha=10.0_beta=10.0_ratio=0.2`, which can be visualized using TensorBoard.
67		67
68	Next, one can run	68	Next, one can run
69		69
70	```python	70	```python
71	python eval.py --data [dataset] --timestamp [timestamp] --epoch [epoch] --train [is_train]	71	python eval.py --data [dataset] --timestamp [timestamp] --epoch [epoch] --train [is_train]
72	[dataset] - the name of the dataset (e.g.,Splenocyte)	72	[dataset] - the name of the dataset (e.g.,Splenocyte)
73	[timestamp] - the timestamp of the experiment you ran	73	[timestamp] - the timestamp of the experiment you ran
74	[epoch] - specify to use the results of which epoch (it can be ignored)	74	[epoch] - specify to use the results of which epoch (it can be ignored)
75	[is_train] - indicate training from scratch	75	[is_train] - indicate training from scratch
76	```	76	```
77		77
78	E.g., `python eval.py --data All_blood --timestamp 20200910_143208 --train True`	78	E.g., `python eval.py --data All_blood --timestamp 20200910_143208 --train True`
79		79
80	The t-SNE visualization plot of latent features (`scDEC_embedding.png`), latent feature matrix (`scDEC_embedding.csv`), inferred cluster label (`scDEC_cluster.txt`) will be saved in the same `results` folder as 1).	80	The t-SNE visualization plot of latent features (`scDEC_embedding.png`), latent feature matrix (`scDEC_embedding.csv`), inferred cluster label (`scDEC_cluster.txt`) will be saved in the same `results` folder as 1).
81		81
82		82
83	### Analyzing scATAC dataset without label	83	### Analyzing scATAC dataset without label
84		84
85	One can also use scDEC to analyze custome scATAC-seq dataset, especially the label is unknown. First, the users should prepare raw read count matrix (`sc_mat.txt`) under the folder `datasets/[NAME]`. `[NAME]` denotes the dataset name.	85	One can also use scDEC to analyze custome scATAC-seq dataset, especially the label is unknown. First, the users should prepare raw read count matrix (`sc_mat.txt`) under the folder `datasets/[NAME]`. `[NAME]` denotes the dataset name.
86		86
87	Second, one can run the following command:	87	Second, one can run the following command:
88		88
89	```python	89	```python
90	python main_clustering.py --data [dataset] --K [nb_of_clusters] --dx [x_dim] --dy [y_dim] --train [is_train] --no_label	90	python main_clustering.py --data [dataset] --K [nb_of_clusters] --dx [x_dim] --dy [y_dim] --train [is_train] --no_label
91	[dataset] - the name of the dataset (e.g.,Mydataset)	91	[dataset] - the name of the dataset (e.g.,Mydataset)
92	[nb_of_clusters] - the number of clusters (e.g., 6)	92	[nb_of_clusters] - the number of clusters (e.g., 6)
93	[x_dim] - the dimension of latent space (continous part)	93	[x_dim] - the dimension of latent space (continous part)
94	[y_dim] - the dimension of PCA (defalt: 20)	94	[y_dim] - the dimension of PCA (defalt: 20)
95	[is_train] - indicate training from scratch	95	[is_train] - indicate training from scratch
96	```	96	```
97		97
98	For an example, one can run `CUDA_VISIBLE_DEVICES=0 python main_clustering.py --data Mydataset --K 10 --dx 5 --dy 20 --train True --no_label` to clustering custom dataset.	98	For an example, one can run `CUDA_VISIBLE_DEVICES=0 python main_clustering.py --data Mydataset --K 10 --dx 5 --dy 20 --train True --no_label` to clustering custom dataset.
99		99
100	Then one can run `python eval.py --data Mydataset --timestamp YYYYMMDD_HHMMSS --epoch epoch --no_label`. Nota time the timestamp `YYYYMMDD_HHMMSS` (for training) and epoch/batch index `epoch` (the last training epoch/batch index is recommended) should be provided. The clustering results (cluster assignments) will be saved in the `results/Mydataset/YYYYMMDD_HHMMSS_xxx` folder.	100	Then one can run `python eval.py --data Mydataset --timestamp YYYYMMDD_HHMMSS --epoch epoch --no_label`. Nota time the timestamp `YYYYMMDD_HHMMSS` (for training) and epoch/batch index `epoch` (the last training epoch/batch index is recommended) should be provided. The clustering results (cluster assignments) will be saved in the `results/Mydataset/YYYYMMDD_HHMMSS_xxx` folder.
101		101
102		102
103		103
104	## Tutorial	104	## Tutorial
105		105
106	[Tutorial Splenocyte](https://github.com/kimmo1019/scDEC/wiki/Splenocyte) Run scDEC on Splenocyte dataset (3166 cells)	106	[Tutorial Splenocyte](https://github.com/kimmo1019/scDEC/wiki/Splenocyte) Run scDEC on Splenocyte dataset (3166 cells)
107		107
108	[Tutorial Full mouse atlas](https://github.com/kimmo1019/scDEC/wiki/Full-Mouse-atlas) Run scDEC on full Mouse atlas dataset (81173 cells)	108	[Tutorial Full mouse atlas](https://github.com/kimmo1019/scDEC/wiki/Full-Mouse-atlas) Run scDEC on full Mouse atlas dataset (81173 cells)
109		109
110	[Tutorial PBMC10k paired data ](https://github.com/kimmo1019/scDEC/wiki/PBMC10k) Run scDEC on PBMC data, which contains around 10k cells with both scRNA-seq and scATAC-seq (labels were manually annotated from 10x Genomic R&D group)	110	[Tutorial PBMC10k paired data ](https://github.com/kimmo1019/scDEC/wiki/PBMC10k) Run scDEC on PBMC data, which contains around 10k cells with both scRNA-seq and scATAC-seq (labels were manually annotated from 10x Genomic R&D group)
111		111
112	## Contact	112	## Contact
113		113
114	Also Feel free to open an issue in Github or contact `liuqiao@stanford.edu` if you have any problem in running scDEC.	114	Also Feel free to open an issue in Github or contact `liuqiao@stanford.edu` if you have any problem in running scDEC.
115		115
116	## License	116	## License
117		117
118	This project is licensed under the MIT License - see the LICENSE.md file for details	118	This project is licensed under the MIT License - see the LICENSE.md file for details