# Datasets

Our datasets contain both real and synthetic data (from [dyngen](https://github.com/dynverse/dyngen) and our model), with both UMI and non-UMI counts, and various trajectory topologies. They are also available at [Zenodo](http://doi.org/10.5281/zenodo.14974835).

Due to the storage limit of GitHub, the datasets for the case studies are only available at [Zenodo](http://doi.org/10.5281/zenodo.14974835).

## Benchmark datasets

type|name|count type|topology|N|G|k|source
---|---|---|---|---|---|---|---
real | aging | non-UMI | linear | 873 | 2815 | 3 | [Kowalczyk *et al.* (2015)](https://doi.org/10.1101/gr.192237.115)
real | human\_embryos | non-UMI | linear | 1289 | 8772 | 5 | [Petropoulos *et al.* (2016)](https://doi.org/10.1016/j.cell.2016.03.023)
real | germline | non-UMI | bifurcation | 272 | 8772 | 7 | [Guo *et al.* (2015)](https://doi.org/10.1016/j.cell.2015.05.015)
real | fibroblast | non-UMI | bifurcation | 355 | 3379 | 7 | [Treutlein *et al.* (2016)](https://doi.org/10.1038/nature18323)
real | mesoderm | non-UMI | tree | 504 | 8772 | 9 | [Loh *et al.* (2016)](https://doi.org/10.1016/j.cell.2016.06.011)
real | cell\_cycle | non-UMI | cycle | 264 | 6812 | 3 | [Petropoulos *et al.* (2016)](https://doi.org/10.1016/j.cell.2016.03.023)
real | dentate | UMI | linear | 3585 | 2182 | 5 | [Hochgerner *et al.* (2018)](https://doi.org/10.1038/s41593-017-0056-2)
real | planaria\_muscle | UMI | bifurcation | 2338 | 4210 | 3 | [Wolf *et al.* (2019)](https://doi.org/10.1186/s13059-019-1663-x)
real | planaria\_full | UMI | tree | 18837 | 4210 | 33 | [Wolf *et al.* (2019)](https://doi.org/10.1186/s13059-019-1663-x)
real | immune | UMI | disconnected | 21082 | 18750 | 3 | [Zheng *et al.* (2017)](https://doi.org/10.1038/ncomms14049)
synthetic | linear\_1 | non-UMI | linear | 2000 | 991 | 4 | dyngen
synthetic | linear\_2 | non-UMI | linear | 2000 | 999 | 4 | dyngen
synthetic | linear\_3 | non-UMI | linear | 2000 | 1000 | 4 | dyngen
synthetic | bifurcating\_1 | non-UMI | bifurcation | 2000 | 997 | 7 | dyngen
synthetic | bifurcating\_2 | non-UMI | bifurcation | 2000 | 991 | 7 | dyngen
synthetic | bifurcating\_3 | non-UMI | bifurcation | 2000 | 1000 | 7 | dyngen
synthetic | trifurcating\_1 | non-UMI | multifurcating | 2000 | 969 | 10 | dyngen
synthetic | trifurcating\_2 | non-UMI | multifurcating | 2000 | 995 | 10 | dyngen
synthetic | converging\_1 | non-UMI | bifurcation | 2000 | 998 | 6 | dyngen
synthetic | cycle\_1 | non-UMI | cycle | 2000 | 1000 | 3 | dyngen
synthetic | cycle\_2 | non-UMI | cycle | 2000 | 999 | 3 | dyngen
synthetic | linear | UMI | linear | 1900 | 1990 | 5 | our model
synthetic | bifurcation | UMI | bifurcation | 2100 | 1996 | 5 | our model
synthetic | multifurcating | UMI | multifurcating | 2700 | 2000 | 7 | our model
synthetic | tree | UMI | tree | 2600 | 2000 | 7 | our model

## Case study datasets

study|name|N|G|k|source
---|---|---|---|---|---
Mouse brain | mouse\_brain\_merged | 6390 <br> 10261 | 14707 | 15 | [Yuzwa *et al.* (2017)](https://doi.org/10.1016/j.celrep.2017.12.017),<br> [Ruan *et al.* (2021)](https://doi.org/10.1073/pnas.2018866118)
Mouse cortex | mouse\_cortex\_dibella | 91648 | 19712 | 24 | [Di Bella *et al.* (2021)](https://doi.org/10.1038/s41586-021-03670-5)
Human hematopoiesis | human\_hematopoiesis\_scRNA <br> human\_hematopoiesis\_scATAC <br> human\_hematopoiesis\_motif | 34901 <br> 33819 <br> 33819 | 15714 <br> 15714 <br> 1764 | 21 | [Granja *et al.* (2019)](https://doi.org/10.1038/s41587-019-0332-7)

Note: For mouse\_cortex\_dibella, the data downloaded from the original paper have already been library-size normalized and log-transformed, so the `count` slot of mouse\_cortex\_dibella holds the transformed values rather than raw counts.
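
For readers unfamiliar with the preprocessing mentioned in the note, here is a minimal sketch of library size normalization followed by a log transformation. The scale factor of 10,000, the use of `log1p`, and the helper name `normalize_log` are assumptions for illustration only. The exact procedure used in the original paper is not described here and may differ.

```python
import numpy as np

def normalize_log(counts, scale=1e4):
    """Hypothetical helper: rescale each cell to `scale` total counts, then apply log1p."""
    counts = np.asarray(counts, dtype=float)
    # Library size = total counts per cell (rows are assumed to be cells).
    lib_size = counts.sum(axis=1, keepdims=True)
    # Guard against division by zero for cells with no counts.
    lib_size[lib_size == 0] = 1.0
    return np.log1p(counts / lib_size * scale)

# Toy matrix with 2 cells and 3 genes (made-up values).
toy = np.array([[1, 1, 2],
                [0, 5, 5]])
transformed = normalize_log(toy)
print(transformed.shape)
```

Since `count` for mouse\_cortex\_dibella already holds values of this kind, applying such a step again would transform the data twice.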

# Usage

## Python

Our package provides a function to load these datasets:

```python
from VITAE import load_data
file_name = 'dentate'

# by default, it returns an anndata object
adata = load_data(path='data/', file_name=file_name)

# if you want to get a dictionary, set return_dict=True
data, adata = load_data(path='data/',
                        file_name=file_name,
                        return_dict=True
                        )
print(data.keys())
# dict_keys(['count', 'grouping', 'gene_names', 'cell_ids', 'milestone_network', 'root_milestone_id', 'type'])
```

## R

```R
library(hdf5r)
file.h5 <- H5File$new('dentate.h5', mode='r') # open file
file.h5                                       # overview
names(file.h5)                                # keys of content
count <- t(file.h5[['count']][,])             # transpose to get num_cells*num_genes if necessary
file.h5$close_all()                           # close file
```

# Field

For the returned dict, all possible fields for these datasets are listed below. Note that not every dataset has all of these fields. For example, `covariates` is only available in the case study datasets.

key|detail
---|---
count | A two-dim array of counts.
grouping | A one-dim array of reference labels of cells.
gene\_names | A one-dim array of gene names.
cell\_ids | A one-dim array of cell ids.
covariates | A two-dim array of covariates, e.g., cell-cycle scores and the indicator of data sources.
milestone\_network | A dataframe of the reference connectivity network of cell types. For real data, it is a dataframe indicating the transition of each vertex with columns `from` and `to`. For synthetic data, it is a dataframe indicating the transition of each cell with columns `from`, `to` and `w`.
root\_milestone\_id | The name of the root vertex of the trajectory.
type | 'UMI' or 'non-UMI'
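
To illustrate how these fields fit together, the sketch below builds a small dictionary with the same keys and derives a few summary values from it. The toy values and the helper `summarize` are hypothetical and not part of the package. Treating `count` as cells by genes, and reading N, G and k in the dataset tables as the numbers of cells, genes and distinct `grouping` labels, are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the returned dict; every value here is made up for illustration.
data = {
    'count': np.array([[3, 0, 1],
                       [0, 2, 2],
                       [5, 1, 0],
                       [1, 1, 4]]),  # assumed layout: cells x genes
    'grouping': np.array(['A', 'A', 'B', 'C']),
    'gene_names': np.array(['g1', 'g2', 'g3']),
    'cell_ids': np.array(['c1', 'c2', 'c3', 'c4']),
    'milestone_network': pd.DataFrame({'from': ['A', 'B'], 'to': ['B', 'C']}),
    'root_milestone_id': 'A',
    'type': 'UMI',
}

def summarize(d):
    """Hypothetical helper: check that array lengths agree and report basic sizes."""
    n_cells, n_genes = d['count'].shape
    assert len(d['grouping']) == n_cells
    assert len(d['cell_ids']) == n_cells
    assert len(d['gene_names']) == n_genes
    net = d['milestone_network']
    return {
        'N': n_cells,
        'G': n_genes,
        'k': len(np.unique(d['grouping'])),         # assumed meaning of k
        'edges': list(zip(net['from'], net['to'])),  # vertex-level transitions
        'has_covariates': 'covariates' in d,         # optional field
    }

summary = summarize(data)
print(summary)
```

Checking for optional keys such as `covariates` before use keeps the same code working across the benchmark and case study datasets.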