Our datasets include both real and synthetic data (the latter generated by dyngen and by our model), with both UMI and non-UMI counts and a variety of trajectory topologies. They are also available at Zenodo.
Due to GitHub's storage limit, the datasets for the case studies are only available at Zenodo.
type | name | count type | topology | N | G | k | source |
---|---|---|---|---|---|---|---|
real | aging | non-UMI | linear | 873 | 2815 | 3 | Kowalczyk et al. (2015) |
real | human_embryos | non-UMI | linear | 1289 | 8772 | 5 | Petropoulos et al. (2016) |
real | germline | non-UMI | bifurcation | 272 | 8772 | 7 | Guo et al. (2015) |
real | fibroblast | non-UMI | bifurcation | 355 | 3379 | 7 | Treutlein et al. (2016) |
real | mesoderm | non-UMI | tree | 504 | 8772 | 9 | Loh et al. (2016) |
real | cell_cycle | non-UMI | cycle | 264 | 6812 | 3 | Petropoulos et al. (2016) |
real | dentate | UMI | linear | 3585 | 2182 | 5 | Hochgerner et al. (2018) |
real | planaria_muscle | UMI | bifurcation | 2338 | 4210 | 3 | Wolf et al. (2019) |
real | planaria_full | UMI | tree | 18837 | 4210 | 33 | Wolf et al. (2019) |
real | immune | UMI | disconnected | 21082 | 18750 | 3 | Zheng et al. (2017) |
synthetic | linear_1 | non-UMI | linear | 2000 | 991 | 4 | dyngen |
synthetic | linear_2 | non-UMI | linear | 2000 | 999 | 4 | dyngen |
synthetic | linear_3 | non-UMI | linear | 2000 | 1000 | 4 | dyngen |
synthetic | bifurcating_1 | non-UMI | bifurcation | 2000 | 997 | 7 | dyngen |
synthetic | bifurcating_2 | non-UMI | bifurcation | 2000 | 991 | 7 | dyngen |
synthetic | bifurcating_3 | non-UMI | bifurcation | 2000 | 1000 | 7 | dyngen |
synthetic | trifurcating_1 | non-UMI | multifurcating | 2000 | 969 | 10 | dyngen |
synthetic | trifurcating_2 | non-UMI | multifurcating | 2000 | 995 | 10 | dyngen |
synthetic | converging_1 | non-UMI | bifurcation | 2000 | 998 | 6 | dyngen |
synthetic | cycle_1 | non-UMI | cycle | 2000 | 1000 | 3 | dyngen |
synthetic | cycle_2 | non-UMI | cycle | 2000 | 999 | 3 | dyngen |
synthetic | linear | UMI | linear | 1900 | 1990 | 5 | our model |
synthetic | bifurcation | UMI | bifurcation | 2100 | 1996 | 5 | our model |
synthetic | multifurcating | UMI | multifurcating | 2700 | 2000 | 7 | our model |
synthetic | tree | UMI | tree | 2600 | 2000 | 7 | our model |

Datasets for the case studies:

study | name | N | G | k | source |
---|---|---|---|---|---|
Mouse brain | mouse_brain_merged | 6390 <br> 10261 | 14707 | 15 | Yuzwa et al. (2017), Ruan et al. (2021) |
Mouse cortex | mouse_cortex_dibella | 91648 | 19712 | 24 | Di Bella et al. (2021) |
Human hematopoiesis | human_hematopoiesis_scRNA <br> human_hematopoiesis_scATAC <br> human_hematopoiesis_motif | 34901 <br> 33819 <br> 33819 | 15714 <br> 15714 <br> 1764 | 21 | Granja et al. (2019) |

Note: For mouse_cortex_dibella, the data downloaded from the original paper has already been library-size normalized and log-transformed, so the `count` field of mouse_cortex_dibella holds the transformed data rather than raw counts.
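Because the stored values for this dataset are not raw counts, it can be worth checking a matrix before passing it to a count-based model. The sketch below is only a heuristic and is not part of the package: `looks_like_raw_counts` and `normalize_log` are hypothetical helpers, and the scale factor of 1e4 is an assumed convention, not necessarily the one used in the original paper.

```python
import math


def looks_like_raw_counts(matrix):
    """Heuristic: raw counts are non-negative whole numbers."""
    return all(v >= 0 and float(v).is_integer() for row in matrix for v in row)


def normalize_log(matrix, scale=1e4):
    """Library-size normalize each cell (row), then apply log(1 + x).

    The scale factor is an assumption for illustration only.
    """
    transformed = []
    for row in matrix:
        library_size = sum(row)
        if library_size == 0:
            # An empty cell stays all-zero after the transform.
            transformed.append([0.0] * len(row))
        else:
            transformed.append(
                [math.log1p(v / library_size * scale) for v in row]
            )
    return transformed


# Toy matrix: 2 cells x 3 genes.
raw = [[0, 3, 1], [2, 0, 2]]
transformed = normalize_log(raw)
```

On the toy matrix, `looks_like_raw_counts(raw)` holds while `looks_like_raw_counts(transformed)` does not, which is the behavior one would expect for the `count` field of mouse_cortex_dibella.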
Our package provides a function to load these datasets in Python:

    from VITAE import load_data

    file_name = 'dentate'

    # by default, it returns an anndata object
    adata = load_data(path='data/', file_name=file_name)

    # if you want to get a dictionary, set return_dict=True
    data, adata = load_data(path='data/',
                            file_name=file_name,
                            return_dict=True)
    print(data.keys())
    # dict_keys(['count', 'grouping', 'gene_names', 'cell_ids', 'milestone_network', 'root_milestone_id', 'type'])

The datasets are stored as HDF5 files, so they can also be read in R with the hdf5r package:

    library(hdf5r)
    file.h5 <- H5File$new('dentate.h5', mode='r')  # open file
    file.h5                                        # overview
    names(file.h5)                                 # keys of content
    count <- t(file.h5[['count']][,])              # transpose to get num_cells*num_genes if necessary
    file.h5$close_all()                            # close file

For the returned dict, all possible fields for these datasets are shown below. Note that not every dataset has all of these fields. For example, covariates is only available in the datasets for the case studies.
key | detail |
---|---|
count | A two-dim array of counts. |
grouping | A one-dim array of reference labels of cells. |
gene_names | A one-dim array of gene names. |
cell_ids | A one-dim array of cell ids. |
covariates | A two-dim array of covariates, e.g., cell-cycle scores and the indicator of data sources. |
milestone_network | A dataframe of the reference connectivity network of cell types. For real data, it is a dataframe indicating the transition of each vertex, with columns `from` and `to`. For synthetic data, it is a dataframe indicating the transition of each cell, with columns `from`, `to` and `w`. |
root_milestone_id | The name of the root vertex of the trajectory. |
type | 'UMI' or 'non-UMI' |
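
The `milestone_network` and `root_milestone_id` fields together describe the reference trajectory. As a minimal sketch of how they might be used, the code below builds an adjacency map from `from`/`to` pairs and checks which milestones are reachable from the root. The helper names and the toy edges are hypothetical and not part of the package; with a real dataset, the pairs could be taken from the dataframe with `zip(df['from'], df['to'])`. Optional fields such as covariates are read with `dict.get`, since not every dataset has them.

```python
from collections import deque


def build_adjacency(edges):
    """Map each milestone to the list of milestones it transitions to."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
        adjacency.setdefault(target, [])  # keep leaf milestones as keys
    return adjacency


def reachable_from(root, adjacency):
    """Breadth-first search over the directed milestone network."""
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Toy stand-in for a loaded dataset dict (hypothetical bifurcation A -> B -> {C, D}).
data = {
    'milestone_network': [('A', 'B'), ('B', 'C'), ('B', 'D')],
    'root_milestone_id': 'A',
}

adjacency = build_adjacency(data['milestone_network'])
reached = reachable_from(data['root_milestone_id'], adjacency)
unreachable = set(adjacency) - reached

# Optional field: None when the dataset does not provide it.
covariates = data.get('covariates')
```

For a dataset with a disconnected topology, `unreachable` would be expected to be non-empty, so an empty set should not be treated as a universal validity check.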