Datasets

Our datasets include both real and synthetic data (the latter generated by dyngen and by our model), cover both UMI and non-UMI counts, and span various trajectory topologies. All datasets are also available at Zenodo.

Due to GitHub's storage limit, the datasets for the case studies are only available at Zenodo.

Benchmark datasets

type	name	count type	topology	N	G	k	source
real	aging	non-UMI	linear	873	2815	3	Kowalczyk et al. (2015)
real	human_embryos	non-UMI	linear	1289	8772	5	Petropoulos et al. (2016)
real	germline	non-UMI	bifurcation	272	8772	7	Guo et al. (2015)
real	fibroblast	non-UMI	bifurcation	355	3379	7	Treutlein et al. (2016)
real	mesoderm	non-UMI	tree	504	8772	9	Loh et al. (2016)
real	cell_cycle	non-UMI	cycle	264	6812	3	Petropoulos et al. (2016)
real	dentate	UMI	linear	3585	2182	5	Hochgerner et al. (2018)
real	planaria_muscle	UMI	bifurcation	2338	4210	3	Wolf et al. (2019)
real	planaria_full	UMI	tree	18837	4210	33	Wolf et al. (2019)
real	immune	UMI	disconnected	21082	18750	3	Zheng et al. (2017)
synthetic	linear_1	non-UMI	linear	2000	991	4	dyngen
synthetic	linear_2	non-UMI	linear	2000	999	4	dyngen
synthetic	linear_3	non-UMI	linear	2000	1000	4	dyngen
synthetic	bifurcating_1	non-UMI	bifurcation	2000	997	7	dyngen
synthetic	bifurcating_2	non-UMI	bifurcation	2000	991	7	dyngen
synthetic	bifurcating_3	non-UMI	bifurcation	2000	1000	7	dyngen
synthetic	trifurcating_1	non-UMI	multifurcating	2000	969	10	dyngen
synthetic	trifurcating_2	non-UMI	multifurcating	2000	995	10	dyngen
synthetic	converging_1	non-UMI	bifurcation	2000	998	6	dyngen
synthetic	cycle_1	non-UMI	cycle	2000	1000	3	dyngen
synthetic	cycle_2	non-UMI	cycle	2000	999	3	dyngen
synthetic	linear	UMI	linear	1900	1990	5	our model
synthetic	bifurcation	UMI	bifurcation	2100	1996	5	our model
synthetic	multifurcating	UMI	multifurcating	2700	2000	7	our model
synthetic	tree	UMI	tree	2600	2000	7	our model

Case study datasets

study	name	N	G	k	source
Mouse brain	mouse_brain_merged	6390 / 10261	14707	15	Yuzwa et al. (2017), Ruan et al. (2021)
Mouse cortex	mouse_cortex_dibella	91648	19712	24	Di Bella et al. (2021)
Human hematopoiesis	human_hematopoiesis_scRNA / human_hematopoiesis_scATAC / human_hematopoiesis_motif	34901 / 33819 / 33819	15714 / 15714 / 1764	21	Granja et al. (2019)

Note: For mouse_cortex_dibella, the data downloaded from the original paper have already been library-size normalized and log-transformed, so the count slot of mouse_cortex_dibella holds the transformed data rather than raw counts.
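
Because of this, it can be worth checking whether a count matrix looks like raw counts before applying count-based preprocessing. The following is a minimal sketch using only the Python standard library. The helper name looks_like_raw_counts and the toy matrices are hypothetical, and the check is a simple heuristic (non-negative, integer-valued entries), not part of our package.

```python
import math

def looks_like_raw_counts(matrix, tol=1e-8):
    """Heuristic: raw counts are non-negative and integer-valued."""
    for row in matrix:
        for value in row:
            if value < 0 or abs(value - round(value)) > tol:
                return False
    return True

# Toy matrices (made up for illustration): raw counts and a log1p-transformed copy.
raw = [[0, 3, 1], [2, 0, 5]]
transformed = [[math.log1p(v) for v in row] for row in raw]

raw_ok = looks_like_raw_counts(raw)
transformed_ok = looks_like_raw_counts(transformed)
print(raw_ok, transformed_ok)
```

A matrix that fails this check, as mouse_cortex_dibella is expected to, should probably not be normalized or log-transformed a second time.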

Usage

Python

Our package provides a function to load these datasets:

from VITAE import load_data
file_name = 'dentate'

# By default, load_data returns an AnnData object.
adata = load_data(path='data/', file_name=file_name)

# Set return_dict=True to also get the data as a dictionary.
data, adata = load_data(path='data/', file_name=file_name, return_dict=True)
print(data.keys())
# dict_keys(['count', 'grouping', 'gene_names', 'cell_ids', 'milestone_network', 'root_milestone_id', 'type'])
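
The keys in the returned dictionary vary by dataset, so code that works across datasets may want to check which fields are present before using them. The following is a minimal sketch; the helper split_fields and the toy dictionary are hypothetical stand-ins, not the output of load_data.

```python
# Field names documented for these datasets; not every dataset provides all of them.
DOCUMENTED_FIELDS = [
    'count', 'grouping', 'gene_names', 'cell_ids', 'covariates',
    'milestone_network', 'root_milestone_id', 'type',
]

def split_fields(data):
    """Return (present, missing) lists of documented field names."""
    present = [key for key in DOCUMENTED_FIELDS if key in data]
    missing = [key for key in DOCUMENTED_FIELDS if key not in data]
    return present, missing

# Toy dictionary with made-up values, standing in for a loaded benchmark dataset.
example = {
    'count': [[1, 0], [0, 2]],
    'grouping': ['a', 'b'],
    'gene_names': ['g1', 'g2'],
    'cell_ids': ['c1', 'c2'],
    'milestone_network': [('a', 'b')],
    'root_milestone_id': 'a',
    'type': 'UMI',
}

present, missing = split_fields(example)
print(missing)

# dict.get returns None instead of raising KeyError when a field is absent.
covariates = example.get('covariates')
```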

R

library(hdf5r)
file.h5 <- H5File$new('dentate.h5', mode='r')   # open file
file.h5                                         # overview
names(file.h5)                                  # keys of content
count <- t(file.h5[['count']][,])               # transpose to get num_cells*num_genes if necessary
file.h5$close_all()                             # close file

Fields

For the returned dict, all possible fields for these datasets are shown below. Note that not every dataset has all of these fields; for example, covariates is only available for the case study datasets.

key	detail
count	A two-dim array of counts.
grouping	A one-dim array of reference labels of cells.
gene_names	A one-dim array of gene names.
cell_ids	A one-dim array of cell ids.
covariates	A two-dim array of covariates, e.g., cell-cycle scores and the indicator of data sources.
milestone_network	A dataframe of the reference connectivity network of cell types. For real data, it is a dataframe indicating the transition of each vertex with columns `from` and `to`. For synthetic data, it is a dataframe indicating the transition of each cell with columns `from`, `to` and `w`.
root_milestone_id	The name of the root vertex of the trajectory.
type	The count type, either 'UMI' or 'non-UMI'.
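
The `from` and `to` columns of milestone_network, together with root_milestone_id, are enough to walk the reference trajectory. The following is a minimal sketch, assuming the two columns have already been extracted as (from, to) pairs; the helper names and the milestone labels are hypothetical.

```python
from collections import defaultdict, deque

def build_adjacency(edges):
    """Map each milestone to the list of milestones it transitions to."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)
    return dict(adjacency)

def milestones_from_root(adjacency, root):
    """Breadth-first order of the milestones reachable from the root."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in adjacency.get(node, []):
            if nxt not in seen:  # guards against revisiting in cyclic topologies
                seen.add(nxt)
                queue.append(nxt)
    return order

# Made-up bifurcating network: A -> B, then B branches into C and D.
edges = [('A', 'B'), ('B', 'C'), ('B', 'D')]
adjacency = build_adjacency(edges)
order = milestones_from_root(adjacency, 'A')
print(order)
```

The visited set keeps the traversal finite for cycle topologies, where a milestone eventually points back to the root.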