Datasets

Our datasets include real and synthetic data (the latter generated by dyngen and by our model), cover both UMI and non-UMI counts, and span various trajectory topologies. They are also available on Zenodo.

Due to the storage limit of GitHub, the datasets for the case studies are only available on Zenodo.

Benchmark datasets

| type | name | count type | topology | N | G | k | source |
|---|---|---|---|---|---|---|---|
| real | aging | non-UMI | linear | 873 | 2815 | 3 | Kowalczyk et al. (2015) |
| real | human_embryos | non-UMI | linear | 1289 | 8772 | 5 | Petropoulos et al. (2016) |
| real | germline | non-UMI | bifurcation | 272 | 8772 | 7 | Guo et al. (2015) |
| real | fibroblast | non-UMI | bifurcation | 355 | 3379 | 7 | Treutlein et al. (2016) |
| real | mesoderm | non-UMI | tree | 504 | 8772 | 9 | Loh et al. (2016) |
| real | cell_cycle | non-UMI | cycle | 264 | 6812 | 3 | Petropoulos et al. (2016) |
| real | dentate | UMI | linear | 3585 | 2182 | 5 | Hochgerner et al. (2018) |
| real | planaria_muscle | UMI | bifurcation | 2338 | 4210 | 3 | Wolf et al. (2019) |
| real | planaria_full | UMI | tree | 18837 | 4210 | 33 | Wolf et al. (2019) |
| real | immune | UMI | disconnected | 21082 | 18750 | 3 | Zheng et al. (2017) |
| synthetic | linear_1 | non-UMI | linear | 2000 | 991 | 4 | dyngen |
| synthetic | linear_2 | non-UMI | linear | 2000 | 999 | 4 | dyngen |
| synthetic | linear_3 | non-UMI | linear | 2000 | 1000 | 4 | dyngen |
| synthetic | bifurcating_1 | non-UMI | bifurcation | 2000 | 997 | 7 | dyngen |
| synthetic | bifurcating_2 | non-UMI | bifurcation | 2000 | 991 | 7 | dyngen |
| synthetic | bifurcating_3 | non-UMI | bifurcation | 2000 | 1000 | 7 | dyngen |
| synthetic | trifurcating_1 | non-UMI | multifurcating | 2000 | 969 | 10 | dyngen |
| synthetic | trifurcating_2 | non-UMI | multifurcating | 2000 | 995 | 10 | dyngen |
| synthetic | converging_1 | non-UMI | bifurcation | 2000 | 998 | 6 | dyngen |
| synthetic | cycle_1 | non-UMI | cycle | 2000 | 1000 | 3 | dyngen |
| synthetic | cycle_2 | non-UMI | cycle | 2000 | 999 | 3 | dyngen |
| synthetic | linear | UMI | linear | 1900 | 1990 | 5 | our model |
| synthetic | bifurcation | UMI | bifurcation | 2100 | 1996 | 5 | our model |
| synthetic | multifurcating | UMI | multifurcating | 2700 | 2000 | 7 | our model |
| synthetic | tree | UMI | tree | 2600 | 2000 | 7 | our model |

Case study datasets

| study | name | N | G | k | source |
|---|---|---|---|---|---|
| Mouse brain | mouse_brain_merged | 6390 <br> 10261 | 14707 | 15 | Yuzwa et al. (2017), <br> Ruan et al. (2021) |
| Mouse cortex | mouse_cortex_dibella | 91648 | 19712 | 24 | Di Bella et al. (2021) |
| Human hematopoiesis | human_hematopoiesis_scRNA <br> human_hematopoiesis_scATAC <br> human_hematopoiesis_motif | 34901 <br> 33819 <br> 33819 | 15714 <br> 15714 <br> 1764 | 21 | Granja et al. (2019) |

Note: For `mouse_cortex_dibella`, the data downloaded from the original paper have already been library-size normalized and log-transformed, so the `count` slot of this dataset holds the transformed data rather than raw counts.
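
Because the `count` slot of a dataset may hold transformed values, it can be worth checking a matrix before passing it to a count-based model. The following is a minimal sketch of such a check. The helper name `looks_like_raw_counts` is hypothetical and not part of the package, and the check is only a heuristic: it tests whether every entry is a non-negative whole number.

```python
def looks_like_raw_counts(matrix):
    """Return True if every entry is a non-negative whole number.

    `matrix` is any iterable of rows (nested lists or the rows of an array).
    This is a heuristic only; it cannot prove the data are raw counts.
    """
    for row in matrix:
        for value in row:
            if value < 0 or not float(value).is_integer():
                return False
    return True


# Toy matrices (made-up values, not taken from any dataset).
raw = [[0, 3, 1], [2, 0, 5]]
transformed = [[0.0, 1.386, 0.693], [1.099, 0.0, 1.792]]

print(looks_like_raw_counts(raw))          # True
print(looks_like_raw_counts(transformed))  # False
```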

Usage

Python

Our package provides a function to load these datasets:

from VITAE import load_data
file_name = 'dentate'

# By default, load_data returns an AnnData object.
adata = load_data(path='data/', file_name=file_name)

# Set return_dict=True to also get the data as a dictionary.
data, adata = load_data(path='data/',
                        file_name=file_name,
                        return_dict=True)
print(data.keys())
# dict_keys(['count', 'grouping', 'gene_names', 'cell_ids', 'milestone_network', 'root_milestone_id', 'type'])
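
Once loaded, the dictionary can be inspected with plain Python. The sketch below uses a small hand-made dictionary with some of the same keys in place of a real dataset, so it runs without the package. The helper `summarize` is hypothetical, and it assumes that the rows of `count` are cells and the columns are genes.

```python
def summarize(data):
    """Report basic sizes of a dataset dictionary.

    Assumes rows of `count` are cells and columns are genes.
    """
    count = data["count"]
    n_cells = len(count)
    n_genes = len(count[0]) if n_cells else 0
    n_labels = len(set(data["grouping"]))
    return {"n_cells": n_cells, "n_genes": n_genes, "n_labels": n_labels}


# Toy stand-in for the dictionary returned with return_dict=True.
toy = {
    "count": [[0, 2, 1], [4, 0, 0], [1, 1, 3], [0, 5, 2]],
    "grouping": ["A", "A", "B", "C"],
    "gene_names": ["g1", "g2", "g3"],
    "cell_ids": ["c1", "c2", "c3", "c4"],
    "type": "UMI",
}

print(summarize(toy))
# {'n_cells': 4, 'n_genes': 3, 'n_labels': 3}
```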

R

library(hdf5r)
file.h5 <- H5File$new('dentate.h5', mode='r')   # open file
file.h5                                         # overview
names(file.h5)                                  # keys of content
count <- t(file.h5[['count']][,])               # transpose to get num_cells*num_genes if necessary
file.h5$close_all()                             # close file

Fields

For the returned dict, all possible fields for these datasets are shown below. Note that not every dataset has all of these fields. For example, `covariates` is only available in the case study datasets.

| key | detail |
|---|---|
| count | A two-dim array of counts. |
| grouping | A one-dim array of reference labels of cells. |
| gene_names | A one-dim array of gene names. |
| cell_ids | A one-dim array of cell ids. |
| covariates | A two-dim array of covariates, e.g., cell-cycle scores and the indicator of data sources. |
| milestone_network | A dataframe of the reference connectivity network of cell types. For real data, it is a dataframe indicating the transition of each vertex, with columns `from` and `to`. For synthetic data, it is a dataframe indicating the transition of each cell, with columns `from`, `to` and `w`. |
| root_milestone_id | The name of the root vertex of the trajectory. |
| type | The count type, 'UMI' or 'non-UMI'. |
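
To illustrate how `milestone_network` and `root_milestone_id` fit together, the sketch below walks a toy network from its root in breadth-first order. The edge list and the helper `milestones_from_root` are hypothetical. With a pandas dataframe `df`, the edges could be obtained as `zip(df['from'], df['to'])`.

```python
from collections import defaultdict, deque


def milestones_from_root(edges, root):
    """Return the milestones reachable from `root`, in breadth-first order."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)

    order, seen, queue = [], {root}, deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children[node]:
            if child not in seen:  # also skips repeated edges and cycles
                seen.add(child)
                queue.append(child)
    return order


# Toy bifurcating trajectory: A -> B, then B splits into C and D.
edges = [("A", "B"), ("B", "C"), ("B", "D")]
print(milestones_from_root(edges, "A"))  # ['A', 'B', 'C', 'D']
```

Because visited milestones are tracked, the same function tolerates the repeated per-cell rows of the synthetic datasets and the closed loop of a cycle topology.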