--- a
+++ b/data/README.md
@@ -0,0 +1,111 @@
+# Datasets
+
+
+Our datasets contain both real and synthetic data (from [dyngen](https://github.com/dynverse/dyngen) and our model), with both UMI and non-UMI counts and a variety of trajectory topologies. They are also available at [Zenodo](http://doi.org/10.5281/zenodo.14974835).
+
+Because of GitHub's storage limits, the case study datasets are available only at [Zenodo](http://doi.org/10.5281/zenodo.14974835).
+
+
+## Benchmark datasets
+
+type|name|count type|topology|N|G|k|source
+---|---|---|---|---|---|---|---
+real | aging | non-UMI | linear | 873 | 2815 | 3 | [Kowalczyk *et al.* (2015)](https://doi.org/10.1101/gr.192237.115)
+real | human\_embryos | non-UMI | linear | 1289 | 8772 | 5 | [Petropoulos *et al.* (2016)](https://doi.org/10.1016/j.cell.2016.03.023)
+real | germline | non-UMI | bifurcation | 272 | 8772 | 7 | [Guo *et al.* (2015)](https://doi.org/10.1016/j.cell.2015.05.015)
+real | fibroblast | non-UMI | bifurcation | 355 | 3379 | 7 | [Treutlein *et al.* (2016)](https://doi.org/10.1038/nature18323)
+real | mesoderm | non-UMI | tree | 504 | 8772 | 9 | [Loh *et al.* (2016)](https://doi.org/10.1016/j.cell.2016.06.011)
+real | cell\_cycle | non-UMI | cycle | 264 | 6812 | 3 | [Petropoulos *et al.* (2016)](https://doi.org/10.1016/j.cell.2016.03.023)
+real | dentate | UMI | linear | 3585 | 2182 | 5 | [Hochgerner *et al.* (2018)](https://doi.org/10.1038/s41593-017-0056-2)
+real | planaria\_muscle | UMI | bifurcation | 2338 | 4210 | 3 | [Wolf *et al.* (2019)](https://doi.org/10.1186/s13059-019-1663-x)
+real | planaria\_full | UMI | tree | 18837 | 4210 | 33 | [Wolf *et al.* (2019)](https://doi.org/10.1186/s13059-019-1663-x)
+real | immune | UMI | disconnected | 21082 | 18750 | 3 | [Zheng *et al.* (2017)](https://doi.org/10.1038/ncomms14049)
+synthetic | linear\_1 | non-UMI | linear | 2000 | 991 | 4 | dyngen
+synthetic | linear\_2 | non-UMI | linear | 2000 | 999 | 4 | dyngen
+synthetic | linear\_3 | non-UMI | linear | 2000 | 1000 | 4 | dyngen
+synthetic | bifurcating\_1 | non-UMI | bifurcation | 2000 | 997 | 7 | dyngen
+synthetic | bifurcating\_2 | non-UMI | bifurcation | 2000 | 991 | 7 | dyngen
+synthetic | bifurcating\_3 | non-UMI | bifurcation | 2000 | 1000 | 7 | dyngen
+synthetic | trifurcating\_1 | non-UMI | multifurcating | 2000 | 969 | 10 | dyngen
+synthetic | trifurcating\_2 | non-UMI | multifurcating | 2000 | 995 | 10 | dyngen
+synthetic | converging\_1 | non-UMI | bifurcation | 2000 | 998 | 6 | dyngen
+synthetic | cycle\_1 | non-UMI | cycle | 2000 | 1000 | 3 | dyngen
+synthetic | cycle\_2 | non-UMI | cycle | 2000 | 999 | 3 | dyngen
+synthetic | linear | UMI | linear | 1900 | 1990 | 5 | our model
+synthetic | bifurcation | UMI | bifurcation | 2100 | 1996 | 5 | our model
+synthetic | multifurcating | UMI | multifurcating | 2700 | 2000 | 7 | our model
+synthetic | tree | UMI | tree | 2600 | 2000 | 7 | our model
+
+
+
+## Case study datasets
+
+
+study|name|N|G|k|source
+---|---|---|---|---|---
+Mouse brain | mouse\_brain\_merged | 6390 <br> 10261 | 14707 | 15 | [Yuzwa *et al.* (2017)](https://doi.org/10.1016/j.celrep.2017.12.017),<br> [Ruan *et al.* (2021)](https://doi.org/10.1073/pnas.2018866118)
+Mouse cortex | mouse\_cortex\_dibella | 91648 | 19712 | 24 | [Di Bella *et al.* (2021)](https://doi.org/10.1038/s41586-021-03670-5)
+Human hematopoiesis | human\_hematopoiesis\_scRNA <br> human\_hematopoiesis\_scATAC <br> human\_hematopoiesis\_motif | 34901 <br> 33819 <br> 33819 | 15714 <br> 15714 <br> 1764 | 21 | [Granja *et al.* (2019)](https://doi.org/10.1038/s41587-019-0332-7)
+
+
+Note: For mouse\_cortex\_dibella, the data downloaded from the original paper have already been library-size normalized and log-transformed, so the `count` slot of mouse\_cortex\_dibella holds the transformed data.
+
+
+# Usage
+
+## Python
+
+Our package provides a function to load these datasets:
+
+```python
+from VITAE import load_data
+file_name = 'dentate'
+
+# by default, it returns an anndata object
+adata = load_data(path='data/', file_name=file_name)
+
+# if you want to get a dictionary, set return_dict=True
+data, adata = load_data(path='data/',
+                        file_name=file_name,
+                        return_dict=True
+                        )
+print(data.keys())
+# dict_keys(['count', 'grouping', 'gene_names', 'cell_ids', 'milestone_network', 'root_milestone_id', 'type'])
+```
+
+## R
+
+```R
+library(hdf5r)
+file.h5 <- H5File$new('dentate.h5', mode='r')    # open file
+file.h5                                          # overview
+names(file.h5)                                   # keys of content
+count <- t(file.h5[['count']][,])                # transpose to get num_cells*num_genes if necessary
+file.h5$close_all()                              # close file
+```
+
+
+# Field
+
+For the returned dict, all possible fields for these datasets are shown below. Note that not every dataset has all of these fields. For example, `covariates` is only available in the case study datasets.
+
+
+key|detail
+---|---
+count | A two-dim array of counts.
+grouping | A one-dim array of reference labels of cells.
+gene\_names | A one-dim array of gene names.
+cell\_ids | A one-dim array of cell ids.
+covariates | A two-dim array of covariates, e.g., cell-cycle scores and the indicator of data sources.
+milestone\_network | A dataframe of the reference connectivity network of cell types. For real data, each row is a transition between vertices, with columns `from` and `to`. For synthetic data, each row is the transition of one cell, with columns `from`, `to` and `w`.
+root\_milestone\_id | The name of the root vertex of the trajectory.
+type | 'UMI' or 'non-UMI'
+
+
+
+
+
+
+
+
+
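The `milestone_network` and `root_milestone_id` fields described in the README above together define a reference trajectory. As a minimal sketch of how they might be consumed, the snippet below builds an adjacency map from `from`/`to` pairs and lists the vertices in breadth-first order starting at the root. The function name `traverse_milestones` and the toy edge list are hypothetical and exist only for illustration; this is not part of the VITAE package. It assumes the dataframe rows have already been converted to `(from, to)` pairs, and it ignores the `w` column that synthetic datasets carry.

```python
from collections import defaultdict, deque


def traverse_milestones(edges, root):
    """Return the milestones reachable from `root` in breadth-first order.

    `edges` is an iterable of (from, to) pairs. Duplicate pairs are ignored,
    which matters for synthetic data where each row describes one cell.
    """
    adjacency = defaultdict(list)
    for src, dst in edges:
        if dst not in adjacency[src]:
            adjacency[src].append(dst)

    order, seen, queue = [], {root}, deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in adjacency.get(node, []):
            if nxt not in seen:  # guards against cycles, e.g. the cycle topology
                seen.add(nxt)
                queue.append(nxt)
    return order


# Toy bifurcating network with hypothetical labels, not taken from any dataset.
toy_edges = [("A", "B"), ("B", "C"), ("B", "D"), ("B", "C")]
print(traverse_milestones(toy_edges, root="A"))
```

Keeping the traversal independent of pandas and h5py means the same helper works whether the edges come from the dict returned with `return_dict=True` or from a file read in another way.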