DataSet
Project | Data type | Datasets | # of samples | # of features | Data & Code Path |
---|---|---|---|---|---|
Proof-of-Concept | Image data | MNIST[1] | 70K images in 10 classes: 60K for training and 10K for testing. | 784 pixels (28 × 28) | paper/00_mnist/correlation |
Proof-of-Concept | Image data | FMNIST[2] | 70K images in 10 classes: 60K for training and 10K for testing. | 784 pixels (28 × 28) | paper/01_fmnist/correlation |
Example to run | Breast Cancer Diagnostic | WDBC[3] | 569 samples: 357 labeled benign and 212 labeled malignant. | 30 real-valued features of cell nuclei | paper/00_example_breast_cancer |
Pan Cancer | Transcriptomics | TCGA-T[4] | 10446 samples across 33 cancer types from the Pan-Cancer Atlas. The number of samples per class ranges from 45 to 1212, with an average of 317; 15 tumor types have fewer than 200 samples each. | 10381 normalized level-3 RNA-Seq gene expression features | paper/02_transcriptome/CNN |
Pan Cancer | Transcriptomics | TCGA-S & TCGA-G[5] | 18 datasets, each a binary classification task on a different cancer type and its stages or grades. The number of samples per task ranges from 179 to 1134, with an average of 486. | 17970 “O” genes with Z-score-transformed RNA-Seq gene expression data | paper/02_transcriptome/ML |
COVID-19 | Proteomics | Cov-D[6] | 363 samples (211 SARS-CoV-2 positives and 151 negatives) from 3 different labs. | 88 MALDI-MS signal peaks from nasal swabs | paper/03_COVID-19 |
COVID-19 | Proteomics & Metabolomics | Cov-S[7] | 41 patients: 31 in the training set (18 non-severe, 13 severe) and 10 in an independent cohort (6 non-severe, 4 severe). | 1486 markers from serum samples, including 649 proteins and 847 metabolites | paper/03_COV19_Severe |
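
As a quick sanity check on the WDBC row, the sketch below loads the Wisconsin Diagnostic Breast Cancer data and prints its shape and class balance. It assumes scikit-learn is installed and uses its bundled copy of the dataset via `load_breast_cancer`; it is not the loader used under `paper/00_example_breast_cancer`, which may read the data differently.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy stands in for the WDBC data
# referenced in the table; the repository's own example may load it differently.
data = load_breast_cancer()
X, y = data.data, data.target

n_samples, n_features = X.shape

# Map integer labels to class names ("malignant" / "benign") and count them.
class_counts = Counter(str(data.target_names[label]) for label in y)

print(f"samples: {n_samples}, features: {n_features}")
print(f"class counts: {dict(class_counts)}")
```

The printed sample count, feature count, and benign/malignant split can be compared directly against the numbers listed for WDBC above.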