--- a
+++ b/vignettes/BloodCancerMultiOmics2017-dataOverview.Rmd
@@ -0,0 +1,294 @@
+---
+title: "BloodCancerMultiOmics2017 - data overview"
+author: "Małgorzata Oleś"
+output:
+    BiocStyle::html_document:
+        toc_float: true
+vignette: >
+    %\VignetteIndexEntry{BloodCancerMultiOmics2017 - data overview}
+    %\VignetteEngine{knitr::rmarkdown}
+    %\VignetteEncoding{UTF-8}
+---
+
+# Prerequisites
+
+```{r loadlib, message=FALSE}
+library("BloodCancerMultiOmics2017")
+# additional
+library("Biobase")
+library("SummarizedExperiment")
+library("DESeq2")
+library("reshape2")
+library("ggplot2")
+library("dplyr")
+library("BiocStyle")
+```
+
+
+# Introduction
+
+Primary tumor samples from blood cancer patients underwent functional and molecular characterization. `r Biocpkg("BloodCancerMultiOmics2017")` includes the resulting preprocessed data. A quick overview of the available data is provided below. For details on the experimental settings, please refer to:
+
+S Dietrich\*, M Oleś\*, J Lu\* et al. *Drug-perturbation-based stratification of blood cancer*
+<br>
+*J. Clin. Invest.* (2018); 128(1):427–445. doi:10.1172/JCI93801.
+
+\* equal contribution
+
+
+# Data overview
+
+Load all of the available data.
+```{r}
+data("conctab", "drpar", "lpdAll", "patmeta", "day23rep", "drugs",
+     "methData", "validateExp", "dds", "exprTreat", "mutCOM",
+     "cytokineViab")
+```
+
+The data sets are objects of different classes (`data.frame`, `ExpressionSet`, `NChannelSet`, `RangedSummarizedExperiment`, `DESeqDataSet`), and include data for either all studied patient samples or only a subset of these. The overview below briefly describes and summarizes the available data. Please note that the presence of a given patient sample ID within a data set doesn't necessarily mean that the data is available for this sample (the slot could be filled with NAs).
+
+Patient samples per data set.
+```{r numberOfSamples}
+samplesPerData = list(
+  drpar = colnames(drpar),
+  lpdAll = colnames(lpdAll),
+  day23rep = colnames(day23rep),
+  methData = colnames(methData),
+  patmeta = rownames(patmeta),
+  validateExp = unique(validateExp$patientID),
+  dds = colData(dds)$PatID,
+  exprTreat = unique(pData(exprTreat)$PatientID),
+  mutCOM = rownames(mutCOM),
+  cytokineViab = unique(cytokineViab$Patient)
+)
+```
+
+List of all samples present in data sets.
+```{r}
+(samples = sort(unique(unlist(samplesPerData))))
+```
+
+Total number of samples.
+```{r}
+length(samples)
+```
+
+A plot summarizing the presence of a given patient sample within each data set.
+```{r sampleOverlap, fig.height=4, fig.width=8, echo=FALSE}
+plotTab = melt(samplesPerData, value.name="PatientID")
+plotTab$L1 = factor(plotTab$L1, levels=c("patmeta",
+                                         "mutCOM",
+                                         "lpdAll",
+                                         "methData",
+                                         "exprTreat",
+                                         "dds",
+                                         "cytokineViab",
+                                         "day23rep",
+                                         "validateExp",
+                                         "drpar"))
+
+# order of the samples in the plot
+tmp = do.call(cbind, lapply(samplesPerData[c("drpar",
+                                             "validateExp",
+                                             "day23rep",
+                                             "dds",
+                                             "exprTreat",
+                                             "methData",
+                                             "cytokineViab")],
+                            function(x) {
+                              samples %in% x
+                            }))
+
+rownames(tmp) = samples
+ord = order(tmp[,1], tmp[,2], tmp[,3], tmp[,4], tmp[,5], tmp[,6], tmp[,7],
+            decreasing=TRUE)
+ordSamples = rownames(tmp)[ord]
+plotTab$PatientID = factor(plotTab$PatientID, levels=ordSamples)
+
+ggplot(plotTab, aes(x=PatientID, y=L1)) + geom_tile(fill="lightseagreen") +
+  scale_y_discrete(expand=c(0,0)) +
+  ylab("Data objects") +
+  xlab("Patient samples") +
+  geom_vline(xintercept=seq(10, length(samples),10), color="grey") +
+  geom_hline(yintercept=seq(0.5, length(levels(plotTab$L1)), 1),
+             color="dimgrey") +
+  theme(panel.grid=element_blank(),
+        text=element_text(size=18),
+        axis.text.x=element_blank(),
+        axis.ticks.x=element_blank(),
+        panel.background=element_rect(color="gainsboro"))
+```
+
+The classification below stratifies data sets according to different types of experiments
performed and included. Please refer to the manual for more detailed information on the content of these data objects.
+
+
+## Patient metadata
+
+Patient metadata is provided in the `patmeta` object.
+```{r}
+# Number of patients per disease
+sort(table(patmeta$Diagnosis), decreasing=TRUE)
+
+# Number of samples from pretreated patients
+table(!patmeta$IC50beforeTreatment)
+
+# IGHV status of CLL patients
+table(patmeta[patmeta$Diagnosis=="CLL", "IGHV"])
+```
+
+
+## High-throughput drug screen data
+
+The viability measurements from the high-throughput drug screen are included in the `drpar` object. The metadata about the drugs and the drug concentrations used can be found in the `drugs` and `conctab` objects, respectively.
+
+The `drpar` object includes multiple channels, each of which contains the cell viability data for a single drug concentration step. Channels `viaraw.1_5` and `viaraw.4_5` contain the mean viability score across multiple concentration steps, as indicated at the end of the channel name.
+
+```{r}
+channelNames(drpar)
+
+# show viability data for the first 5 patients and 7 drugs in their lowest conc.
+assayData(drpar)[["viaraw.1"]][1:7,1:5]
+```
+
+Drug metadata.
+```{r}
+# number of drugs
+nrow(drugs)
+
+# type of information included in the object
+colnames(drugs)
+```
+
+Drug concentration steps (c1 - lowest, c5 - highest).
+```{r}
+head(conctab)
+```
+
+The reproducibility of the screening platform was assessed by screening `r unname(ncol(day23rep))` patient samples in two replicates. The viability measurements are available for two time points: 48 h and 72 h after adding the drug. The screen was performed for `r length(unique(fData(day23rep)$DrugID))` drugs in 1-2 different drug concentrations (`r table(table(fData(day23rep)$DrugID))["1"]` in 1 and `r table(table(fData(day23rep)$DrugID))["2"]` in 2 drug concentrations). This data is provided in `day23rep`.
+```{r}
+channelNames(day23rep)
+
+# show viability data for the 48 h time point for all patients marked as
+# replicate 1 and the first 3 drugs in all their conc.
+drugs2Show = unique(fData(day23rep)$DrugID)[1:3]
+assayData(day23rep)[["day2rep1"]][fData(day23rep)$DrugID %in% drugs2Show,]
+```
+
+The follow-up drug screen, which confirmed the targets and the signaling pathway dependence of the patient samples, was performed for `r length(unique(validateExp$patientID))` samples and the following drugs: `r paste(unique(validateExp$Drug), collapse=", ")`.
+
+| Drug name   | Target |
+|-------------|--------|
+| Cobimetinib | MEK    |
+| Trametinib  | MEK    |
+| SCH772984   | ERK1/2 |
+| Ganetespib  | Hsp90  |
+| Onalespib   | Hsp90  |
+
+The data is included in the `validateExp` object.
+```{r}
+head(validateExp)
+```
+
+Moreover, we performed a small drug screen to check the influence of different cytokines/chemokines on the viability of the samples. These data are included in the `cytokineViab` object.
+
+```{r}
+head(cytokineViab)
+```
+
+
+## Gene mutation data
+
+The `mutCOM` object contains information on the presence of gene mutations in the studied patient samples.
+```{r}
+# there is only one channel with the binary type of data for each gene
+channelNames(mutCOM)
+
+# the feature data includes detailed information about mutations in
+# TP53 and BRAF genes, as well as clone size of
+# del17p13, KRAS, UMODL1, CREBBP, PRPF8, trisomy12 mutations
+colnames(fData(mutCOM))
+```
+
+
+## Gene expression data
+
+RNA-Seq data preprocessed with `r Biocpkg("DESeq2")` is provided in the `dds` object.
+
+```{r}
+# show count data for the first 5 patients and 7 genes
+assay(dds)[1:7,1:5]
+
+# show the above with patient sample ids
+assay(dds)[1:7,1:5] %>% `colnames<-` (colData(dds)$PatID[1:5])
+
+# number of genes and patient samples
+nrow(dds); ncol(dds)
+```
+
+Additionally, `r length(unique(pData(exprTreat)$PatientID))` patient samples underwent gene expression profiling using Illumina microarrays before and 12 h after treatment with `r tmp=unique(pData(exprTreat)$DrugID); length(tmp[!is.na(tmp)])` drugs. These data are included in the `exprTreat` data object.
+```{r}
+# patient samples included in the data set
+(p = unique(pData(exprTreat)$PatientID))
+
+# type of metadata included for each gene
+colnames(fData(exprTreat))
+
+# show expression level for the first patient and the first 3 probes
+Biobase::exprs(exprTreat)[1:3, pData(exprTreat)$PatientID==p[1]]
+```
+
+
+## DNA methylation data
+
+The DNA methylation data in the `methData` object cover `r ncol(methData)` patient samples and the 5000 most variable CpG sites.
+
+```{r}
+# show the methylation for the first 7 CpGs and the first 5 patient samples
+assay(methData)[1:7,1:5]
+
+# type of metadata included for CpGs
+colnames(rowData(methData))
+
+# number of patient samples screened with the given platform type
+table(colData(methData)$platform)
+```
+
+
+## Other
+
+Object `lpdAll` is a convenient assembly of data contained in the other data objects mentioned earlier in this vignette. For details, please refer to the manual.
+
+```{r}
+# number of rows in the dataset for each type of data
+table(fData(lpdAll)$type)
+
+# show viability data for the drugs ibrutinib, idelalisib and dasatinib
+# (in the mean of the two lowest concentration steps) and
+# the first 5 patient samples
+Biobase::exprs(lpdAll)[which(
+  with(fData(lpdAll),
+       name %in% c("ibrutinib", "idelalisib", "dasatinib") &
+         subtype=="4:5")), 1:5]
+```
+
+
+# Original data
+
+The raw data from the whole exome sequencing, RNA-seq and DNA methylation arrays is stored in the European Genome-Phenome Archive (EGA) under accession number EGAS0000100174.
+
+The preprocessed DNA methylation data, which include the complete list of CpG sites (not only the 5000 with the highest variance), can be accessed through the Bioconductor ExperimentHub platform.
+
+```{r eval=FALSE}
+library("ExperimentHub")
+
+eh = ExperimentHub()
+obj = query(eh, "CLLmethylation")
+meth = obj[["EH1071"]] # extract the methylation data
+```
+
+
+# Session info
+
+```{r}
+sessionInfo()
+```