a b/vignettes/BloodCancerMultiOmics2017-dataOverview.Rmd
---
title: "BloodCancerMultiOmics2017 - data overview"
author: "Małgorzata Oleś"
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{BloodCancerMultiOmics2017 - data overview}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Prerequisites

```{r loadlib, message=FALSE}
library("BloodCancerMultiOmics2017")
# additional
library("Biobase")
library("SummarizedExperiment")
library("DESeq2")
library("reshape2")
library("ggplot2")
library("dplyr")
library("BiocStyle")
```


# Introduction

Primary tumor samples from blood cancer patients underwent functional and molecular characterization. `r Biocpkg("BloodCancerMultiOmics2017")` includes the resulting preprocessed data. A quick overview of the available data is provided below. For details on the experimental settings, please refer to:

S Dietrich\*, M Oleś\*, J Lu\* et al. *Drug-perturbation-based stratification of blood cancer*
<br>
*J. Clin. Invest.* (2018); 128(1):427–445. doi:10.1172/JCI93801.

\* equal contribution


# Data overview

Load all of the available data.
```{r}
data("conctab", "drpar", "lpdAll", "patmeta", "day23rep", "drugs",
     "methData", "validateExp", "dds", "exprTreat", "mutCOM",
     "cytokineViab")
```

The data sets are objects of different classes (`data.frame`, `ExpressionSet`, `NChannelSet`, `RangedSummarizedExperiment`, `DESeqDataSet`), and cover either all studied patient samples or only a subset of them. The overview below briefly describes and summarizes the available data. Please note that the presence of a given patient sample ID within a data set does not necessarily mean that data is available for this sample (the corresponding slot could be filled with NAs).

Patient samples per data set.
```{r numberOfSamples}
samplesPerData = list(
  drpar = colnames(drpar),
  lpdAll = colnames(lpdAll),
  day23rep = colnames(day23rep),
  methData = colnames(methData),
  patmeta = rownames(patmeta),
  validateExp = unique(validateExp$patientID),
  dds = colData(dds)$PatID,
  exprTreat = unique(pData(exprTreat)$PatientID),
  mutCOM = rownames(mutCOM),
  cytokineViab = unique(cytokineViab$Patient)
)
```

List of all samples present in the data sets.
```{r}
(samples = sort(unique(unlist(samplesPerData))))
```

Total number of samples.
```{r}
length(samples)
```
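
The per-data-set sample lists above can also be summarized numerically. The following is a minimal sketch on made-up data: `toySamples`, its data set names and its sample IDs are hypothetical stand-ins for `samplesPerData`, so that the code runs without the package. It counts the samples per data set, builds a logical membership matrix, and finds the samples shared by all data sets.

```r
# Toy stand-in for samplesPerData; data set names and IDs are hypothetical
toySamples <- list(
  setA = c("P1", "P2", "P3"),
  setB = c("P2", "P3", "P4"),
  setC = c("P3")
)

# number of samples per data set
lengths(toySamples)

# logical membership matrix: one row per sample, one column per data set
allIDs <- sort(unique(unlist(toySamples)))
member <- sapply(toySamples, function(x) allIDs %in% x)
rownames(member) <- allIDs

# number of data sets each sample appears in
rowSums(member)

# samples present in every data set
Reduce(intersect, toySamples)
```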

A plot summarizing the presence of a given patient sample within each data set.
```{r sampleOverlap, fig.height=4, fig.width=8, echo=FALSE}
plotTab = melt(samplesPerData, value.name="PatientID")
plotTab$L1 = factor(plotTab$L1, levels=c("patmeta",
                                         "mutCOM",
                                         "lpdAll",
                                         "methData",
                                         "exprTreat",
                                         "dds",
                                         "cytokineViab",
                                         "day23rep",
                                         "validateExp",
                                         "drpar"))

# order of the samples in the plot
tmp = do.call(cbind, lapply(samplesPerData[c("drpar",
                                             "validateExp",
                                             "day23rep",
                                             "dds",
                                             "exprTreat",
                                             "methData",
                                             "cytokineViab")],
                            function(x) {
                              samples %in% x
                            }))

rownames(tmp) = samples
ord = order(tmp[,1], tmp[,2], tmp[,3], tmp[,4], tmp[,5], tmp[,6], tmp[,7],
            decreasing=TRUE)
ordSamples = rownames(tmp)[ord]
plotTab$PatientID = factor(plotTab$PatientID, levels=ordSamples)

ggplot(plotTab, aes(x=PatientID, y=L1)) + geom_tile(fill="lightseagreen") +
  scale_y_discrete(expand=c(0,0)) +
  ylab("Data objects") +
  xlab("Patient samples") +
  geom_vline(xintercept=seq(10, length(samples),10), color="grey") +
  geom_hline(yintercept=seq(0.5, length(levels(plotTab$L1)), 1),
             color="dimgrey") +
  theme(panel.grid=element_blank(),
        text=element_text(size=18),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        panel.background=element_rect(color="gainsboro"))
```
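
The ordering step in the plotting chunk passes the seven columns of `tmp` to `order()` one by one. As a sketch of an equivalent approach, not the code used for the figure, `do.call()` can hand over all columns at once, so the call does not change when data sets are added or removed. The matrix `m`, its column names and the sample IDs below are made up for illustration.

```r
# Toy membership matrix (hypothetical sample IDs and data set names)
m <- cbind(setA = c(TRUE, FALSE, TRUE, FALSE),
           setB = c(FALSE, TRUE, TRUE, TRUE),
           setC = c(TRUE, TRUE, FALSE, FALSE))
rownames(m) <- c("P1", "P2", "P3", "P4")

# split the matrix into an unnamed list of columns
cols <- lapply(seq_len(ncol(m)), function(j) m[, j])

# sort by every column in turn, TRUE before FALSE
ord <- do.call(order, c(cols, list(decreasing = TRUE)))
rownames(m)[ord]
```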

The classification below groups the data sets by the type of experiment they come from. Please refer to the manual for more detailed information on the content of these data objects.


## Patient metadata

Patient metadata is provided in the `patmeta` object.
```{r}
# Number of patients per disease
sort(table(patmeta$Diagnosis), decreasing=TRUE)

# Number of samples from pretreated patients (counted under TRUE)
table(!patmeta$IC50beforeTreatment)

# IGHV status of CLL patients
table(patmeta[patmeta$Diagnosis=="CLL", "IGHV"])
```


## High-throughput drug screen data

The viability measurements from the high-throughput drug screen are included in the `drpar` object. The metadata about the drugs and the drug concentrations used can be found in the `drugs` and `conctab` objects, respectively.

The `drpar` object includes multiple channels, each of which holds the cell viability data for a single drug concentration step. Channels `viaraw.1_5` and `viaraw.4_5` contain the mean viability score across multiple concentration steps, as indicated at the end of the channel name.

```{r}
channelNames(drpar)

# show viability data for the first 5 patients and 7 drugs
# at concentration step 1
assayData(drpar)[["viaraw.1"]][1:7,1:5]
```
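
To illustrate what a channel averaged over concentration steps looks like, the sketch below computes an element-wise mean of two per-step viability matrices. It assumes a plain arithmetic mean, which is an assumption made here for illustration; the package's own preprocessing may differ. The matrices `step4` and `step5`, the drug names and the sample IDs are made up.

```r
# Toy viability matrices for two concentration steps (values are made up);
# rows are drugs, columns are patient samples
step4 <- matrix(c(0.8, 0.6, 1.0, 0.4), nrow = 2,
                dimnames = list(c("drugA", "drugB"), c("P1", "P2")))
step5 <- matrix(c(1.0, 0.8, 1.0, 0.6), nrow = 2,
                dimnames = list(c("drugA", "drugB"), c("P1", "P2")))

# element-wise arithmetic mean over the selected steps
steps <- list(step4, step5)
mean4_5 <- Reduce(`+`, steps) / length(steps)
mean4_5
```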

Drug metadata.
```{r}
# number of drugs
nrow(drugs)

# type of information included in the object
colnames(drugs)
```

Drug concentration steps (c1 - highest, c5 - lowest).
```{r}
head(conctab)
```

The reproducibility of the screening platform was assessed by screening `r unname(ncol(day23rep))` patient samples in two replicates. The viability measurements are available for two time points: 48 h and 72 h after adding the drug. The screen was performed for `r length(unique(fData(day23rep)$DrugID))` drugs in 1-2 different drug concentrations (`r table(table(fData(day23rep)$DrugID))["1"]` in 1 and `r table(table(fData(day23rep)$DrugID))["2"]` in 2 drug concentrations). These data are provided in `day23rep`.
```{r}
channelNames(day23rep)

# show viability data at the 48 h time point (replicate 1) for all patients
# and the first 3 drugs at all their concentrations
drugs2Show = unique(fData(day23rep)$DrugID)[1:3]
assayData(day23rep)[["day2rep1"]][fData(day23rep)$DrugID %in% drugs2Show,]
```
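
One simple way to look at the agreement between two replicates is to correlate their viability values. The sketch below is an illustration on made-up numbers, not the reproducibility analysis from the publication; `rep1` and `rep2` are hypothetical vectors standing in for matching entries of two replicate channels.

```r
# Made-up viability values for the same drug/sample combinations,
# measured in two replicates
rep1 <- c(0.95, 0.80, 0.40, 0.10, 0.65)
rep2 <- c(0.90, 0.85, 0.35, 0.15, 0.70)

# Pearson correlation between the replicates
cor(rep1, rep2)

# mean absolute difference between the replicates
mean(abs(rep1 - rep2))
```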

The follow-up drug screen, which confirmed the targets and the signaling pathway dependence of the patient samples, was performed for `r length(unique(validateExp$patientID))` samples and the following drugs: `r paste(unique(validateExp$Drug), collapse=", ")`.

| Drug name   | Target |
|-------------|--------|
| Cobimetinib | MEK    |
| Trametinib  | MEK    |
| SCH772984   | ERK1/2 |
| Ganetespib  | Hsp90  |
| Onalespib   | Hsp90  |

The data is included in the `validateExp` object.
```{r}
head(validateExp)
```

We also performed a small drug screen to check the influence of different cytokines/chemokines on the viability of the samples. These data are included in the `cytokineViab` object.

```{r}
head(cytokineViab)
```


## Gene mutation data

The `mutCOM` object contains information on the presence of gene mutations in the studied patient samples.
```{r}
# there is only one channel with the binary type of data for each gene
channelNames(mutCOM)

# the feature data includes detailed information about mutations in
# TP53 and BRAF genes, as well as clone size of
# del17p13, KRAS, UMODL1, CREBBP, PRPF8, trisomy12 mutations
colnames(fData(mutCOM))
```


## Gene expression data

RNA-Seq data preprocessed with `r Biocpkg("DESeq2")` is provided in the `dds` object.

```{r}
# show count data for the first 5 patients and 7 genes
assay(dds)[1:7,1:5]

# show the above with patient sample ids
assay(dds)[1:7,1:5] %>% `colnames<-` (colData(dds)$PatID[1:5])

# number of genes and patient samples
nrow(dds); ncol(dds)
```

Additionally, `r length(unique(pData(exprTreat)$PatientID))` patient samples underwent gene expression profiling using Illumina microarrays before and 12 h after treatment with `r tmp=unique(pData(exprTreat)$DrugID); length(tmp[!is.na(tmp)])` drugs. These data are included in the `exprTreat` data object.
```{r}
# patient samples included in the data set
(p = unique(pData(exprTreat)$PatientID))

# type of metadata included for each gene
colnames(fData(exprTreat))

# show expression level for the first patient and the first 3 probes
Biobase::exprs(exprTreat)[1:3, pData(exprTreat)$PatientID==p[1]]
```


## DNA methylation data

The DNA methylation data in the `methData` object covers `r ncol(methData)` patient samples and the 5000 most variable CpG sites.

```{r}
# show the methylation for the first 7 CpGs and the first 5 patient samples
assay(methData)[1:7,1:5]

# type of metadata included for CpGs
colnames(rowData(methData))

# number of patient samples screened with the given platform type
table(colData(methData)$platform)
```
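
Restricting a matrix to its most variable rows, as was done for the CpG sites in `methData`, can be sketched as follows. This is an illustration on a made-up matrix and not necessarily the selection procedure used to build the package data; `beta`, the CpG names and the value of `n` are hypothetical.

```r
# Toy methylation-like matrix: 4 CpG sites (rows) x 3 samples (columns);
# all names and values are made up
beta <- rbind(cg1 = c(0.10, 0.90, 0.50),
              cg2 = c(0.50, 0.50, 0.50),
              cg3 = c(0.20, 0.30, 0.25),
              cg4 = c(0.00, 1.00, 0.00))

# per-row variance, then keep the n rows with the highest variance
n <- 2
rowVar <- apply(beta, 1, var)
top <- names(sort(rowVar, decreasing = TRUE))[seq_len(n)]
beta[top, ]
```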


## Other

Object `lpdAll` is a convenient assembly of data contained in the other data objects mentioned earlier in this vignette. For details, please refer to the manual.

```{r}
# number of rows in the dataset for each type of data
table(fData(lpdAll)$type)

# show viability data for the drugs ibrutinib, idelalisib and dasatinib
# (mean of the two lowest concentration steps) and
# the first 5 patient samples
Biobase::exprs(lpdAll)[which(
  with(fData(lpdAll),
       name %in% c("ibrutinib", "idelalisib", "dasatinib") &
         subtype=="4:5")), 1:5]
```
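
The subsetting pattern used above (select rows of the data matrix through a condition on the feature metadata) is shown below on a small made-up example that runs without the package. The table `feat`, the matrix `vals` and all labels in them are hypothetical; only the pattern mirrors the `lpdAll` call.

```r
# Toy feature table and data matrix with matching row names (all made up)
feat <- data.frame(name = c("drugA", "drugA", "drugB", "geneX"),
                   type = c("viab", "viab", "viab", "gen"),
                   subtype = c("1:5", "4:5", "4:5", NA),
                   row.names = c("f1", "f2", "f3", "f4"),
                   stringsAsFactors = FALSE)
vals <- matrix(seq_len(8), nrow = 4,
               dimnames = list(rownames(feat), c("P1", "P2")))

# which() turns the logical mask into row indices and ignores NA entries
keep <- which(with(feat, type == "viab" & subtype == "4:5"))
vals[keep, , drop = FALSE]
```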


# Original data

The raw data from whole exome sequencing, RNA-seq and DNA methylation arrays is stored in the European Genome-Phenome Archive (EGA) under accession number EGAS0000100174.

The preprocessed DNA methylation data, which includes the complete list of CpG sites (not only the 5000 with the highest variance), can be accessed through the Bioconductor ExperimentHub platform.

```{r eval=FALSE}
library("ExperimentHub")

eh = ExperimentHub()
obj = query(eh, "CLLmethylation")
meth = obj[["EH1071"]] # extract the methylation data
```


# Session info

```{r}
sessionInfo()
```