a b/Lecture 3/lecture_3.Rmd
1
---
2
title: "Multi-Omic Dimension Reduction"
3
author: "Jack Pattee"
4
date: "4/18/2023"
5
output:
6
  html_document:
7
    code_folding: hide
8
  pdf_document: default
9
---
10
11
```{r setup, include=FALSE}
12
knitr::opts_chunk$set(echo = TRUE)
13
14
###load relevant packages
15
library(r.jive)
16
library(cluster)
17
library(mclust)
18
library(TCGAbiolinks)
19
library(survival)
20
library(ggplot2)
21
library(ggfortify)
22
23
set.seed(3308)
24
```
25
26
```{r}
27
###define local path to save out plot results
28
localPath = "~/Documents/Miscellaneous/SISG_plots/"
29
30
###Load multi-omic data breast cancer tumor data from The Cancer Genome Atlas
31
data(BRCA_data)
32
33
###Gene expression data
34
expressionDat = Data$Expression
35
36
###Methylation data
37
methylationDat = Data$Methylation
38
39
###micro-RNA data
40
mirnaDat = Data$miRNA
41
42
###run JIVE
43
###too slow for live session, but determines the tuning parameters
44
#jive(Data)
45
###with the tuning parameters already determined
46
results = jive(Data,method="given",rankJ=2,rankA=c(27,26,25))
47
48
49
###use TCGA biolinks to get clinical phenotype data
50
dataClin <- GDCquery_clinic(project = "TCGA-BRCA","clinical") 
51
###get tumor subtype from 'morphology' variable
52
###8500/3: Invasive breast carcinoma of no special type (aka Ductal Carcinoma) (estimated ~75%)
53
###8520/3: Lobular carcinoma (estimated ~12 %)
54
###call the rest 'other'
55
###do some formatting and matching
56
sampFormatted = rep(NA,ncol(expressionDat))
57
for(i in 1:ncol(expressionDat)){
58
  tempSplit = strsplit(colnames(expressionDat)[i],"\\.")
59
  sampFormatted[i] = paste0(tempSplit[[1]][1],"-",tempSplit[[1]][2],"-",tempSplit[[1]][3])
60
}
61
m = match(sampFormatted,dataClin$submitter_id)
62
subtypeVec = dataClin$morphology[m]
63
###Define two variables
64
###First: 1 for DC, 0 for other
65
###Second: 1 for DC, 2 for lobular carcinoma, 0 for other
66
subtypeTwo = rep(0, length(subtypeVec))
67
subtypeTwo[subtypeVec=="8500/3"] = 1
68
subtypeThree = rep(0, length(subtypeVec))
69
subtypeThree[subtypeVec=="8500/3"] = 1
70
subtypeThree[subtypeVec=="8520/3"] = 2
71
72
###define survival endpoint, censored by 'days to last follow up'
73
survivalTime = dataClin$days_to_death[m]
74
survivalInd = rep(1, length(survivalTime))
75
survivalInd[which(is.na(survivalTime))] = 0
76
survivalTime[which(is.na(survivalTime))] = dataClin$days_to_last_follow_up[m][which(is.na(survivalTime))]
77
###define age variable
78
age = dataClin$age_at_index[m]
79
80
```
81
82
This document details an application of dimension reduction methods to multi-omic data from a sample of breast cancer tumors collected via The Cancer Genome Atlas. Data is available for `r ncol(expressionDat)` subjects. Data is available for three genomic variables: gene expression (with `r nrow(expressionDat)` measured genes), methylation (with `r nrow(methylationDat)` probes), and micro RNA (with `r nrow(mirnaDat)` molecules). Our goal is to explore the distribution of this data via dimension reduction approaches. We will use the JIVE method, which considers 'omic-specific and mixed factors, and stacked PCA, which does not.
83
84
Although dimension reduction is a fundamentally unsupervised method, we demonstrate the utility of the dimension reduction approache with respect to some clinical variables. In particular, we consider tumor subtype and survival time. Tumor subtype is considered in two forms: dichotomized into 'ductal carcinoma' and 'other', and trichotomized into 'ductal carcinoma', 'lobular carcinoma', and 'other'. Survival is defined as the number of days to death, and is censored by the number of days to last follow up.
85
86
```{r}
87
###describe the distribution of tumor subtype
88
print(summary(factor(subtypeThree, levels = c(0,1,2), labels = c("Other", "Ductal Carcinoma", "Lobular Carcinoma"))))
89
90
###plot the survival curve
91
autoplot(survfit(Surv(survivalTime,survivalInd)~1)) + xlab("Time (days)") + ylab("Survival")
92
```
93
94
Jive uses a permutation methodology to parse structure from residual noise, i.e., to determine the number of factors to use in the reconstruction of the joint structure and the per-omic individual structure. This permutation process can be run with the commented-out code in the code block below; however, to reduce processing time, the selected ranks have been input manually. 
95
96
Show the variance explained in each of the omic data types.
97
98
```{r}
99
###show variance explained
100
showVarExplained(results) 
101
```
102
103
Plot the heatmap of joint and individual variance. This needs to be saved out to a file as it does not display well within R or RMarkdown. Red represents positive values, blue represents negative values.
104
105
```{r}
106
png(paste0(localPath,"HeatmapsBRCA.png"),height=700,width=850)
107
showHeatmaps(results)
108
dev.off() 
109
```
110
111
Visualize the joint variation projected into two dimensions, with points colored according to the trichotomous representation of tumor subtype.
112
113
```{r show-fig}
114
Colors = rep('black',348)
115
Colors[subtypeThree==2] = 'green'
116
Colors[subtypeThree==0] = 'purple'
117
showPCA(results,n_joint=2,Colors=Colors)
118
legend(x = "topright", # Position
119
       legend = c("DC", "LC", "Other"),   # Legend texts
120
       pch = c(1,1,1),
121
       col = c("black","green","purple"))  # Line colors  
122
```
123
124
Plot the 1-dimensional reduction of each data type-specific variation against one another to look for trends.
125
126
```{r, fig.height=9, fig.width=13, fig.align='center'}
127
showPCA(results,n_joint=1,n_indiv=c(1,1,1),Colors=Colors)
128
```
129
130
The above plots are generated via r.jive plotting functions. To have greater control of our plotting and modeling options, we extract the joint and individual factors from the JIVE decomposition as below. We will then investigate clustering and association modeling approaches, and compare how using JIVE factors perform versus PCs from stacked PCA.
131
132
```{r}
133
rankJV <- results$rankJ
134
J<-numeric(0)
135
ng<-0
136
137
###get joint factors for use in plotting, clustering
138
for(j in 1:length(Data)){
139
  J <- rbind(J,results$joint[[j]]);
140
  ng<-c(ng,dim(results$joint[[j]])[1])
141
}
142
svd.o <- svd(J)
143
jV <- svd.o$v %*% diag(svd.o$d);
144
factorMat=jV[,1:rankJV]
145
146
###get factor loadings for individual data
147
###code ported from jive.r 'showPCA()' function
148
n_joint = 0
149
n_indiv = c(2,2,2)
150
l <- length(results$data)
151
152
nPCs = sum(n_indiv)
153
indivPCs = matrix(nrow=nPCs,ncol = dim(results$data[[1]])[2])
154
PC_names = rep('',nPCs)
155
156
for(i in 1:l){
157
  if(n_indiv[i]>0){
158
    SVD = svd(results$individual[[i]],nu=n_indiv[i],nv=n_indiv[i])
159
    indices = (n_joint+sum(n_indiv[0:(i-1)])+1):(n_joint+sum(n_indiv[0:i]))
160
    indivPCs[indices,] = diag(SVD$d)[1:n_indiv[i],1:n_indiv[i]]%*%t(SVD$v[,1:n_indiv[i]])
161
    PC_names[indices] = paste(names(results$data)[i]," Indiv ",1:n_indiv[i]) 
162
  }}
163
indivPCs = t(indivPCs)
164
colnames(indivPCs) = PC_names
165
166
```
167
168
To test whether this process was performed correctly, we can plot the first two joint factors against one another and compare to the 'showPCA()' plot from r.jive. These plots should be identical. We see that they are.
169
170
```{r}
171
###this should match the 'showPCA' plot above
172
plot(factorMat[,2],factorMat[,1],col = Colors, xlab = "Joint Factor 2", ylab = "Joint Factor 1")
173
```
174
175
We will now explore some clustering analyses using the JIVE factors, and compare these approaches to clustering using the stacked principal components. First, we will compare clustering using the two joint factors from JIVE and the first two PCs from stacked principal components. We will first investigate k-means clustering and use the gap statistic to determine the optimal number of clusters.
176
177
```{r}
178
###normalize JIVE factors for modeling 
179
factorMat = scale(factorMat)
180
indivPCs = scale(indivPCs)
181
182
###scale and norm each data type before PCA
183
scaleExp = scale(t(expressionDat),scale = FALSE)
184
scaleMeth = scale(t(methylationDat),scale = FALSE)
185
scaleMi = scale(t(mirnaDat),scale = FALSE)
186
stackedData = cbind(scaleExp/sum(abs(scaleExp)),
187
                    scaleMeth/sum(abs(scaleMeth)),
188
                    scaleMi/sum(abs(scaleMi)))
189
stackedPcs = prcomp(stackedData, retx = TRUE)
190
stackedPcFactors = scale(stackedPcs$x)
191
192
jiveGap = clusGap(factorMat, FUN = kmeans, nstart = 10, K.max = 10, B = 10)
193
plot(jiveGap, main = "Gap Statistic: JIVE, k-means, two features")
194
195
stackedGap = clusGap(stackedPcFactors[,1:2], FUN = kmeans, nstart = 10, K.max = 10, B = 10)
196
plot(stackedGap, main = "Gap statistic: stacked PCs, k-means, two features")
197
```
198
199
Plot the first and second joint factor against the first and second joint PC.
200
201
```{r}
202
plot(factorMat[,1],stackedPcFactors[,1],xlab = "JIVE Factor 1", ylab = "Stacked PC 1")
203
legend("topleft", legend = paste0("r: ",round(cor(factorMat[,1],stackedPcFactors[,1]),2)))
204
205
plot(factorMat[,2],stackedPcFactors[,2],xlab = "JIVE Factor 2", ylab = "Stacked PC 2")
206
legend("topleft", legend = paste0("r: ",round(cor(factorMat[,2],stackedPcFactors[,2]),2)))
207
```
208
209
It appears that three clusters is optimal for both approaches. To investigate the 'quality' of these clusterings, we will compare the three cluster assignments to the trichotomous representation of tumor subtypes. Note: it could be that the clusterings are representative of structure from a different source, so this comparison is by no means an 'absolute' measure of clustering quality.
210
211
```{r}
212
jiveClust = kmeans(factorMat,centers = 3, nstart = 5)
213
print(paste0("Adjusted Rand index for two-factor JIVE k-means clustering vs trichotomous tumor subtype: ",round(adjustedRandIndex(subtypeThree,jiveClust$cluster), 2)))
214
table(data.frame(Truth = subtypeThree, JIVE = jiveClust$cluster))
215
216
stackedClust = kmeans(stackedPcFactors[,1:2],centers = 3, nstart = 5)
217
print(paste0("Adjusted Rand index for two-factor stacked PCA k-means clustering vs trichotomous tumor subtype: ",round(adjustedRandIndex(subtypeThree,stackedClust$cluster), 2)))
218
table(data.frame(Truth = subtypeThree, Stacked = stackedClust$cluster))
219
220
table(data.frame(Stacked = stackedClust$cluster, JIVE = jiveClust$cluster))
221
```
222
223
K-means clustering assumes spherical cluster shape, which may not be the case in our data. Allow for a more flexible clustering with Gaussian mixture models, with the BIC used to select the number of clusters and the covariance structure.
224
225
```{r}
226
jiveGmmBic = mclustBIC(factorMat, G = c(1:10))
227
plot(jiveGmmBic, main = "BIC: Jive, GMM, two features")
228
summary(jiveGmmBic)
229
230
jiveGmmClust = Mclust(factorMat,x = jiveGmmBic)
231
print(paste0("Adjusted Rand index for two-factor JIVE GMM clustering vs trichotomous tumor subtype: ",round(adjustedRandIndex(subtypeThree,jiveGmmClust$classification), 2)))
232
table(data.frame(Truth = subtypeThree, JIVE = jiveGmmClust$classification))
233
plot(jiveGmmClust, what = "classification", main = "GMM Classification: JIVE, two features")
234
235
pcGmmBic = mclustBIC(stackedPcFactors[,1:2], G = c(1:10))
236
plot(pcGmmBic, main = "BIC: Stacked PCA, GMM, two features")
237
summary(pcGmmBic)
238
239
pcGmmClust = Mclust(stackedPcs$x[,1:2],x = pcGmmBic)
240
print(paste0("Adjusted Rand index for two-factor stacked PCA GMM clustering vs trichotomous tumor subtype: ",round(adjustedRandIndex(subtypeThree,pcGmmClust$classification), 2)))
241
table(data.frame(Truth = subtypeThree, Stacked = pcGmmClust$classification))
242
plot(pcGmmClust, what = "classification", main = "GMM Classification: stacked PCA, two features")
243
244
print("Compare clustering assignments for JIVE and stacked PCA: three clusters")
245
tempDatThree = data.frame(Stacked = pcGmmClust$classification, JIVE = jiveGmmClust$classification)
246
table(tempDatThree)
247
```
248
249
It appears that JIVE performs somewhat better than stacked PCs; however, the adjusted rand indices are still quite low. Let's investigate clustering with two clusters, and see how this compares to the dichotomous representation of tumor subtypes. We will again use Gaussian mixture modeling given the apparently non-spherical cluster structure.
250
251
```{r}
252
jiveGmmBicTwo = mclustBIC(factorMat,G = 2)
253
#summary(jiveGmmBicTwo)
254
jiveGmmClustTwo = Mclust(factorMat,x = jiveGmmBicTwo)
255
print(paste0("Adjusted Rand index for two-factor JIVE GMM clustering vs dichotomous tumor subtype: ",round(adjustedRandIndex(subtypeTwo,jiveGmmClustTwo$classification), 2)))
256
table(data.frame(Truth = subtypeTwo,JIVE = jiveGmmClustTwo$classification))
257
plot(jiveGmmClustTwo, what = "classification", main = "GMM Classification: JIVE, two features")
258
259
pcGmmBicTwo = mclustBIC(stackedPcFactors[,1:2],G = 2)
260
#summary(pcGmmBicTwo)
261
pcGmmClustTwo = Mclust(stackedPcFactors[,1:2],x = pcGmmBicTwo)
262
print(paste0("Adjusted Rand index for two-factor stacked PCA GMM clustering vs dichotomous tumor subtype: ",round(adjustedRandIndex(subtypeTwo,pcGmmClustTwo$classification), 2)))
263
table(data.frame(Truth = subtypeTwo, Stacked = pcGmmClustTwo$classification))
264
plot(pcGmmClustTwo, what = "classification", main = "GMM Classification: stacked PCA, two features")
265
266
print("Compare clustering assignments for JIVE and stacked PCA: two clusters")
267
tempDatTwo = data.frame(Stacked = pcGmmClustTwo$classification, JIVE = jiveGmmClustTwo$classification)
268
table(tempDatTwo)
269
```
270
271
Let's see if adding additional factors changes the clustering assignment at all. We will add the first data-specific factor for each of the three data types in JIVE, and the three subsequent PCs for stacked PCA. Compare to the dichotomous representation of tumor subtype.
272
273
```{r}
274
jive5Factors = cbind(factorMat,indivPCs[,c(1,3,5)])
275
jive5GmmBic = mclustBIC(jive5Factors, G = 2)
276
#summary(jive5GmmBic)
277
jive5GmmClust = Mclust(jive5Factors,x = jive5GmmBic)
278
print(paste0("Adjusted Rand index for five-factor JIVE GMM clustering vs dichotomous tumor subtype: ",round(adjustedRandIndex(subtypeTwo,jive5GmmClust$classification), 2)))
279
table(data.frame(Truth = subtypeTwo,JIVE = jive5GmmClust$classification))
280
281
pc5GmmBic = mclustBIC(stackedPcFactors[,1:5], G = 2)
282
#summary(pc5GmmBic)
283
pc5GmmClust = Mclust(stackedPcFactors[,1:5],x = pc5GmmBic)
284
print(paste0("Adjusted Rand index for five-factor stacked PCA GMM clustering vs dichotomous tumor subtype: ",round(adjustedRandIndex(subtypeTwo,pc5GmmClust$classification), 2)))
285
table(data.frame(Truth = subtypeTwo, Stacked = pc5GmmClust$classification))
286
```
287
288
Estimate logistic regression models to associate derived features with the dichotomous representation of tumor subtype. Age is also included as a covariate.
289
290
```{r}
291
jiveDat = cbind.data.frame(subtypeTwo,jive5Factors,age)
292
colnames(jiveDat) = c("y","joint1","joint2","express1","meth1","mirna1","age")
293
jiveGlm = glm(y~.,data = jiveDat, family = "binomial")
294
print(summary(jiveGlm))
295
296
pcDat = cbind.data.frame(subtypeTwo, stackedPcFactors[,1:5],age)
297
colnames(pcDat) = c("y",paste0("pc",1:5),"age")
298
pcMod = glm(y~.,data = pcDat, family = "binomial")
299
summary(pcMod)
300
```
301
302
Estimate cox proportional hazards models to associate derived features with survival outcome. Age is also included as a covariate.
303
304
```{r}
305
jiveDatSurv = cbind.data.frame(survivalTime,survivalInd,jive5Factors,age)
306
colnames(jiveDatSurv) = c("survTime","survInd","joint1","joint2","express1","meth1","mirna1","age")
307
jiveModPh = coxph(Surv(survTime,survInd)~., data = jiveDatSurv)
308
summary(jiveModPh)
309
310
pcDatSurv = cbind.data.frame(survivalTime,survivalInd,stackedPcFactors[,1:5],age)
311
colnames(pcDatSurv) = c("survTime","survInd",paste0("pc",1:5),"age")
312
pcModPh = coxph(Surv(survTime,survInd)~., data = pcDatSurv)
313
summary(pcModPh)
314
```