# install.packages("HIMA")
library(HIMA)
library(knitr)
library(ggplot2)
library(ggrepel)
469 patients of glioblastoma multiforme have complete genomic data on
gene expression (UNC AgilentG4502A-07) archived in The Cancer Genome
Atlas (TCGA). Chemotherapy have also been reported to be associated with
survival of cancer patients.
Number of patients: 469
Hypothesis: chemotherapy affects survival outcome mainly through
its influence on gene expression levels
exposure (X): chemotherapy (Yes/No)
mediator (M): gene expression (17,450 genes)
outcome (Y): dichotomous 1-year survival status
confounders (C): Age, Gender
load("TCGA_GBM.RData") # GBM.df
# remove genes with missing value
missing.col <- colSums(is.na(GBM.df)) > 0
GBM.df <- GBM.df[,!missing.col]
head(GBM.df[,1:7])
## OS.time OS Age gender chemo_therapy A1BG A2BP1
## TCGA-02-0001-01 358 1 44 0 YES 0.708875 2.420750
## TCGA-02-0003-01 144 1 50 1 YES 0.675500 2.489875
## TCGA-02-0006-01 558 1 56 0 YES 1.049125 1.367250
## TCGA-02-0007-01 705 1 40 0 YES 0.546500 0.631375
## TCGA-02-0010-01 1077 1 20 0 YES -0.358750 1.104250
## TCGA-02-0011-01 630 1 18 0 YES 1.490250 1.446125
## generate exposure, outcome, mediator, and confounders
# exposure and outcome: vector
# mediator and confounders: matrix
exposure <- ifelse(GBM.df$chemo_therapy == "YES", 1, 0)
survival.time <- GBM.df$OS.time
outcome <- ifelse(survival.time > 365, 1, 0)
mediator <- as.matrix(GBM.df[,-c(1:5)])
confounders <- as.matrix(GBM.df[,c("Age", "gender")])
Wse can write a mediation model as following two regression equations. \[\begin{align*} M &= C + \alpha X + \epsilon_M \\ Y &= C + \theta X + \beta M + \epsilon_Y \end{align*}\] The mediation effect is composed of the product of \(\alpha\) and \(\beta\), which represent X-M association and M-Y association respectively. We can first visualize signal strength of \(\alpha\) and \(\beta\) by p-value. represents 0.05 threshold.
p.alpha <- apply(mediator, 2,
function(x) summary(lm(x ~ exposure + confounders))$coefficients[2,4])
p.beta <- apply(mediator, 2,
function(x) summary(glm(outcome ~ x + exposure + confounders,
family = "binomial"))$coefficients[2,4])
pvalue.df <- data.frame(geneID = colnames(mediator),
alpha = -log10(p.alpha),
beta = -log10(p.beta))
scatter.plt <- ggplot(pvalue.df, aes(x=alpha, y=beta)) +
geom_point() +
geom_vline(xintercept=-log10(0.05),
linetype="dashed", color = "red") +
geom_hline(yintercept=-log10(0.05),
linetype="dashed", color = "red") +
xlab("log p-value of X-M association") +
ylab("log p-value of M-Y association")
scatter.plt
Then, SIS (step 1) would identify the top d largest effects.
If Y is binary: \(d=\frac{n}{2log(n)}\)
and screening \(\alpha\) effect
If Y is continuous: \(d=\frac{2n}{log(n)}\) and screening \(\beta\) effect
Since the sample size of TCGA GBM is 469, we will pre-select top 39
largest \(\alpha\) effects, which are
highlighted in black.
n <- length(exposure)
d <- ceiling(n/(2*log(n)))
largest.alpha.index <- order(p.alpha)[1:d]
pvalue.df$SIS.selection <- rep("0", length(p.alpha))
pvalue.df$SIS.selection[largest.alpha.index] <- "1"
scatter.plt <- ggplot(pvalue.df, aes(x=alpha, y=beta,
colour = SIS.selection)) +
geom_point() +
scale_colour_manual(name="SIS",
values = c("0"="gray", "1"="black")) +
geom_vline(xintercept=-log10(0.05),
linetype="dashed", color = "red") +
geom_hline(yintercept=-log10(0.05),
linetype="dashed", color = "red") +
xlab("log p-value of X-M association") +
ylab("log p-value of M-Y association")
scatter.plt
HIMA would apply minimax concave penalty (MCP) or other penalty
regression methods to estimate mediation effects and make inference
based on these 39 mediators.
Outcome type:
- continuous: use hima function and set Y.family as “gaussian”
- binary: use hima function and set Y.family as “binomial”
- survival: use survHIMA function
HIMA.result <- hima(X = exposure,
Y = outcome,
M = mediator,
COV.XM = confounders,
COV.MY = confounders,
Y.family = "binomial",
M.family = "gaussian",
parallel = T,
ncore = 5,
penalty = "MCP")
kable(HIMA.result)
alpha | beta | gamma | alpha*beta | % total effect | Bonferroni.p | BH.FDR | |
---|---|---|---|---|---|---|---|
DHRS12 | 0.2270880 | -0.3282940 | 1.517597 | -0.0745516 | -4.9124789 | 0.0202180 | 0.0067547 |
NDUFA7 | 0.1575399 | 1.0372184 | 1.517597 | 0.1634033 | 10.7672359 | 0.0086816 | 0.0024884 |
OR52R1 | -0.1427211 | 0.0156191 | 1.517597 | -0.0022292 | -0.1468885 | 0.3949076 | 0.0987269 |
PIGS | -0.1480009 | 0.5642783 | 1.517597 | -0.0835137 | -5.5030204 | 0.0202642 | 0.0067547 |
Result: The MCP-penalized regression method finally
identifies 4 genes, which are “DHRS12”, “NDUFA7”, “OR52R1”, and “PIG”.
Based on joint significance test (MaxP test), only “OR52R1” is not
significant (p-value > 0.05).
Explanation:
- alpha\(*\)beta:
mediation effect
- gamma: total effect
- \(\%\) total effect:
proportion of mediation effect, which is mediation effect\(/\)total effect
- Bonferroni.p & BH.FDR: p-value
with Bonferroni correction or Benjamini-Hochberg correction
Conclusion: NDUFA7 has the strongest mediation effect
among all genes, while both DHRS12 and PIGS can explain around 5% of
total effect.
Only few genes have significant mediation contribution.
highlight.df <- pvalue.df[pvalue.df$geneID %in% rownames(HIMA.result),]
highlight.df$colors <- ifelse(HIMA.result$Bonferroni.p < 0.05, "< 0.05", "> 0.05")
scatter.plt + geom_point(data= highlight.df,
aes(x=alpha, y=beta, color=colors),
size=3) +
scale_colour_manual(name="HIMA",
values = c("< 0.05"="red", "> 0.05"="blue")) +
geom_text_repel(data=highlight.df,
aes(color=colors, label=geneID),
size = 5)
Reference: Zhang, Haixiang, et al. “Estimating and testing
high-dimensional mediation effects in epigenetic studies.”
Bioinformatics 32.20 (2016): 3150-3154.
DOI: 10.1093/bioinformatics/btw351