Install packages

# install.packages("HIMA")
library(HIMA)
library(knitr)
library(ggplot2)
library(ggrepel)

Introduction to TCGA data (GBM) and causal diagram

469 patients of glioblastoma multiforme have complete genomic data on gene expression (UNC AgilentG4502A-07) archived in The Cancer Genome Atlas (TCGA). Chemotherapy have also been reported to be associated with survival of cancer patients.
Number of patients: 469
Hypothesis: chemotherapy affects survival outcome mainly through its influence on gene expression levels

exposure (X): chemotherapy (Yes/No)
mediator (M): gene expression (17,450 genes)
outcome (Y): dichotomous 1-year survival status
confounders (C): Age, Gender
Causal diagram

load("TCGA_GBM.RData")  # GBM.df
# remove genes with missing value
missing.col <- colSums(is.na(GBM.df)) > 0
GBM.df <- GBM.df[,!missing.col]
head(GBM.df[,1:7])
##                 OS.time OS Age gender chemo_therapy      A1BG    A2BP1
## TCGA-02-0001-01     358  1  44      0           YES  0.708875 2.420750
## TCGA-02-0003-01     144  1  50      1           YES  0.675500 2.489875
## TCGA-02-0006-01     558  1  56      0           YES  1.049125 1.367250
## TCGA-02-0007-01     705  1  40      0           YES  0.546500 0.631375
## TCGA-02-0010-01    1077  1  20      0           YES -0.358750 1.104250
## TCGA-02-0011-01     630  1  18      0           YES  1.490250 1.446125
## generate exposure, outcome, mediator, and confounders
# exposure and outcome: vector
# mediator and confounders: matrix
exposure <- ifelse(GBM.df$chemo_therapy == "YES", 1, 0)
survival.time <- GBM.df$OS.time
outcome <- ifelse(survival.time > 365, 1, 0)
mediator <- as.matrix(GBM.df[,-c(1:5)])
confounders <- as.matrix(GBM.df[,c("Age", "gender")])

Step 1: Sure independence screening (SIS)

Wse can write a mediation model as following two regression equations. \[\begin{align*} M &= C + \alpha X + \epsilon_M \\ Y &= C + \theta X + \beta M + \epsilon_Y \end{align*}\] The mediation effect is composed of the product of \(\alpha\) and \(\beta\), which represent X-M association and M-Y association respectively. We can first visualize signal strength of \(\alpha\) and \(\beta\) by p-value. represents 0.05 threshold.

p.alpha <- apply(mediator, 2,
                 function(x) summary(lm(x ~ exposure + confounders))$coefficients[2,4])
p.beta  <- apply(mediator, 2,
                 function(x) summary(glm(outcome ~ x + exposure + confounders, 
                                         family = "binomial"))$coefficients[2,4])
pvalue.df <- data.frame(geneID = colnames(mediator),
                        alpha = -log10(p.alpha), 
                        beta = -log10(p.beta))

scatter.plt <- ggplot(pvalue.df, aes(x=alpha, y=beta)) + 
                  geom_point() + 
                  geom_vline(xintercept=-log10(0.05), 
                             linetype="dashed", color = "red") +
                  geom_hline(yintercept=-log10(0.05), 
                             linetype="dashed", color = "red") +
                  xlab("log p-value of X-M association") + 
                  ylab("log p-value of M-Y association")
scatter.plt

Then, SIS (step 1) would identify the top d largest effects.
If Y is binary: \(d=\frac{n}{2log(n)}\) and screening \(\alpha\) effect
If Y is continuous: \(d=\frac{2n}{log(n)}\) and screening \(\beta\) effect
Since the sample size of TCGA GBM is 469, we will pre-select top 39 largest \(\alpha\) effects, which are highlighted in black.

n <- length(exposure)
d <- ceiling(n/(2*log(n)))
largest.alpha.index <- order(p.alpha)[1:d]
pvalue.df$SIS.selection <- rep("0", length(p.alpha))
pvalue.df$SIS.selection[largest.alpha.index] <- "1"
scatter.plt <- ggplot(pvalue.df, aes(x=alpha, y=beta, 
                                     colour = SIS.selection)) + 
                  geom_point() + 
                  scale_colour_manual(name="SIS",  
                      values = c("0"="gray", "1"="black")) +
                  geom_vline(xintercept=-log10(0.05), 
                             linetype="dashed", color = "red") +
                  geom_hline(yintercept=-log10(0.05), 
                             linetype="dashed", color = "red") +
                  xlab("log p-value of X-M association") + 
                  ylab("log p-value of M-Y association")
scatter.plt

HIMA would apply minimax concave penalty (MCP) or other penalty regression methods to estimate mediation effects and make inference based on these 39 mediators.

HIMA

Estimation

Outcome type:
- continuous: use hima function and set Y.family as “gaussian”
- binary: use hima function and set Y.family as “binomial”
- survival: use survHIMA function

HIMA.result <- hima(X = exposure,
                    Y = outcome,
                    M = mediator,
                    COV.XM = confounders,
                    COV.MY = confounders,
                    Y.family = "binomial",
                    M.family = "gaussian",
                    parallel = T,
                    ncore = 5,
                    penalty = "MCP")
kable(HIMA.result)
alpha beta gamma alpha*beta % total effect Bonferroni.p BH.FDR
DHRS12 0.2270880 -0.3282940 1.517597 -0.0745516 -4.9124789 0.0202180 0.0067547
NDUFA7 0.1575399 1.0372184 1.517597 0.1634033 10.7672359 0.0086816 0.0024884
OR52R1 -0.1427211 0.0156191 1.517597 -0.0022292 -0.1468885 0.3949076 0.0987269
PIGS -0.1480009 0.5642783 1.517597 -0.0835137 -5.5030204 0.0202642 0.0067547

Result: The MCP-penalized regression method finally identifies 4 genes, which are “DHRS12”, “NDUFA7”, “OR52R1”, and “PIG”. Based on joint significance test (MaxP test), only “OR52R1” is not significant (p-value > 0.05).

Explanation:
- alpha\(*\)beta: mediation effect
- gamma: total effect
- \(\%\) total effect: proportion of mediation effect, which is mediation effect\(/\)total effect
- Bonferroni.p & BH.FDR: p-value with Bonferroni correction or Benjamini-Hochberg correction

Conclusion: NDUFA7 has the strongest mediation effect among all genes, while both DHRS12 and PIGS can explain around 5% of total effect.

Visualization

Only few genes have significant mediation contribution.

highlight.df <- pvalue.df[pvalue.df$geneID %in% rownames(HIMA.result),]
highlight.df$colors <- ifelse(HIMA.result$Bonferroni.p < 0.05, "< 0.05", "> 0.05")
scatter.plt + geom_point(data= highlight.df, 
                         aes(x=alpha, y=beta, color=colors), 
                         size=3) +
              scale_colour_manual(name="HIMA",  
                      values = c("< 0.05"="red", "> 0.05"="blue")) +
              geom_text_repel(data=highlight.df, 
                              aes(color=colors, label=geneID), 
                              size = 5)

Reference: Zhang, Haixiang, et al. “Estimating and testing high-dimensional mediation effects in epigenetic studies.” Bioinformatics 32.20 (2016): 3150-3154.
DOI: 10.1093/bioinformatics/btw351