--- a +++ b/vignettes/exposureBiomarkers.Rmd @@ -0,0 +1,201 @@ +--- +title: "Identifying Biomarkers from an Exposure Variable with `biotmle`" +author: "Nima Hejazi" +date: "`r Sys.Date()`" +output: + BiocStyle::html_document +bibliography: ../inst/REFERENCES.bib +vignette: > + %\VignetteIndexEntry{Identifying Biomarkers from an Exposure Variable} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +## Introduction + +The `biotmle` R package can be used to isolate biomarkers in two ways: based +on the associations of genomic objects with an exposure variable of interest. +In this vignette, we illustrate how to use `biotmle` to isolate and visualize +genes associated with an __exposure__, using a data set containing microarray +expression measures from an Illumina platform. In the analysis described below, +targeted maximum likelihood estimation (TMLE) is used to transform the +microarray expression values based on the influence curve representation of the +Average Treatment Effect (ATE). Following this transformation, the moderated +t-statistic of Smyth [@smyth2004linear] is used to test for a binary group-wise +difference (based on the exposure variable), using the tools provided by the R +package [`limma`](https://bioconductor.org/packages/limma)). + +For a general discussion of the framework of targeted maximum likelihood +estimation and the role this approach plays in statistical causal inference, the +interested reader is invited to consult @vdl2011targeted and @vdl2018targeted. +For a more general introduction to the principles of statistical causal +inference, @pearl2000causality serves well. + +--- + +## Biomarker Identification + +First, we load the `biotmle` package and the (included) `illuminaData` data set: + +```{r setup_data, message=FALSE, warning=FALSE} +library(dplyr) +library(biotmle) +library(biotmleData) +library(BiocParallel) +library(SuperLearner) +library(SummarizedExperiment, quietly=TRUE) +data(illuminaData) +set.seed(13847) +``` + +In order to perform Targeted Minimum Loss-Based Estimation, we need three +separate data structures: (1) _W_, baseline covariates that could potentially +confound the association of biomarkers with the exposure of interest; (2) _A_, +the exposure of interest; and (3) _Y_, the biomarkers of interest. All values in +_W_ and _A_ ought to be discretized, in order to avoid practical violations of +the assumption of positivity. With the `illuminaData` data set below, we +discretize the age variable in the phenotype-level data below (this can be +accessed via the `colData` of the `SummarizedExperiment` object). To invoke the +biomarker assessment function (`biomarkertmle`), we also need to specify a +variable of interest (or the position of said variable in the design matrix). We +do both in just a few lines below: + +```{r clean_data} +# discretize "age" in the phenotype-level data +colData(illuminaData) <- colData(illuminaData) %>% + data.frame %>% + mutate(age = as.numeric(age > median(age))) %>% + DataFrame +benz_idx <- which(names(colData(illuminaData)) %in% "benzene") +``` + +The TMLE-based biomarker discovery process can be invoked using the +`biomarkertmle` function. The procedure is quite resource-intensive because it +evaluates the association of each individual potential biomarker (of which there +are over 20,000 in the included data set) with an exposure of interest, while +accounting for potential confounding based on all other covariates included in +the design matrix. We demonstrate the necessary syntax for calling the +`biomarkertmle` function below, on a small number of the probes: + +```{r biomarkerTMLE_eval, message=FALSE, warning=FALSE} +# compute TML estimates to evaluate differentially expressed biomarkers +biotmle_out <- biomarkertmle(se = illuminaData[1:20, ], + varInt = benz_idx, + g_lib = c("SL.mean", "SL.glm"), + Q_lib = c("SL.bayesglm", "SL.ranger"), + cv_folds = 2, + bppar_type = SerialParam() + ) +``` + +Note that parallelization is controlled entirely through the [`BiocParallel` +package](https://bioconductor.org/packages/release/bioc/html/BiocParallel.html), +and we set `SerialParam()` here for _sequential_ evaluation. + +The output of `biomarkertmle` is an object of class `bioTMLE`, containing four +new slots: (1) `call`, the call to `biomarkertmle`; (2) `topTable`, an empty +slot meant to hold the output of `limma::topTable`, after a later call to +`modtest_ic`; and (3) `tmleOut`, a `data.frame` containing the point estimates +of the associations of each biomarker with the exposure of interest based on the +influence curve representation of the Average Treatment Effect. + +The output of `biomarkertmle` can be directly fed to `modtest_ic`, a wrapper +around `limma::lmFit` and `limma::topTable` that outputs a `biotmle` object +with the slots described above completely filled in. The `modtest_ic` function +requires as input a `biotmle` object containing a data frame in the `tmleOut` +field as well as a design matrix indicating the groupwise difference to be +tested. The design matrix should contain an intercept term and a term for the +exposure of interest (with discretized exposure levels). _Based on the relevant +statistical theory, it is not appropriate to include any further terms in the +design matrix (n.b., this differs from standard calls to `limma::lmFit`)_. + +```{r limmaTMLE_eval} +modtmle_out <- modtest_ic(biotmle = biotmle_out) +``` + +After invoking `modtest_ic`, the resultant `bioTMLE` object will contain all +information relevant to the analytic procedure for identifying biomarkers: that +is, it will contain the original call to `biomarkertmle`, the result of running +`limma::topTable`, and the result of running `biomarkertmle`. The statistical +results of this procedure can be extracted from the `topTable` object in the +`bioTMLE` object produced by `modtest_ic`. + +--- + +## Visualization of Results + +This package provides several plotting methods that can be used to visualize +the results of the TMLE-based biomarker discovery process. We demonstrate the +syntax for calling the generic plotting methods below but refrain from showing +the plots themselves since they are not particularly informative. + +The `plot` method for a `bioTMLE` object will produce a histogram of the +adjusted p-values of each biomarker (based on the Benjamini-Hochberg procedure +for controlling the False Discovery Rate) as generated by `limma::topTable`: + +```{r pval_hist_limma_adjp, eval=FALSE} +plot(x = modtmle_out, type = "pvals_adj") +``` + +Setting the argument `type = "pvals_raw"` will instead produce a histogram of +the raw p-values _(these are less informative and should, in general, not be +used for inferential purposes, as the computation producing these p-values +ignores the multiple testing nature of the biomarker discovery problem)_: + +```{r pval_hist_limma_rawp, eval=FALSE} +plot(x = modtmle_out, type = "pvals_raw") +``` + +Heatmaps are useful graphics for visualizing the relationship between measures +on genomic objects and covariates of interest. The `heatmap_ic` function +provides this graphic for `bioTMLE` objects, allowing for the relationship +between the exposure variable and some number of "top" biomarkers (as +determined by the call to `modtest_ic`) to be visualized. In general, the +heatmap for `bioTMLE` objects expresses how the contributions of each biomarker +to the Average Treatment Effect (ATE) vary across differences in the exposure +variable (that is, there is a causal interpretation to the findings). The plot +produced is a `ggplot2` object and can be modified in place if stored properly. +For our analysis: + +```{r heatmap_limma_results} +benz_idx <- which(names(colData(illuminaData)) %in% "benzene") +designVar <- as.data.frame(colData(illuminaData))[, benz_idx] +designVar <- as.numeric(designVar == max(designVar)) + +# build heatmap +heatmap_ic(x = modtmle_out, left.label = "none", scale = TRUE, + clustering.method = "hierarchical", row.dendrogram = TRUE, + design = designVar, FDRcutoff = 1, top = 10) +``` + +The heatmap produced in this way is actually a form of _supervised clustering_, +as described more generally (as _supervised distance matrices_) by +@pollard2008supervised, wherein the notion of deriving clustering procedures +from the results of supervised learning methods is formulated. Since the heatmap +is based on the contributions of observations to the efficient influence +function (EIF) of the target parameter, it directly visualizes the degree to +which each biomarker informs the difference (due to the treatment effect) +represented by the average treatment effect. + +The volcano plot is standard graphical tools for examining how changes in +expression relate to the raw p-value. The utility of such plots lies in their +providing a convenient way to identify and systematically ignore those genomic +objects that have extremely low p-values due to extremely low variance between +observations. The `volcano_ic` function provides much of the same +interpretation, except that the fold change values displayed in the x-axis refer +to changes in the _contributions of each biomarker to the Average Treatment +Effect_ (in standard practice, for microarray technology, these would be fold +changes in gene expression). The plot produced is a `ggplot2` object and, as +such, can be modified in place. For our analysis: + +```{r volcano_plot_limma_results} +volcano_ic(biotmle = modtmle_out) +``` + +--- + +## Session Information + +```{r sessionInfo, echo=FALSE} +sessionInfo() +```