--- a
+++ b/vignettes/exposureBiomarkers.Rmd
@@ -0,0 +1,201 @@
+---
+title: "Identifying Biomarkers from an Exposure Variable with `biotmle`"
+author: "Nima Hejazi"
+date: "`r Sys.Date()`"
+output:
+  BiocStyle::html_document
+bibliography: ../inst/REFERENCES.bib
+vignette: >
+  %\VignetteIndexEntry{Identifying Biomarkers from an Exposure Variable}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+## Introduction
+
+The `biotmle` R package can be used to isolate biomarkers in two ways: based
+on the associations of genomic objects with an exposure variable of interest.
+In this vignette, we illustrate how to use `biotmle` to isolate and visualize
+genes associated with an __exposure__, using a data set containing microarray
+expression measures from an Illumina platform. In the analysis described below,
+targeted maximum likelihood estimation (TMLE) is used to transform the
+microarray expression values based on the influence curve representation of the
+Average Treatment Effect (ATE). Following this transformation, the moderated
+t-statistic of Smyth [@smyth2004linear] is used to test for a binary group-wise
+difference (based on the exposure variable), using the tools provided by the R
+package [`limma`](https://bioconductor.org/packages/limma)).
+
+For a general discussion of the framework of targeted maximum likelihood
+estimation and the role this approach plays in statistical causal inference, the
+interested reader is invited to consult @vdl2011targeted and @vdl2018targeted.
+For a more general introduction to the principles of statistical causal
+inference, @pearl2000causality serves well.
+
+---
+
+## Biomarker Identification
+
+First, we load the `biotmle` package and the (included) `illuminaData` data set:
+
+```{r setup_data, message=FALSE, warning=FALSE}
+library(dplyr)
+library(biotmle)
+library(biotmleData)
+library(BiocParallel)
+library(SuperLearner)
+library(SummarizedExperiment, quietly=TRUE)
+data(illuminaData)
+set.seed(13847)
+```
+
+In order to perform Targeted Minimum Loss-Based Estimation, we need three
+separate data structures: (1) _W_, baseline covariates that could potentially
+confound the association of biomarkers with the exposure of interest; (2) _A_,
+the exposure of interest; and (3) _Y_, the biomarkers of interest. All values in
+_W_ and _A_ ought to be discretized, in order to avoid practical violations of
+the assumption of positivity. With the `illuminaData` data set below, we
+discretize the age variable in the phenotype-level data below (this can be
+accessed via the `colData` of the `SummarizedExperiment` object). To invoke the
+biomarker assessment function (`biomarkertmle`), we also need to specify a
+variable of interest (or the position of said variable in the design matrix). We
+do both in just a few lines below:
+
+```{r clean_data}
+# discretize "age" in the phenotype-level data
+colData(illuminaData) <- colData(illuminaData) %>%
+  data.frame %>%
+  mutate(age = as.numeric(age > median(age))) %>%
+  DataFrame
+benz_idx <- which(names(colData(illuminaData)) %in% "benzene")
+```
+
+The TMLE-based biomarker discovery process can be invoked using the
+`biomarkertmle` function. The procedure is quite resource-intensive because it
+evaluates the association of each individual potential biomarker (of which there
+are over 20,000 in the included data set) with an exposure of interest, while
+accounting for potential confounding based on all other covariates included in
+the design matrix. We demonstrate the necessary syntax for calling the
+`biomarkertmle` function below, on a small number of the probes:
+
+```{r biomarkerTMLE_eval, message=FALSE, warning=FALSE}
+# compute TML estimates to evaluate differentially expressed biomarkers
+biotmle_out <- biomarkertmle(se = illuminaData[1:20, ],
+                             varInt = benz_idx,
+                             g_lib = c("SL.mean", "SL.glm"),
+                             Q_lib = c("SL.bayesglm", "SL.ranger"),
+                             cv_folds = 2,
+                             bppar_type = SerialParam()
+                            )
+```
+
+Note that parallelization is controlled entirely through the [`BiocParallel`
+package](https://bioconductor.org/packages/release/bioc/html/BiocParallel.html),
+and we set `SerialParam()` here for _sequential_ evaluation.
+
+The output of `biomarkertmle` is an object of class `bioTMLE`, containing four
+new slots: (1) `call`, the call to `biomarkertmle`; (2) `topTable`, an empty
+slot meant to hold the output of `limma::topTable`, after a later call to
+`modtest_ic`; and (3) `tmleOut`, a `data.frame` containing the point estimates
+of the associations of each biomarker with the exposure of interest based on the
+influence curve representation of the Average Treatment Effect.
+
+The output of `biomarkertmle` can be directly fed to `modtest_ic`, a wrapper
+around `limma::lmFit` and `limma::topTable` that outputs a `biotmle` object
+with the slots described above completely filled in. The `modtest_ic` function
+requires as input a `biotmle` object containing a data frame in the `tmleOut`
+field as well as a design matrix indicating the groupwise difference to be
+tested. The design matrix should contain an intercept term and a term for the
+exposure of interest (with discretized exposure levels). _Based on the relevant
+statistical theory, it is not appropriate to include any further terms in the
+design matrix (n.b., this differs from standard calls to `limma::lmFit`)_.
+
+```{r limmaTMLE_eval}
+modtmle_out <- modtest_ic(biotmle = biotmle_out)
+```
+
+After invoking `modtest_ic`, the resultant `bioTMLE` object will contain all
+information relevant to the analytic procedure for identifying biomarkers: that
+is, it will contain the original call to `biomarkertmle`, the result of running
+`limma::topTable`, and the result of running `biomarkertmle`. The statistical
+results of this procedure can be extracted from the `topTable` object in the
+`bioTMLE` object produced by `modtest_ic`.
+
+---
+
+## Visualization of Results
+
+This package provides several plotting methods that can be used to visualize
+the results of the TMLE-based biomarker discovery process. We demonstrate the
+syntax for calling the generic plotting methods below but refrain from showing
+the plots themselves since they are not particularly informative.
+
+The `plot` method for a `bioTMLE` object will produce a histogram of the
+adjusted p-values of each biomarker (based on the Benjamini-Hochberg procedure
+for controlling the False Discovery Rate) as generated by `limma::topTable`:
+
+```{r pval_hist_limma_adjp, eval=FALSE}
+plot(x = modtmle_out, type = "pvals_adj")
+```
+
+Setting the argument `type = "pvals_raw"` will instead produce a histogram of
+the raw p-values _(these are less informative and should, in general, not be
+used for inferential purposes, as the computation producing these p-values
+ignores the multiple testing nature of the biomarker discovery problem)_:
+
+```{r pval_hist_limma_rawp, eval=FALSE}
+plot(x = modtmle_out, type = "pvals_raw")
+```
+
+Heatmaps are useful graphics for visualizing the relationship between measures
+on genomic objects and covariates of interest. The `heatmap_ic` function
+provides this graphic for `bioTMLE` objects, allowing for the relationship
+between the exposure variable and some number of "top" biomarkers (as
+determined by the call to `modtest_ic`) to be visualized. In general, the
+heatmap for `bioTMLE` objects expresses how the contributions of each biomarker
+to the Average Treatment Effect (ATE) vary across differences in the exposure
+variable (that is, there is a causal interpretation to the findings). The plot
+produced is a `ggplot2` object and can be modified in place if stored properly.
+For our analysis:
+
+```{r heatmap_limma_results}
+benz_idx <- which(names(colData(illuminaData)) %in% "benzene")
+designVar <- as.data.frame(colData(illuminaData))[, benz_idx]
+designVar <- as.numeric(designVar == max(designVar))
+
+# build heatmap
+heatmap_ic(x = modtmle_out, left.label = "none", scale = TRUE,
+           clustering.method = "hierarchical", row.dendrogram = TRUE,
+           design = designVar, FDRcutoff = 1, top = 10)
+```
+
+The heatmap produced in this way is actually a form of _supervised clustering_,
+as described more generally (as _supervised distance matrices_) by
+@pollard2008supervised, wherein the notion of deriving clustering procedures
+from the results of supervised learning methods is formulated. Since the heatmap
+is based on the contributions of observations to the efficient influence
+function (EIF) of the target parameter, it directly visualizes the degree to
+which each biomarker informs the difference (due to the treatment effect)
+represented by the average treatment effect.
+
+The volcano plot is standard graphical tools for examining how changes in
+expression relate to the raw p-value. The utility of such plots lies in their
+providing a convenient way to identify and systematically ignore those genomic
+objects that have extremely low p-values due to extremely low variance between
+observations. The `volcano_ic` function provides much of the same
+interpretation, except that the fold change values displayed in the x-axis refer
+to changes in the _contributions of each biomarker to the Average Treatment
+Effect_ (in standard practice, for microarray technology, these would be fold
+changes in gene expression). The plot produced is a `ggplot2` object and, as
+such, can be modified in place. For our analysis:
+
+```{r volcano_plot_limma_results}
+volcano_ic(biotmle = modtmle_out)
+```
+
+---
+
+## Session Information
+
+```{r sessionInfo, echo=FALSE}
+sessionInfo()
+```