--- a
+++ b/vignettes/multiGSEA.rmd
@@ -0,0 +1,415 @@
+---
+title: "multiGSEA: an example workflow"
+author: 
+- name: Sebastian Canzler
+  affiliation: Helmholtz-Centre for Environmental Research - UFZ, Leipzig, Germany
+- name: Jörg Hackermüller
+  affiliation: Helmholtz-Centre for Environmental Research - UFZ, Leipzig, Germany
+package: multiGSEA
+date: "Jan 06, 2025"
+output:
+  BiocStyle::html_document:
+    toc_float: true
+  BiocStyle::pdf_document: default
+bibliography: references.bib
+vignette: >
+  %\VignetteEncoding{UTF-8}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteIndexEntry{multiGSEA: an example workflow}
+---
+
+# Introduction
+
+The `multiGSEA` package was designed to run a robust GSEA-based pathway
+enrichment for multiple omics layers. The enrichment is calculated for each
+omics layer separately and aggregated p-values are calculated afterwards to
+derive a composite multi-omics pathway enrichment.
+
+Pathway definitions can be downloaded from up to eight different pathway
+databases by means of the `graphite` Bioconductor package [@graphite:2018].
+Feature mapping for transcripts and proteins is supported towards Entrez Gene
+IDs, Uniprot, Gene Symbol, RefSeq, and Ensembl IDs.  The mapping is accomplished
+through the `AnnotationDbi` package [@AnnotationDbi:2019] and currently
+supported for 11 different model organisms including human, mouse, and rat. ID
+conversion of metabolite features to Comptox Dashboard IDs (DTXCID, DTXSID),
+CAS-numbers, Pubchem IDs (CID), HMDB, KEGG, ChEBI, Drugbank IDs, or common
+metabolite names is accomplished through the AnnotationHub package
+`metabliteIDmapping`. This package provides a comprehensive ID mapping for more
+than 1.1 million entries.
+
+This vignette covers a simple example workflow illustrating how the `multiGSEA`
+package works.  The omics data sets that will be used throughout the example
+were originally provided by Quiros _et al._ [@Quiros:2017].  In their publication
+the authors analyzed the mitochondrial response to four different toxicants,
+including Actinonin, Diclofenac, FCCB, and Mito-Block (MB), within the
+transcriptome, proteome, and metabolome layer. The transcriptome data can be
+downloaded from [NCBI
+GEO](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE84631), the proteome
+data from the [ProteomeXchange
+Consortium](http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD006293),
+and the non-targeted metabolome raw data can be found in the online supplement.
+Please note, this vignette focuses solely on the Actinonin data set.
+
+## Installation
+
+There are two different ways to install the `multiGSEA` package.
+
+First, the `multiGSEA` package is part of
+[Bioconductor](https://bioconductor.org/packages/devel/bioc/html/multiGSEA.html).
+Hence, the best way to install the package is via `BiocManager`. Start R
+(>=4.0.0) and run the following commands:
+
+```{r bioconductor, eval=FALSE}
+if (!requireNamespace("BiocManager", quietly = TRUE)) {
+  install.packages("BiocManager")
+}
+
+# The following initializes usage of Bioc devel
+BiocManager::install(version = "devel")
+
+BiocManager::install("multiGSEA")
+```
+
+Alternatively, the `multiGSEA` package is hosted on our github page
+https://github.com/yigbt/multiGSEA and can be installed via the
+`devtools` package:
+
+```{r devtools, eval=FALSE}
+install.packages("devtools")
+devtools::install_github("https://github.com/yigbt/multiGSEA")
+```
+
+Once installed, just load the `multiGSEA` package with:
+
+```{r load_multiGSEA, eval=FALSE}
+library(multiGSEA)
+```
+
+
+# Example workflow 
+
+A common workflow which is exemplified in this vignette is typically separated
+in the following steps:
+
+1. Load required libraries, including the `multiGSEA` package, and omics data sets.
+2. Create data structure for enrichment analysis.
+3. Download and customize the pathway definitions.
+4. Run the pathway enrichment for each omics layer.
+5. Calculate the aggregated pathway enrichment.
+
+
+## Load libraries and omics data
+
+At first, we need to load the necessary packages to run the multi-omics
+enrichment analysis. In our example data we have to deal with omics data that
+has been measured in human cell lines. We therefore need the `org.Hs.eg.db`
+package [@CarlsonHs:2019] for transcript and protein mapping. In case the
+omics data was measured in mouse or rat, we would need the packages
+`org.Mm.eg.db` [@CarlsonMm:2019] and `org.Rn.eg.db` [@CarlsonRn:2019],
+respectively. 
+
+```{r load_mapping_library, results='hide', warning=FALSE, message=FALSE}
+library("org.Hs.eg.db")
+```
+
+In principle, `multiGSEA` is able to deal with 11 different model organisms.
+A summary of supported organisms, their naming format within `multiGSEA` and
+their respective `AnnotationDbi` package is shown in Table
+\@ref(tab:organismsTable).
+
+```{r organismsTable, results='asis', echo=FALSE}
+caption <- "Supported organisms, their abbreviations being used in `multiGSEA`,
+            and mapping database that are needed for feature conversion.
+            Supported abbreviations can be seen with `getOrganisms()`"
+df <- data.frame(
+  Organisms = c(
+    "Human", "Mouse", "Rat", "Dog", "Cow", "Pig",
+    "Chicken", "Zebrafish", "Frog", "Fruit Fly",
+    "C.elegans"
+  ),
+  Abbreviations = c(
+    "hsapiens", "mmusculus", "rnorvegicus",
+    "cfamiliaris", "btaurus", "sscrofa",
+    "ggallus", "drerio", "xlaevis",
+    "dmelanogaster", "celegans"
+  ),
+  Mapping = c(
+    "org.Hs.eg.db", "org.Mm.eg.db", "org.Rn.eg.db",
+    "org.Cf.eg.db", "org.Bt.eg.db", "org.Ss.eg.db",
+    "org.Gg.eg.db", "org.Xl.eg.db", "org.Dr.eg.db",
+    "org.Dm.eg.db", "org.Ce.eg.db"
+  )
+)
+
+knitr::kable(df, caption = caption)
+```
+
+To run the analysis of this vignette, load the installed version of
+`multiGSEA`.
+
+```{r load_multiGSEA_package, results='hide', message=FALSE, warning=FALSE}
+library(multiGSEA)
+library(magrittr)
+```
+
+Load the omics data for each layer where an enrichment should be
+calculated.  The example data is provided within the package and
+already preprocessed such that we have log2 transformed fold changes
+and their associated p-values.  As mentioned previously, the vignette
+covers the Actinonin-treated data set from Quiros _et al._ [@Quiros:2017].
+
+```{r load_omics_data}
+# load transcriptomic features
+data(transcriptome)
+
+# load proteomic features
+data(proteome)
+
+# load metabolomic features
+data(metabolome)
+```
+
+
+### Cautionary note
+
+This example involves preprocessed omics data from public
+repositories, which means that the data might look different when you
+download and pre-process it with your own workflow. Therefore, we put
+our processed data as an example data in the R package. We here sketch
+out the pipeline described in the `multiGSEA` paper. We will not focus
+on the pre-processing steps and how to derive the necessary input
+format for the multi-omics pathway enrichment in terms of
+differentially expression analysis, since this is highly dependent on
+your experiment and analysis workflow.
+
+However, the required input format is quite simple and exactly the
+same for each input omics layer: A data frame with 3 mandatory columns,
+including feature IDs, the log2-transformed fold change (logFC), and
+the associated p-value.
+
+The header of the data frame can be seen in Table \@ref(tab:omicsTable):
+
+```{r omicsTable, results='asis', echo=FALSE}
+caption <- "Structure of the necessary omics data. For each layer
+            (here: transcriptome), feature IDs, log2FCs, and p-values
+            are needed."
+
+knitr::kable(
+  transcriptome %>%
+    dplyr::arrange(Symbol) %>%
+    dplyr::slice(1:6),
+  caption = caption
+)
+```
+
+
+
+## Create data structure
+
+`multiGSEA` works with nested lists where each sublist represents an
+omics layer.  The function `rankFeatures` calculates the pre-ranked
+features, that are needed for the subsequent calculation of the
+enrichment score. `rankFeatures` calculates the a local statistic `ls`
+based on the direction of the fold change and the magnitude of its
+significance:
+
+\begin{equation}
+  ls = sign( log_2(FC)) * log_{10}( p-value)
+\end{equation}
+
+
+Please note, that any other rank metric will work as well and the
+choice on which one to chose is up to the user. However, as it was
+shown by Zyla _et al._ [@Zyla:2017], the choice of the applied metric
+might have a big impact on the outcome of your analysis.
+
+```{r rank_features, results='hide'}
+# create data structure
+omics_data <- initOmicsDataStructure(layer = c(
+  "transcriptome",
+  "proteome",
+  "metabolome"
+))
+
+## add transcriptome layer
+omics_data$transcriptome <- rankFeatures(
+  transcriptome$logFC,
+  transcriptome$pValue
+)
+names(omics_data$transcriptome) <- transcriptome$Symbol
+
+## add proteome layer
+omics_data$proteome <- rankFeatures(proteome$logFC, proteome$pValue)
+names(omics_data$proteome) <- proteome$Symbol
+
+## add metabolome layer
+## HMDB features have to be updated to the new HMDB format
+omics_data$metabolome <- rankFeatures(metabolome$logFC, metabolome$pValue)
+names(omics_data$metabolome) <- metabolome$HMDB
+names(omics_data$metabolome) <- gsub(
+  "HMDB", "HMDB00",
+  names(omics_data$metabolome)
+)
+```
+
+The first elements of each omics layer are shown below:
+
+```{r omics_short}
+omics_short <- lapply(names(omics_data), function(name) {
+  head(omics_data[[name]])
+})
+names(omics_short) <- names(omics_data)
+omics_short
+```
+
+## Download and customize pathway definitions
+
+Now we have to select the databases we want to query and the omics
+layer we are interested in before pathway definitions are downloaded
+and features are mapped to the desired format.
+
+```{r calculate_enrichment, results='hide', message=FALSE, warning=FALSE}
+databases <- c("kegg", "reactome")
+layers <- names(omics_data)
+
+pathways <- getMultiOmicsFeatures(
+  dbs = databases, layer = layers,
+  returnTranscriptome = "SYMBOL",
+  returnProteome = "SYMBOL",
+  returnMetabolome = "HMDB",
+  useLocal = FALSE
+)
+```
+
+The first two pathway definitions of each omics layer are shown below:
+
+```{r pathways_short}
+pathways_short <- lapply(names(pathways), function(name) {
+  head(pathways[[name]], 2)
+})
+names(pathways_short) <- names(pathways)
+pathways_short
+```
+
+
+
+## Run the pathway enrichment
+
+At this stage, we have the ranked features for each omics layer and the
+extracted and mapped features from external pathway databases. In the following
+step we can use the `multiGSEA` function to calculate the enrichment for all
+omics layer separately.
+
+```{r run_enrichment, results='hide', message=FALSE, warning=FALSE}
+# use the multiGSEA function to calculate the enrichment scores
+# for all omics layer at once.
+enrichment_scores <- multiGSEA(pathways, omics_data)
+```
+
+The enrichment score data structure is a list containing sublists named
+`transcriptome`, `proteome`, and `metabolome`. Each sublist represents the
+complete pathway enrichment for this omics layer.
+
+## Calculate the aggregated p-values
+
+Making use of the Stouffers Z-method to combine multiple p-values that have been
+derived from independent tests that are based on the same null hypothesis. The
+function `extractPvalues` creates a simple data frame where each row represents
+a pathway and columns represent omics related p-values and adjusted p-values.
+This data structure can then be used to calculate the aggregated p-value. The
+subsequent calculation of adjusted p-values can be achieved by the function
+`p.adjust`.
+
+`multiGSEA` provided three different methods to aggregate p-values. These
+methods differ in their way how they weight either small or large p-values. By
+default, `combinePvalues` will apply the Z-method or Stouffer's method
+[@Stouffer:1949] which has no bias towards small or large p-values. The widely
+used Fisher's combined probability test [@Fisher:1932] can also be applied but
+is known for its bias towards small p-values. Edgington's method goes the
+opposite direction by favoring large p-values [@Edgington:1972]. Those methods
+can be applied by setting the parameter `method` to "fisher" or "edgington".
+
+```{r combine_pvalues}
+df <- extractPvalues(
+  enrichmentScores = enrichment_scores,
+  pathwayNames = names(pathways[[1]])
+)
+
+df$combined_pval <- combinePvalues(df)
+df$combined_padj <- p.adjust(df$combined_pval, method = "BH")
+
+df <- cbind(data.frame(pathway = names(pathways[[1]])), df)
+```
+
+**Please note:** In former versions of `multiGSEA`, adjusted p-values of
+individual omics-layers have been combined. The correction for multiple testing
+prior to the p-value combination will, however, violate the assumptions about
+p-values of certain combination methods (e.g., Fisher's combined probability
+test).
+
+Therefore, **multiGSEA uses p-values for combination from RELEASE 3.18**.
+To keep previous results reproducible, we introduced the `col_pattern` parameter
+which can be set to `padj` to use adjusted p-values.
+
+
+Finally, print the pathways sorted based on their combined adjusted p-values.
+For displaying reasons, only the adjusted p-values are shown in Table
+\@ref(tab:resultTable).
+
+```{r resultTable, results='asis', echo=FALSE}
+caption <- "Table summarizing the top 15 pathways where we can calculate an
+            enrichment for all three layers . Pathways from KEGG, Reactome,
+            and BioCarta are listed based on their aggregated adjusted p-value.
+            Corrected p-values are displayed for each omics layer separately and
+            aggregated by means of the Stouffer's Z-method."
+
+knitr::kable(
+  df %>%
+    dplyr::arrange(combined_padj) %>%
+    dplyr::filter(!is.na(metabolome_padj)) %>%
+    dplyr::select(c(pathway, transcriptome_padj, proteome_padj, metabolome_padj, combined_pval)) %>%
+    dplyr::slice(1:15),
+  caption = caption
+)
+```
+
+# Customizable gene sets
+
+In principle, `multiGSEA` can also be run as single/multi omics analysis on
+custom gene sets.
+
+The `pathways` object storing the pathway features across multiple omics layers
+is a nested list, and hence can be designed manually to fit ones purposes.
+
+In the following example, HALLMARK gene sets are retrieved from `MSigDB` and
+used to create a transcriptomics input list:
+
+```{r msigdbr, eval=FALSE}
+library(msigdbr)
+library(dplyr)
+
+hallmark <- msigdbr(species = "Rattus norvegicus", category = "H") %>%
+  dplyr::select(gs_name, ensembl_gene) %>%
+  dplyr::filter(!is.na(ensembl_gene)) %>%
+  unstack(ensembl_gene ~ gs_name)
+
+pathways <- list("transcriptome" = hallmark)
+```
+
+**Please note**, feature sets across multiple omics layers have to be in the
+same order and their names have to be identical, see the example presented above.
+
+
+# Session Information
+
+Here is the output of `sessionInfo()` on the system on which this document was
+compiled:
+
+```{r session, echo=FALSE}
+sessionInfo()
+```
+
+
+# References
+
+