Biomedical-Data-Analysis / Git / [930d1e] /GoogleCloud/Submodule08_Differential_Analysis

Models:
DanielG/
Biomedical-Data-Analysis
Downloads: 1
[930d1e]: / GoogleCloud / Submodule08_Differential_Analysis_Proteomics.ipynb
History
Download this file
1134 lines (1133 with data), 47.4 kB

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f63519ce-ca5a-4f64-93a5-c5491b58d46b",
   "metadata": {},
   "source": [
    "<img src=\"images/RIINBRE-Logo.jpg\" width=\"400\" height=\"400\"><img src=\"images/MIC_Logo.png\" width=\"600\" height=\"600\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc55d591-83ea-4db3-9c26-23831e09008b",
   "metadata": {},
   "source": [
    "# Analysis of Biomedical Data for Biomarker Discovery\n",
    "<a id=\"top8\"></a>\n",
    "## Submodule 8: Identification of IRI Biomarkers from Proteomic Data\n",
    "### Dr. Christopher L. Hemme\n",
    "### Director, [RI-INBRE Molecular Informatics Core](https://web.uri.edu/riinbre/mic/)\n",
    "### The University of Rhode Island College of Pharmacy"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ca07ca3-7e42-42aa-ac12-36f85cd8cd4a",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "725c6951",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "This Jupyter Notebook performs a differential expression analysis on proteomic data to identify biomarkers for Ischemia-Reperfusion Injury (IRI).  Using the `limma` package in BioConductor, the notebook explores the impact of surgical treatment and a drug treatment (treprostinil) on protein expression levels over time.  It begins with data preparation, loading a pre-processed experimental object containing normalized proteomic data and metadata.  A linear regression model is built, first considering treatment alone, then incorporating time as a combined covariate.  The analysis utilizes MA plots, volcano plots, and the `topTable` function to visualize and rank differentially expressed proteins. The notebook concludes by discussing the next steps in biomarker discovery, including pathway analysis, meta-analysis, experimental validation, and comparison to known biomarkers, highlighting the increasing role of machine learning in handling the growing complexity of biomedical data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "59cf4818",
   "metadata": {},
   "source": [
    "## Learning Objectives\n",
    "\n",
    "1. **Data Preparation for Differential Analysis:**  How to prepare proteomic data and metadata for analysis in R, including loading necessary packages, reading data from saved objects, combining covariates, and cleaning data.\n",
    "\n",
    "2. **Linear Regression for Differential Expression:** How to use the `limma` package in Bioconductor to perform linear regression-based differential expression analysis, accounting for batch effects using `duplicateCorrelation` and `lmFit`.\n",
    "\n",
    "3. **Working with Contrasts:** How to define and use contrasts to extract pairwise comparisons from the global regression model, enabling more accurate and sophisticated comparisons between different experimental conditions.\n",
    "\n",
    "4. **Interpreting MA Plots and Volcano Plots:** How to generate and interpret MA plots and volcano plots to visualize differential expression results, understand data distribution, assess normalization effects, and identify significantly changed proteins.\n",
    "\n",
    "5. **Extracting Top Differentially Expressed Proteins:** How to use `topTable` to extract the most significantly differentially expressed proteins based on log2 fold change and adjusted p-values.\n",
    "\n",
    "6. **Considering Time and Treatment Effects:** How to combine time and treatment covariates to analyze the dynamic changes in protein expression due to treatment over time.\n",
    "\n",
    "7. **Next Steps in Biomarker Discovery:**  Understanding the broader context of biomarker discovery beyond differential expression analysis, including pathway analysis, meta-analysis, and experimental validation.\n",
    "\n",
    "8. **Introduction to Machine Learning for Biomarker Discovery:** Briefly introduces the emerging role of machine learning in biomarker discovery and sets the stage for the next submodule."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b62cd792",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "1. **Create a Vertex AI Notebooks instance:** Choose your machine type and other configurations.\n",
    "2. **Select the R kernel:**  Make sure you choose the appropriate R environment when creating or opening the notebook.\n",
    "3. **Install R Packages:** The notebook itself handles the installation of the necessary R packages (`devtools`,`statmod`,`preprocessCore`,`limma`,`ComplexHeatmap`,`M3C`,`EnhancedVolcano`,`Glimma`,`plyr`,`dplyr`,`matrixStats`,`tidyverse`)\n",
    "4. **Data Storage:**  Upload your data to a Cloud Storage bucket and ensure your service account has access to read the data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9fad852c-746f-4665-80ad-bf34dc68798d",
   "metadata": {},
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "94033c51-d860-466a-9eec-22925952e2f5",
   "metadata": {},
   "source": [
    "In this submodule we will complete our analysis of the IRI proteomics data by caring out a linear regression based differential analysis (also called differential expression analysis) using the <b>limma</b> package in BioConductor.  In differential analysis, we are looking for significant changes in specific features between one or more states.  We measure these changes by comparing the \"peaks\" of each feature across the states.  A peak represents whatever signal we're measuring.  For proteomics and metabolomics, this is usually the signal intensity from a mass spectrometry machine.  For next-generation sequencing experiments, these peaks will be based on the read counts from your alignment step.  A peak can represent a variety of signals depending on your experiment, but what matters is that the peak represents some biological quantity.\n",
    "\n",
    "In comparing two peaks for the same feature across two states, we are interested in whether the difference in peak heights are statistically significant (measured by the <i>p</i>-value) and the magnitude of this difference.  This second value is a measure of the <b>effect size</b> and in omics studies typically the log2(fold change) between the two samples.  We can then use <i>p</i>-value and log2(fold change) cutoffs to identify which proteins significantly differ between states.  While this may sound complicated, to get these values we use the same linear regression methods we studied in Chapter 3 and the coefficients of our regression model are the log2(fold change) values."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3440e627-8984-4f92-9637-41bd479222c0",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "<b>&#9995; Tip:</b> Blue boxes will indicate helpful tips.</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de503954-faa3-4caf-a187-ade2d6ac46c7",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\">\n",
    "<b>&#127891; Fun Facts:</b> Used for interesting asides or notes.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d37299fd-ac17-44b8-8af3-64ab55d55559",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-success\">\n",
    "<b>&#128074; Success:</b> This alert box indicates a successful or positive action.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "59fff243-9787-4690-8924-ecf71b2c7fa9",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-danger\">\n",
    "<b>&#128721; Caution:</b> A red box indicates potential hazards or pitfalls you may encounter.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1a533d6-5fa4-48f2-bae3-23220b03d75f",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ac52e0a",
   "metadata": {},
   "source": [
    "## Get Started"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8d55b2a-0c39-46e9-94c1-3a68ccecd5eb",
   "metadata": {},
   "source": [
    "### Data Prep"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3834b1e2-ed96-4fbe-8967-119495ae0c8a",
   "metadata": {},
   "source": [
    "We will start with our usual data prep (installation of packages, loading of experimental object, definition of variables)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d64118cf-2033-435c-973f-e4974e165c08",
   "metadata": {},
   "outputs": [],
   "source": [
    "if(!require(devtools)) install.packages(\"devtools\")\n",
    "    devtools::install_github(\"cran/statmod\")\n",
    "\n",
    "if (!require(\"BiocManager\", quietly = TRUE))\n",
    "    install.packages(\"BiocManager\")\n",
    "library('BiocManager')\n",
    "BiocManager::install(\"preprocessCore\")\n",
    "BiocManager::install(\"preprocessCore\", configure.args=\"--disable-threading\", force = TRUE)\n",
    "\n",
    "bioc_packages <- c(\"limma\", \"ComplexHeatmap\", \"M3C\", \"EnhancedVolcano\", \"Glimma\")\n",
    "installed_bioc_packages <- bioc_packages %in% rownames(installed.packages())\n",
    "if (any(installed_bioc_packages == FALSE)) {BiocManager::install(bioc_packages[!installed_bioc_packages])}\n",
    "\n",
    "packages <- c(\"plyr\", \"dplyr\", \"matrixStats\", \"statmod\")\n",
    "installed_packages <- packages %in% rownames(installed.packages())\n",
    "if (any(installed_packages == FALSE)) {install.packages(packages[!installed_packages])}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30edca06-b1a6-4a1a-bd8e-b95ceb90b583",
   "metadata": {},
   "outputs": [],
   "source": [
    "require('preprocessCore')\n",
    "require('limma')\n",
    "require('Glimma')\n",
    "require('ComplexHeatmap')\n",
    "require(\"EnhancedVolcano\")\n",
    "require('M3C')\n",
    "require('tidyverse')\n",
    "require('plyr')\n",
    "require('matrixStats')\n",
    "require('statmod')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a6d7252-9183-4940-835e-b0fde18ae49e",
   "metadata": {},
   "outputs": [],
   "source": [
    "exp_obj <- readRDS(file = \"data/Saved_Data/exp_obj.rds\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58a681b8-7116-4a15-a40d-5a352b50bbdd",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "<b>&#9995; Tip:</b> The proteome and metadata objects loaded below are saved in the .rds object in previous submodules. If you see blank variables for either, make sure you've run through the code in submodule 6 and 7 to have the most up to date saved variables in <i>exp_obj.rds</i></div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "829e23cd-f087-4085-abea-f6a88151bd36",
   "metadata": {},
   "outputs": [],
   "source": [
    "proteome <- exp_obj$data$proteomics$norm\n",
    "metadata <- exp_obj$metadata\n",
    "metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2fab409-f656-478e-8702-fc06cb84d263",
   "metadata": {},
   "outputs": [],
   "source": [
    "#Preview proteome data\n",
    "head(proteome)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9f1c9a9f-c5d0-450c-98b0-51e04ebf7d44",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-danger\">\n",
    "<b>&#128721; Caution:</b> Reminder that this version of the experimental object should contain the quantile normalized data from the last chapter.  Use <i>str(exp_obj)</i> if you need to verify that this is the case. \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a71f8eb4-2f00-4802-aa47-8b40f894d2cc",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "<b>&#9995; Tip:</b> Reminder:\n",
    "    <ul>\n",
    "        <li><b>CTRL</b> - Control samples, no surgery or treatment\n",
    "        <li><b>SHAM</b> - Surgical treatment only; Designed to see if the stress of cutting the rat causes any changes\n",
    "        <li><b>PLB</b> - IRI model with placebo\n",
    "        <li><b>TRE</b> - IRI model treated with trepostrinil\n",
    "    </ul>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aa554b12-ee6d-43d6-83ad-1b26bc861055",
   "metadata": {},
   "source": [
    "For the sake of simplicity and because this experiment wasn't explicitly designed as a longitudinal study, we will combine our Treatment and Time covariates into a single covariate with 16 factors using a custom R function that we will call <i>group_metadata</i>.  This will allow us to run a simple linear regression model using both covariates.  However, you are welcome to test more complicated models such as a multivariate model with both Treatment and Time as separate covariates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "75f2cf02-22d7-456f-a57b-bce93f8c30c2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# custom function that accepts a metadata data frame and covariates to group together into a single covariate.  \n",
    "# Returns a new version of the metadata data frame with the new covariate column called Combined added.\n",
    "\n",
    "group_metadata <- function(metadata, groups) {\n",
    "  metadata <- metadata %>%\n",
    "    group_by_at(groups) %>%\n",
    "    unite(Combined, all_of(groups), sep = \"\", remove = FALSE)\n",
    "  metadata$Combined <- factor(metadata$Combined)\n",
    "  metadata\n",
    "}\n",
    "metadata <- group_metadata(metadata, c(\"Treatment\", \"Time\"))\n",
    "metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "961779b1-ff68-4c9e-b824-c3cbd8a80d36",
   "metadata": {},
   "source": [
    "Before we run the combined covariates model, let's start with just Treatment.  We will use the cell means version of the formula which sets the intercept to 0.  Because our data is log2 transformed, our regression coefficients will represent the log2 fold change of the given protein between groups.  First we need to build a model matrix which takes our model and converts the categorical variables to dummy variables (0 or 1). Our model matrix will be as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b74e155c-6087-41af-867e-903893888b55",
   "metadata": {},
   "outputs": [],
   "source": [
    "molecule.model.treatment <- model.matrix(~ 0 + Treatment, metadata)\n",
    "colnames(molecule.model.treatment) <- levels(metadata$Treatment)\n",
    "molecule.model.treatment"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2152dbce-6002-46cb-8606-901a99be670b",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "<b>&#9995; Tip:</b> The model matrix is also called a design matrix.</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ad6dcd0-a17d-4ced-88ca-d9c4dbba1a5c",
   "metadata": {},
   "source": [
    "The next step will be to define the parameters of our regression model.  We will account for the batch effect observed in the last chapter by setting the <i>Batch</i> column as a blocking factor in the function <i>duplicateCorrelation</i> which computes the intrablock correlation.  Once these parameters are set, we will run a simple fit which <b>limma</b> will return as an <b>MArrayLM</b> object.  Because we need to account for our blocking factor, we will first run the <i>duplicateCorrelation</i> function to estimate the the correlations between the observations of our blocking variable (i.e. our batch effect).  We will then pass these results to the <i>lmFit</i> function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6b67291-d221-4aca-a4b0-19e7c2ef47cc",
   "metadata": {},
   "outputs": [],
   "source": [
    "molecule.corfit.treatment <- duplicateCorrelation(\n",
    "    proteome,\n",
    "    molecule.model.treatment,\n",
    "    block = metadata$Batch\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "383b1ced-221f-4388-82ce-84200cda80a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "molecule.SimpleFit.treatment <- lmFit(\n",
    "    proteome,\n",
    "    molecule.model.treatment,\n",
    "    block = metadata$Batch,\n",
    "    correlation = molecule.corfit.treatment$consensus\n",
    ")\n",
    "molecule.SimpleFit.treatment"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8cc9354f-ab57-45f4-b1d5-7f89718ccd62",
   "metadata": {},
   "source": [
    "We'll look at these results in more detail later.  Right now, we need to think about <b>contrasts</b>.  Typically when we want to compare two states, we would use a simple <b>t-test</b>.  However, using multiple t-tests can introduce error if it's not corrected.  A better strategy is to extract pairwise information from our global regression model using <b>contrasts</b>.  With contrasts, we can define very sophisticated comparisons by combining different factors together, but we'll keep it simple for now and just compare different treatment states."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8562774f-f751-4fff-adfd-85b6dd6c385f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# define contrasts\n",
    "contrasts.treatment <- c(\"SHAM-CTRL\", \"PLB-SHAM\", \"TRE-PLB\")\n",
    "# make the contrasts\n",
    "molecule.contrasts.treatment <- makeContrasts(contrasts = contrasts.treatment, levels = molecule.model.treatment)\n",
    "# refit the simple fit we previously ran using the contrast data\n",
    "molecule.contrastsFit.treatment <- contrasts.fit(molecule.SimpleFit.treatment, molecule.contrasts.treatment)\n",
    "# calculate stats and rank proteins using an emperical Bayes method\n",
    "molecule.contrastsFit.treatment <- eBayes(molecule.contrastsFit.treatment)\n",
    "# fix column names to match contrasts\n",
    "colnames(molecule.contrastsFit.treatment$coefficients) = contrasts.treatment"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5fa624c-ad49-4782-9196-9ad51bcd5152",
   "metadata": {},
   "source": [
    "The full regression model gives us a global overview of which covariates and features are important.  With contrasts, we can break that information down and see the changes for each protein within the context of each specific pair of Treatment states we're interested in."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2255036",
   "metadata": {},
   "source": [
    "### Data Cleaning\n",
    "\n",
    "Before continuing Lets see if there are any NA (Not Available) values in `molecule.contrastsFit.treatment$coefficients`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "641c26b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "nrow(molecule.contrastsFit.treatment$coefficients)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d99a98c7",
   "metadata": {},
   "source": [
    "It includes `2897` rows. Lets look at the first five rows of it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "da13ec34",
   "metadata": {},
   "outputs": [],
   "source": [
    "molecule.contrastsFit.treatment$coefficients[1:5,]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "07e4d1c5",
   "metadata": {},
   "source": [
    "Now, we are going to count number of rows in the table such that it includes \"at least\" one NA value: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc31a41e",
   "metadata": {},
   "outputs": [],
   "source": [
    "num_NA_rows <- sum(rowSums(is.na(molecule.contrastsFit.treatment$coefficients)) > 0)\n",
    "num_NA_rows"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db028fdf",
   "metadata": {},
   "source": [
    "We can also see which row includes a null value:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2d068a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "which(is.na(molecule.contrastsFit.treatment$coefficients[,1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c80e0413",
   "metadata": {},
   "source": [
    "Now use the output to see what column contains the NA value:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3b087aa5",
   "metadata": {},
   "outputs": [],
   "source": [
    "molecule.contrastsFit.treatment$coefficients[1434,]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cdd18932",
   "metadata": {},
   "source": [
    "The above analysis shows that there is one NA value in `SHAM-CTRL` column of our data. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "70641054",
   "metadata": {},
   "source": [
    "We can safely remove the 1434th row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "038fe4be",
   "metadata": {},
   "outputs": [],
   "source": [
    "molecule.contrastsFit.treatment <- molecule.contrastsFit.treatment[-1434,]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6a4b8eb4",
   "metadata": {},
   "source": [
    "### Diagnostic plots\n",
    "Before we analyze the specific protein changes, let's look at some diagnostic plots.  We'll start with the <b>MA plot</b>.  The MA plot shows the relationship between the log2(fold change) (the \"M\" in MA) and the mean expression (the \"A\").  We expect most proteins will not show differential expression in a given experiment, and thus most of the data points will be located at 0.  Extreme values along the y-axis indicate proteins with strongly differential values between the two states.  Particularly when working with genomic data, the MA plot will often show a \"fanning\" effect due to the fact that low expression genes (plotted on the left) show more variability in fold change than highly expressed genes (plotted on the right).  MA plots are also useful for evaluating the effects of normalization on the data.  Let's look at the MA plots for our three contrasts to get a sense of the landscape of our data.  We'll use the <b>Glimma</b> package which is used to create interactive plots from <b>limma</b> data.  You will be able to clock on specific data points to see the effects on specific genes."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7209c96-23d7-4c1d-a483-ce2abc696dbc",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "<b>&#9995; Tip:</b> Glimma plots are interactive.  If you click on datapoints on the graph, the names of the points will appear in the box below and filter the table.</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fe7640fb-799f-4b1f-b90a-ec87106ee04e",
   "metadata": {},
   "outputs": [],
   "source": [
    "#SHAM-CTRL\n",
    "glimmaMA(molecule.contrastsFit.treatment, coef=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4140443c-8bf9-4478-b7b5-0afded12ec55",
   "metadata": {},
   "outputs": [],
   "source": [
    "#PLB-SHAM\n",
    "glimmaMA(molecule.contrastsFit.treatment, coef=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ff78684c-27fa-4cfa-b286-d8d7d43db0a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "#TRE-PLB\n",
    "glimmaMA(molecule.contrastsFit.treatment, coef=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9891dc61-947c-41ff-8bc1-4507f74853e9",
   "metadata": {},
   "source": [
    "What do these results show us?  As hoped, there aren't many significant difference between the expression profiles of control and sham, meaning the surgical technique itself introduces a minimal amount of noise.  As expected, the IRI itself induces many proteomic changes compared to the sham.  Based on treatment alone though, there are few significant changes in protein levels for the treated state.  However, we still have to consider the time component.  In early time points, the changes may not be significant enough to see an effect from treatment, and in later time points the damage could be significant enough that treatment is ineffective.  To get a better sense of what is happening, we will need to add the time component.  First though, let's verify this with the volcano plots."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a6fcafa-3232-4971-bc61-3fcde106dba3",
   "metadata": {},
   "source": [
    "A <b>volcano plot</b> is another extremely useful visualization tool for comparing pairwise omics data.  In this case, we are plotting the log2(fold change) on the x-axis and the -log10(<i>p</i>-value) on the y-axis.  This centers the data on the x-axis on zero, meaning that the plot on each side of zero represents one of our factors of interest.  By applying cutoff, we can easily identify quadrants of significantly different proteins and easily identify which treatment state they represent."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "99e2bbc8-c241-4f9e-a529-d5723b28657b",
   "metadata": {},
   "outputs": [],
   "source": [
    "#SHAM-CTRL\n",
    "glimmaVolcano(molecule.contrastsFit.treatment, coef=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b36821d4-f901-4c1e-9bb7-0f00249b7ea0",
   "metadata": {},
   "outputs": [],
   "source": [
    "#PLB-SHAM\n",
    "glimmaVolcano(molecule.contrastsFit.treatment, coef=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6e7b3fb7-d6ec-4082-b79f-0d3a6ad91af0",
   "metadata": {},
   "outputs": [],
   "source": [
    "#TRE-PLB\n",
    "glimmaVolcano(molecule.contrastsFit.treatment, coef=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58e5a4a5-2ac7-4fb4-aab8-b5b8d2f67831",
   "metadata": {},
   "source": [
    "The results are consistent but easier to visualize on the volcano plot.  It's important to remember that the volcano plot is showing <b>relative</b> abundance, that is, how abundant is a signal in one condition versus another.  It doesn't tell you anything about the expression levels themselves.  The MA plot and box plots (and the raw data of course) are useful for determining why this abundance occurs (e.g. a protein is more highly expressed in condition B vs. condition A)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a7f6defc-c3cf-48f1-88b0-288cc3e1540e",
   "metadata": {},
   "source": [
    "Now we will use the <i>topTable</i> function to look at the regression coefficients themselves for SHAM-CTRL, PLB_SHAM, and finally TRE-PLB. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2d137b9e-2023-4607-9465-c78bd3638bfb",
   "metadata": {},
   "outputs": [],
   "source": [
    "topTable(molecule.contrastsFit.treatment, coef = \"SHAM-CTRL\", number = 10, adjust.method = \"BH\", sort.by = \"M\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d0724a5d-34ee-4eea-8e4d-7a17212a7d09",
   "metadata": {},
   "outputs": [],
   "source": [
    "topTable(molecule.contrastsFit.treatment, coef = \"PLB-SHAM\", number = 10, adjust.method = \"BH\", sort.by = \"M\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5aa3eb65-afa2-4730-b81a-186ccf005c31",
   "metadata": {},
   "outputs": [],
   "source": [
    "topTable(molecule.contrastsFit.treatment, coef = \"TRE-PLB\", number = 10, adjust.method = \"BH\", sort.by = \"M\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20cad05c-e0fe-4acd-9cfd-b34d2217bf77",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "    <b>&#9995; Tip:</b> The \"M\" in the <b>sort.by</b> argument refers to the log2(fold change), or the \"M\" axis in our MA plot.</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e31ef2ab-e4a3-4fa1-902a-8a5d045f45b0",
   "metadata": {},
   "source": [
    "The cumulative results show that there is a clear change in state between the sham and the placebo (i.e. untreated IRI) states, but we will need to include the time component to better understand what is happening.  Let's run these analyses again, except this time we will use our combined covariate to account for time. We'll only look at the 6 hour comparisons but feel free to try different contrasts. The following code is the same as above except for the formula used to define our model. Above, we used this formula to define our model as <i>model.matrix(\\~0 + Treatment, metadata)</i> whereas this below code defines the model as <i>model.matrix(\\~0 + Treatment, metadata).</i>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9027e7ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "molecule.model.combined <- model.matrix(~ 0 + Combined, metadata)\n",
    "colnames(molecule.model.combined) <- levels(metadata$Combined)\n",
    "molecule.model.combined"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1455d9dd",
   "metadata": {},
   "outputs": [],
   "source": [
    "molecule.corfit.combined <- duplicateCorrelation(\n",
    "    proteome,\n",
    "    molecule.model.combined,\n",
    "    block = metadata$Batch\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8623c66d",
   "metadata": {},
   "outputs": [],
   "source": [
    "molecule.SimpleFit.combined <- lmFit(\n",
    "    proteome,\n",
    "    molecule.model.combined,\n",
    "    block = metadata$Batch,\n",
    "    correlation = molecule.corfit.combined$consensus\n",
    ")\n",
    "molecule.SimpleFit.combined"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "19e84156",
   "metadata": {},
   "outputs": [],
   "source": [
    "contrasts.combined <- c(\"SHAM6-CTRL0\", \"PLB6-SHAM6\", \"TRE6-PLB6\")\n",
    "molecule.contrasts.combined <- makeContrasts(contrasts = contrasts.combined, levels = molecule.model.combined)\n",
    "molecule.contrastsFit.combined <- contrasts.fit(molecule.SimpleFit.combined, molecule.contrasts.combined)\n",
    "molecule.contrastsFit.combined <- eBayes(molecule.contrastsFit.combined)\n",
    "colnames(molecule.contrastsFit.combined$coefficients) = contrasts.combined"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4adc1ed9",
   "metadata": {},
   "source": [
    "### Data Cleaning\n",
    "\n",
    "Before continuing Lets see if there are any NA (Not Available) values in `molecule.contrastsFit.combined$coefficients`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7e8eb97a",
   "metadata": {},
   "outputs": [],
   "source": [
    "molecule.contrastsFit.combined$coefficients[1:5,]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f8dfb4d",
   "metadata": {},
   "outputs": [],
   "source": [
    "nrow(molecule.contrastsFit.combined$coefficients)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51014b36",
   "metadata": {},
   "source": [
    "`molecule.contrastsFit.combined$coefficients` includes 2897 rows. Lets see how many of these rows includes at least one NA (Not Available) value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "edb236df",
   "metadata": {},
   "outputs": [],
   "source": [
    "rows_with_NA=rowSums(is.na(molecule.contrastsFit.combined$coefficients)) > 0 # It is a vector of length 2897 with Boolean values\n",
    "num_NA_rows <- sum(rows_with_NA)\n",
    "num_NA_rows"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2097c3fc",
   "metadata": {},
   "source": [
    "Finally we remove rows with NA value:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b6ffd122",
   "metadata": {},
   "outputs": [],
   "source": [
    "molecule.contrastsFit.combined <- molecule.contrastsFit.combined[!rows_with_NA, ]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "573ca07d",
   "metadata": {},
   "source": [
    "### Diagnostic plots"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "96b24090",
   "metadata": {},
   "outputs": [],
   "source": [
    "#SHAM6-CTRL0\n",
    "glimmaMA(molecule.contrastsFit.combined, coef=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae6f447a-2ebf-4c2d-9ac3-c837f498d845",
   "metadata": {},
   "outputs": [],
   "source": [
    "#PLB6-SHAM6\n",
    "glimmaMA(molecule.contrastsFit.combined, coef=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b38b2c88-4b48-44ec-9d56-33098be53abf",
   "metadata": {},
   "outputs": [],
   "source": [
    "#TRE6-PLB6\n",
    "glimmaMA(molecule.contrastsFit.combined, coef=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3ca743dd",
   "metadata": {},
   "outputs": [],
   "source": [
    "#SHAM6-CTRL0\n",
    "glimmaVolcano(molecule.contrastsFit.combined, coef=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5623ad83-fcca-4ecc-b7ac-de0269e78d16",
   "metadata": {},
   "outputs": [],
   "source": [
    "#PLB6-SHAM6\n",
    "glimmaVolcano(molecule.contrastsFit.combined, coef=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "afc0b6d9-0561-40ad-9df0-8cf8943ee74a",
   "metadata": {},
   "outputs": [],
   "source": [
    "#TRE6-PLB6\n",
    "glimmaVolcano(molecule.contrastsFit.combined, coef=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3d38ade8",
   "metadata": {},
   "outputs": [],
   "source": [
    "topTable(molecule.contrastsFit.combined, coef = \"SHAM6-CTRL0\", number = 10, adjust.method = \"BH\", sort.by = \"M\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bdd6cebd",
   "metadata": {},
   "outputs": [],
   "source": [
    "topTable(molecule.contrastsFit.combined, coef = \"PLB6-SHAM6\", number = 10, adjust.method = \"BH\", sort.by = \"M\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "51cacdf4",
   "metadata": {},
   "outputs": [],
   "source": [
    "topTable(molecule.contrastsFit.combined, coef = \"TRE6-PLB6\", number = 10, adjust.method = \"BH\", sort.by = \"M\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4022ef83-3e34-487b-94aa-60bb4b756575",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "<b>&#9995; Tip:</b> As an exercise, redo the previous analysis using the multivariate model \"~ 0 + Treatment + Time\" to see if separating the covariates has an affect on the results.</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c846d92-3ec6-4715-ace0-0a7b9a981526",
   "metadata": {},
   "source": [
    "Now we have a clearer view of the changes that are happening between the states.  At 6 hours when we expect IRI damage to be noticeable but still treatable, we start to see significant differences in the protein abundances between the sham and placebo states, but we also see several proteins affected by the trepostrinil treatment.  While we can try different models to identify other covariates affected the results, we have now identified a set of potential biomarkers that can indicate the degree of severity of IRI and other biomarkers which may indicate the effects of the treatment on IRI damage."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9512d22c-0565-4874-a768-ebee1142f49a",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8425be2f-3c64-49f4-9eec-08a90ca20373",
   "metadata": {},
   "source": [
    "# Finishing Our Experiment"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "821dba11-0d53-486c-8673-d8dc1f3297e2",
   "metadata": {},
   "source": [
    "You should now be familiar with the concepts of linear and logistic regression and how they can be used to compare biomarkers or to classify samples or to assess the quality of biomarkers.  We have discussed common methods of exploratory analysis such as PCA and heatmaps that are useful for identifying global patterns in omics data.  Finally, we discussed differential analysis of omics data to identify specific potential biomarkers.  However, we are ending on something of a cliffhanger.  Now that we have a list of potential protein biomarkers from our proteomics analysis, what comes next?\n",
    "\n",
    "First, we need to assess the biological relevance of our results.  A common way to do this is with <b>pathway analysis</b>.  Most proteins are part of complex metabolic and regulatory networks that represent some biological theme (e.g. TCA cycle, apoptosis, etc.).  We would expect proteins within a specific pathway or biological process to be correlated together, otherwise it could lead to significant disruption of the cellular machinery.  Just as we can test for differential abundance of proteins, we can also look at differential abundance of pathways.  In our case, several of the proteins we find deferentially expressed are related to kidney injury and function, which is what we would expect.  With this biological knowledge, we can begin to filter our list of potential biomarkers to include thode that are most likely to be biological relevant in our model (i.e. causative of or directly caused by the state change).\n",
    "\n",
    "Ideally, we would design our experiments to have sufficient numbers of samples and controls to maximize the statistical power of the experiment.  In practice, budget constraints and other factors often limit the number of samples and replicates we can test.  A complementary approach is to compare your data set with similar datasets from other sources to maximize the statistical power of the grouped data set.  This is known as <b>meta-analysis</b>. Meta-analysis is particularly useful today because the number of omics datasets is very high compared to the past and for commonly studied systems, it's possible to have dozens if not hundreds of related datasets.  However, meta-analysis is not trivial.  In our single proteomics dataset, we identified one batch effect, but in meta-analysis, each dataset is likely to have its own unique batch effects.  These can result from experimental design, sample prep strategy, instrument used to generate the data, the skills of the lab staff, geographical location, and numerous other sources.  Because of the complexity of meta-analysis, many labs choose not to do this.  When properly performed however, meta-analysis can provide an additional level of confidence that the biomarkers you have identified are biologically and statistically significant.\n",
    "\n",
    "One common criticism of omics data is whether the results are real or a result of experimental artifacts. As a data set gets bigger, it becomes more likely that you will begin to find statistically significant results purely by chance.  To forestall such criticisms, it is standard procedure to experimentally validate key omics results whenever possible.  In our case, that means testing the potential biomarkers experimentally to prove that the patterns we see are actually happening.  There are several ways to experimentally validate omics results, the most common of which are listed below:\n",
    "\n",
    "| Omics Method | Experimental Validation Method |\n",
    "| --- | --- |\n",
    "| Transcriptomics | Quantitative PCR, Northern Blot |\n",
    "| Epigenomics | ChIP, Protein-Binding Assay |\n",
    "| Proteomics | Western Blot |\n",
    "| Metabolomics/Lipidomics/Glycomics | HPLC, NMR |\n",
    "\n",
    "Once validated with experimental results, we can begin comparing our potential biomarker to known biomarkers using the regression methods we have already discussed.  More sophisticated methods of biomarker validation can be found in the scientific literature, and you should now have the background to study these methods on your own.\n",
    "\n",
    "To summarize, this would be the strategy we would follow for identifying potential proteomic biomarkers:\n",
    "\n",
    "1. Generate hypothesis designed to address your biological question\n",
    "2. Design proteomics experiment with proper controls and identify relevant covariates\n",
    "3. Generate proteomics data\n",
    "4. Differential analysis of proteomics data to identify potential biomarkers indicating relevant state changes\n",
    "5. Pathway analysis to link potential biomarkers to pathways relevant to your hypothesis and to identify differentially expressed pathways\n",
    "6. Experimental validation of biologically relevant biomarkers using Western blots\n",
    "7. Comparison of potential biomarkers to known biomarkers to assess clinical utility"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0001989f-29f4-48d1-bdb1-5f4376471dc0",
   "metadata": {},
   "source": [
    "## But Wait, There's More!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2aec9fd-0e1b-4205-94da-2a887948386b",
   "metadata": {},
   "source": [
    "For a traditional bioinformatics analysis, this would be the end of this particular experiment.  However, at the time of this writing, the field of bioinformatics is undergoing a rapid transition.  We now have publicly available omics datasets from thousands of laboratories spanning 20+ years.  We are also rapidly trasitioning to <b>single cell</b> and <b>spatial omics</b> methods which allow for much much higher resolution omics studies than traditionally available.  This obviously represents a huge amount of data and analysis by traditional methods is becoming difficult.  This is precipitating a transition of biomedical research to cloud environments coupled to <b>machine learning</b> techniques designed to handle massive, complex data sets.  While this may sound daunting, we've already covered basic techniques involved in machine learning, including linear and logistics regression, principle components analysis, and clustering.  While there are many applications for machine learning in biomedical data science, biomarker discovery is especially amenible to these methods and thus we will introduce basic ML models for biomarker discovery in the final submodule, <b>Submodule 9: Biomarker Discovery Using Machine Learning</b>.  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac5bbcfe-9a6f-4cd1-b7c8-106ee0485e4d",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c4f82ec2-06ff-4fd7-9b38-0b90ad3f385c",
   "metadata": {},
   "source": [
    "<p><span style=\"font-size: 30px\"><b>Quizzes</b></span> <span style=\"float : inline;\">(run the command below to display the quizzes)</span> </p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e36754ce-8a91-4510-b731-f36c08abb071",
   "metadata": {},
   "outputs": [],
   "source": [
    "IRdisplay::display_html('<iframe src=\"quizes/Chapter8_Quizes.html\" width=100% height=450></iframe>')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "77fb1c36-349f-457c-91fd-0a8891a4e1e4",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "41625d16",
   "metadata": {},
   "outputs": [],
   "source": [
    "sessionInfo()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cfd784b4",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4633e324",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "This submodule demonstrated the identification of potential protein biomarkers for Ischemia Reperfusion Injury (IRI) using linear regression-based differential analysis with the `limma` package in R.  By incorporating both treatment and time covariates, we revealed significant protein abundance changes between sham and placebo groups at the 6-hour mark, as well as alterations induced by treprostinil treatment. While this analysis provides a strong starting point, further steps are crucial for solidifying these findings.  These include pathway analysis to assess biological relevance, potential meta-analysis with comparable datasets to enhance statistical power, and experimental validation using methods like Western blots.  The next submodule will explore the application of machine learning techniques for biomarker discovery, offering a powerful complement to traditional bioinformatics approaches in the era of big data and increasingly complex experimental designs."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe591226",
   "metadata": {},
   "source": [
    "## Clean up\n",
    "\n",
    "Remember to move to the next notebook or shut down your instance if you are finished."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "364463db-73c2-457c-83e9-20f0c15c8df0",
   "metadata": {},
   "source": [
    "<div style=\"display: flex; justify-content: center; margin-top: 20px; width: 100%;\"> \n",
    "    <div style=\"display: flex; justify-content: space-between; width: 50%;\"> \n",
    "        <div> \n",
    "            <a href=https://github.com/NIGMS/Analysis-of-Biomedical-Data-for-Biomarker-Discovery/blob/master/GoogleCloud/Submodule07_Exploratory_Proteomic_Analysis.ipynb#overview>Previous section</a>                                            \n",
    "        </div> \n",
    "        <div> \n",
    "            <a href=\"#top8\">Top of this page</a>                                                      \n",
    "        </div> \n",
    "        <div> \n",
    "            <a href=https://github.com/NIGMS/Analysis-of-Biomedical-Data-for-Biomarker-Discovery/blob/master/GoogleCloud/Submodule09_Biomarker_Discovery_using_ML.ipynb#overview>Next section</a>\n",
    "        </div> \n",
    "    </div>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99f80099-1044-4d04-a212-5d84b4a22d15",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e8b4a3ad-acf8-4c8b-94f2-11f2c0c30284",
   "metadata": {},
   "source": [
    "[Hou J, Tolbert E, Birkenbach M, Ghonem NS. Treprostinil alleviates hepatic mitochondrial injury during rat renal ischemia-reperfusion injury. Biomed Pharmacother. 2021 Nov;143:112172. doi: 10.1016/j.biopha.2021.112172. Epub 2021 Sep 21. PMID: 34560548; PMCID: PMC8550798.][hou]<br>\n",
    "[Ding M, Tolbert E, Birkenbach M, Gohh R, Akhlaghi F, Ghonem NS. Treprostinil reduces mitochondrial injury during rat renal ischemia-reperfusion injury. Biomed Pharmacother. 2021 Sep;141:111912. doi: 10.1016/j.biopha.2021.111912. Epub 2021 Jul 15. PMID: 34328097; PMCID: PMC8429269.][ding]<br>\n",
    "\n",
    "[ding]: https://pubmed.ncbi.nlm.nih.gov/34328097/ \"Ding M, Tolbert E, Birkenbach M, Gohh R, Akhlaghi F, Ghonem NS. Treprostinil reduces mitochondrial injury during rat renal ischemia-reperfusion injury. Biomed Pharmacother. 2021 Sep;141:111912. doi: 10.1016/j.biopha.2021.111912. Epub 2021 Jul 15. PMID: 34328097; PMCID: PMC8429269.\"\n",
    "[hou]: https://pubmed.ncbi.nlm.nih.gov/34560548/ \"Hou J, Tolbert E, Birkenbach M, Ghonem NS. Treprostinil alleviates hepatic mitochondrial injury during rat renal ischemia-reperfusion injury. Biomed Pharmacother. 2021 Nov;143:112172. doi: 10.1016/j.biopha.2021.112172. Epub 2021 Sep 21. PMID: 34560548; PMCID: PMC8550798.\""
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R (Local)",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "4.4.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}