Biomedical-Data-Analysis / Git / [cb2392] /GoogleCloud/Submodule07_Exploratory_Proteomic

Models:
DanielG/
Biomedical-Data-Analysis
Downloads: 1
[cb2392]: / GoogleCloud / Submodule07_Exploratory_Proteomic_Analysis.ipynb
History
Download this file
994 lines (993 with data), 41.5 kB

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f63519ce-ca5a-4f64-93a5-c5491b58d46b",
   "metadata": {},
   "source": [
    "<img src=\"images/RIINBRE-Logo.jpg\" width=\"400\" height=\"400\"><img src=\"images/MIC_Logo.png\" width=\"600\" height=\"600\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc55d591-83ea-4db3-9c26-23831e09008b",
   "metadata": {},
   "source": [
    "# Analysis of Biomedical Data for Biomarker Discovery\n",
    "<a id=\"top7\"></a>\n",
    "## Submodule 7: Exploratory Analysis of Proteomics IRI Data\n",
    "### Dr. Christopher L. Hemme\n",
    "### Director, [RI-INBRE Molecular Informatics Core](https://web.uri.edu/riinbre/mic/)\n",
    "### The University of Rhode Island College of Pharmacy"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ca07ca3-7e42-42aa-ac12-36f85cd8cd4a",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f0476d6",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "This Jupyter Notebook performs exploratory analysis of proteomics data from a rat renal Ischemia-Reperfusion Injury (IRI) experiment to identify potential biomarkers.  After loading necessary R packages and the experimental data, the notebook quantile normalizes the log2 transformed proteome data.  It then employs Principal Component Analysis (PCA) and Multidimensional Scaling (MDS) for dimensionality reduction, revealing a batch effect that is subsequently corrected.  PCA and biplot visualizations are generated, first on the full dataset and then on the top 100 most variable proteins, to explore sample clustering and identify proteins contributing to group separation based on treatment and time.  Finally, heatmaps are generated to visualize patterns in the data, further highlighting potential biomarkers. The normalized data is saved for use in subsequent differential analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea613fe8",
   "metadata": {},
   "source": [
    "## Learning Objectives\n",
    "\n",
    "1. **Understand the rationale for exploratory analysis in omics data:**  Learn why exploring large datasets like proteomics data is crucial for biomarker discovery, and appreciate the challenges posed by the \"curse of dimensionality\".\n",
    "\n",
    "2. **Perform data normalization and understand its importance:**  Practice quantile normalization using the `preprocessCore` package in R and understand its role in reducing variation and noise in the data for downstream analysis.  Gain awareness of alternative normalization methods and the potential for normalization to introduce biases.\n",
    "\n",
    "3. **Conduct dimensionality reduction using PCA and MDS:** Learn how to apply Principal Component Analysis (PCA) and Multidimensional Scaling (MDS) using R for visualizing high-dimensional proteomics data.  Interpret scree plots to identify relevant principal components, and interpret PCA biplots to understand the relationships between samples and variables (proteins). Recognize batch effects and learn how to correct for them using `removeBatchEffect` (for exploratory purposes) and understand the limitations and alternative methods for addressing batch effects in differential analysis.\n",
    "\n",
    "4. **Visualize data using heatmaps:**  Learn to create complex heatmaps using the `ComplexHeatmap` package in R, including adding annotations and customizing the appearance for effective data visualization and exploration of patterns in the data.\n",
    "\n",
    "5. **Gain practical experience with R and Bioconductor packages:**  Develop proficiency in using various R packages including `preprocessCore`, `limma`, `ComplexHeatmap`, `plyr`, `tidyverse`, and `factoextra` for data manipulation, normalization, dimensionality reduction, and visualization.\n",
    "\n",
    "6. **Prepare data for downstream differential analysis:** By performing exploratory analysis, understand the influence of covariates such as Treatment and Time, which informs how to proceed with a more rigorous differential analysis in subsequent steps.  The notebook also teaches how to save processed data for later use."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2687ba35",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "1. **Create a Vertex AI Notebooks instance:** Choose your machine type and other configurations.\n",
    "2. **Select the R kernel:**  Make sure you choose the appropriate R environment when creating or opening the notebook.\n",
    "3. **Install R Packages:** The notebook itself handles the installation of the necessary R packages (`devtools`,`factoextra`,`BiocManager` ,`limma`,`ComplexHeatmap`,`M3C`,`preprocessCore` ,`plyr`,`tidyverse`,`dplyr`, `ggplot2`, `IRdisplay`)\n",
    "4. **Data Storage:**  Upload your data to a Cloud Storage bucket and ensure your service account has access to read the data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9fad852c-746f-4665-80ad-bf34dc68798d",
   "metadata": {},
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e95be7b-4276-4f71-a16e-192cbf82c21f",
   "metadata": {},
   "source": [
    "Omics data allows us to evaluate potential biomarkers on a systematic scale.  Instead of picking and choosing individual biomarkers, omics data allows us to identify large numbers of correlated variables that may indicate broad biological trends.  There are several benefits of such a strategy.  First, we may identify unknown or unanticipated biomarkers that were not obvious candidates but which are clearly correlated to (but not necessarily causative of) disease states.  Second, we can map this data to biological networks, allowing us to identify metabolic pathways correlated to disease states or to identify regulatory networks that can indicate mechanisms by which the state changes are occurring.  The cost of using omics data however is the so-called \"curse of dimensionality\".  Omics data involves thousands of features (e.g. expressed genes, active proteins, etc.), most of which are not actively changed across states.  Identifying the correlated variables and determining their statistical and biological significance is a challenge when dealing with a single omics layer such as the proteome.  The challenge becomes even greater when attempting to correlate results across different omics layers (known as multiomics) or when comparing similar results across laboratories (meta-analysis).  For this module, we will focus on a traditional bulk proteomics experiment from the rat renal IRI data to identify proteins significantly affected by IRI."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3440e627-8984-4f92-9637-41bd479222c0",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "<b>&#9995; Tip:</b> Blue boxes will indicate helpful tips.</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de503954-faa3-4caf-a187-ade2d6ac46c7",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\">\n",
    "<b>&#127891; Note:</b> Used for interesting asides or notes.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d37299fd-ac17-44b8-8af3-64ab55d55559",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-success\">\n",
    "<b>&#9997; Reference:</b> This box indicates a reference for an attached figure or table.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "59fff243-9787-4690-8924-ecf71b2c7fa9",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-danger\">\n",
    "<b>&#128721; Caution:</b> A red box indicates potential hazards or pitfalls you may encounter.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1a533d6-5fa4-48f2-bae3-23220b03d75f",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "650ae720-e6f0-451f-9fae-5b036bec7608",
   "metadata": {},
   "source": [
    "## Get Started\n",
    "### Identification of Proteomic Biomarkers"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a57eab27-34b3-4d77-af02-55ff48a03fe9",
   "metadata": {},
   "source": [
    "### Pre-Analysis Prep"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bbc0cb93-ec6b-4475-9bf7-40103030c175",
   "metadata": {},
   "source": [
    "Let's start by loading the required R packages and our experimental object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b9a966c3-15fc-439d-94e5-d48bbb27a7cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "if(!require(devtools)) install.packages(\"devtools\")\n",
    "devtools::install_github(\"kassambara/factoextra\")\n",
    "\n",
    "if (!require(\"BiocManager\", quietly = TRUE))\n",
    "    install.packages(\"BiocManager\")\n",
    "library('BiocManager')\n",
    "\n",
    "bioc_packages <- c(\"limma\", \"ComplexHeatmap\", \"M3C\")\n",
    "installed_bioc_packages <- bioc_packages %in% rownames(installed.packages())\n",
    "if (any(installed_bioc_packages == FALSE)) {BiocManager::install(bioc_packages[!installed_bioc_packages])}\n",
    "\n",
    "BiocManager::install(\"preprocessCore\")\n",
    "BiocManager::install(\"preprocessCore\", configure.args=\"--disable-threading\", force = TRUE)\n",
    "\n",
    "packages <- c(\"plyr\", \"tidyverse\", \"factoextra\")\n",
    "installed_packages <- packages %in% rownames(installed.packages())\n",
    "if (any(installed_packages == FALSE)) {install.packages(packages[!installed_packages])}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34521b02-e160-4910-90f8-c6d72f548c8e",
   "metadata": {},
   "outputs": [],
   "source": [
    "require('preprocessCore')\n",
    "require('limma')\n",
    "require('ComplexHeatmap')\n",
    "require('plyr')\n",
    "require('tidyverse')\n",
    "require('factoextra')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3d4a920-8c99-4694-8f40-0bb18be40568",
   "metadata": {},
   "outputs": [],
   "source": [
    "exp_obj <- readRDS(file = \"data/Saved_Data/exp_obj.rds\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3be0b50-9b95-473e-b507-3aa8f9bae00f",
   "metadata": {},
   "source": [
    "### Data Normalization"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1e195641-88ed-4df6-9614-2cd6ddb44779",
   "metadata": {},
   "source": [
    "When working with data, it's often a good idea to work with copies of your data instead of the raw data itself to avoid accidentally modifying the original data.  In our case, it's not a big deal since we could just reload the experimental object, but we can save ourselves some typing by saving the proteome data and metadata into separate variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1392a22-2bab-48d8-b7ff-fa887d591b97",
   "metadata": {},
   "outputs": [],
   "source": [
    "#loading the experimental object and extract the metadata\n",
    "proteome <- exp_obj$data$proteomics$log2\n",
    "metadata <- exp_obj$metadata\n",
    "metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b896f15-f0e6-4fa1-a1a3-9d93922f2537",
   "metadata": {},
   "source": [
    "Remember that our data was log2 transformed.  Because the proteomic signal data spans a wide range of values, log transforming the data linearizes the data making it more normally distributed.  Let's look at a boxplot of the data to see how it looks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "550d9b2f-cfad-4be6-b724-05731f78e5c9",
   "metadata": {},
   "outputs": [],
   "source": [
    "head(proteome)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb0b150e-66b5-4496-a45e-0199e56ad2f5",
   "metadata": {},
   "outputs": [],
   "source": [
    "boxplot(proteome)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "94b5376d-7e3b-4939-ac5b-e03df9198370",
   "metadata": {},
   "source": [
    "We can go a step further and quantile normalize the data.  In quantile normalization, we are making the distribution of the data in each sample identical in their statistical properties.  We will use the <i>normalize.quantiles</i> function from the <b>preprocessCore</b> package to do the normalization. This step is important to decrease variations or noise that could cause results to skew within the dataset. This also provides power to downstream analysis by grouping liked values together to select important features which we will be doing in submodule 9. \n",
    "\n",
    "<div class=\"alert alert-block alert-success\">\n",
    "<b>&#9997; Reference:</b> For more information see the https://academic.oup.com/bib/article/19/1/1/2562889\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1fffa722-9e8b-4ce5-b9d5-09b8e2f936cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "#normalize the proteome data\n",
    "proteome_norm <- normalize.quantiles(proteome)\n",
    "rownames(proteome_norm) <- rownames(proteome)\n",
    "colnames(proteome_norm) <- colnames(proteome)\n",
    "head(proteome_norm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "69680df6-904d-4b4b-af22-5dbd8f73f6d0",
   "metadata": {},
   "outputs": [],
   "source": [
    "boxplot(proteome_norm)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d8f9b224-7799-4026-aa1e-cd7a8aeba4af",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-danger\">\n",
    "<b>&#128721; Caution:</b> There are many ways to normalize data, and even different strategies for using quantile normalization.  No normalization method is perfect and some may even introduce error into your data.  Tools such as the <b>NormalyzerDE</b> package in BioConductor can help you determine the best normalization method for your data.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b8acff9-352d-406d-9a64-8a9ed39e4bfb",
   "metadata": {},
   "source": [
    "### Dimensionality Reduction (PCA and MDS)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7934e697-8bd0-4885-acff-1d1dcca34c4e",
   "metadata": {},
   "source": [
    "Now we'll begin the exploratory analysis.  Let's start with a principal component analysis (PCA).  PCA is useful to get a feel for similarities in our sample proteomes and if there are any batch effects we need to account for.  PCA is a dimensionality reduction technique in which we are linearly transforming our data to a new set of axes (called principal components) that define the variability in the data.  Typically, the first 2-3 principal components (PC) account for the majority of variation in the data, making visualizing the data much easier.  PCA also allows us to identify groups of correlated variables which help separate groups of similar samples (see biplots below). \n",
    "\n",
    "There are many ways to calculate and visualize PCA data in R.  We'll use the base R <i>prcomp</i> function and the visualization tools in the <b>factoextra</b> package.  First, let's create a scree plot to see which of our principal components are relevant.  The scree plot is a bar chart showing the amount of variability accounted for by each PC."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d5a1e61d-8af5-4f7b-a560-3cba7001ecc5",
   "metadata": {},
   "outputs": [],
   "source": [
    "#the Dimensions shown in the plot below can also be called Principal Components\n",
    "proteome_pca <- prcomp(t(na.omit(proteome_norm)), center = TRUE, scale. = TRUE)\n",
    "fviz_eig(proteome_pca)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a654d9c-1149-47e4-9518-5df97e71b076",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\">\n",
    "    <b>&#127891; Note:</b> Reminder that PCA decomposes the data matrix <b>X</b> into matrices <b>U</b> and <b>V</b>, where\n",
    "\n",
    "$$X = U \\cdot V = UV^T$$\n",
    "\n",
    "$U$ = Scores matrix, which is the original data rotated into the new coordinate system\n",
    "\n",
    "$V$ = Loadings matrix, which is the weights applied to each data point in the new coordinate system\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e89e3d47-4e46-4c93-bec7-c3d19166f439",
   "metadata": {},
   "source": [
    "Most of the variability seems to be in the first two PC's, so that's what we'll plot.  Often you will see on a PCA plot that we include the information for both the individual data points and the variables (we call this a biplot).  This is often useful to us because it gives us an indication of which variables are defining the groups we see in our PCA.  We won't plot the variables right now since adding 2000+ proteins to the plot will be too messy to be useful.  In general though, the biplot tends to follow the same patterns as the linear regression results from differential analysis, which we will see in the next chapter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "165287e3-4354-47b6-a90c-89b9c07e0172",
   "metadata": {},
   "outputs": [],
   "source": [
    "#plot the PCA plot by Treatment\n",
    "fviz_pca_ind(proteome_pca,\n",
    "             col.ind = metadata$Treatment,\n",
    "             addEllipses = TRUE,\n",
    "             repel = TRUE,\n",
    "             ellipse.type = \"confidence\",\n",
    "             legend.title = \"Treatment\",\n",
    "             pointsize = 3,\n",
    "             label = \"none\"\n",
    "             )\n",
    "#plot the PCA plot by Time\n",
    "fviz_pca_ind(proteome_pca,\n",
    "             col.ind = metadata$Time,\n",
    "             addEllipses = TRUE,\n",
    "             repel = TRUE,\n",
    "             ellipse.type = \"confidence\",\n",
    "             legend.title = \"Time\",\n",
    "             pointsize = 3,\n",
    "             label = \"none\"\n",
    "             )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14751ca6-3fc7-474c-9306-21f564c9249a",
   "metadata": {},
   "source": [
    "At first glance, there appears to be a strong separation by time, with earlier time points primarily in their own cluster.  However, if you remember, this data was collected in batches and therefore we have to account for any potential batch effects.  Let's replot the data based on that batch column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9b6a1cd7-aecd-4f8b-a0db-b1bf9b91a0c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "fviz_pca_ind(proteome_pca,\n",
    "             col.ind = metadata$Batch,\n",
    "             addEllipses = TRUE,\n",
    "             repel = TRUE,\n",
    "             ellipse.type = \"confidence\",\n",
    "             legend.title = \"Batch\",\n",
    "             pointsize = 3,\n",
    "             label = \"none\"\n",
    "             )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "07a56a00-0797-452e-894f-f15066ebf219",
   "metadata": {},
   "source": [
    "We can now clearly see that this separation is based on the batch effect which must be corrected.  For the purposes of exploratory analysis, we will use the <i>removeBatchEffect</i> function in <b>limma</b>."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "68ecfc98-b769-4672-93f1-f7dcdc3b7d43",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-danger\">\n",
    "<b>&#128721; Caution:</b> Batch correction can be tricky in that it can sometimes introduce strange artifacts into your data.  Using <i>removeBatchEffect</i> is fine for exploratory analysis, but for differential analysis, it is better to include the batch column as an additional covariate or as a random effect in a linear mixed model. We will use this method when we do the differential analysis.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7927e53c-ce01-467f-9540-f4ae86645c7f",
   "metadata": {},
   "outputs": [],
   "source": [
    "proteome_norm_batch <- removeBatchEffect(proteome_norm, batch = metadata$Batch)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "af90c76f-b73d-4083-aaed-03b6d408023b",
   "metadata": {},
   "outputs": [],
   "source": [
    "#replot the scree plot and PCA plots without batch effect\n",
    "proteome_norm_batch_pca <- prcomp(t(na.omit(proteome_norm_batch)), center = TRUE, scale. = TRUE)\n",
    "fviz_eig(proteome_norm_batch_pca)\n",
    "fviz_pca_ind(proteome_norm_batch_pca,\n",
    "             col.ind = metadata$Treatment,\n",
    "             addEllipses = TRUE,\n",
    "             repel = TRUE,\n",
    "             ellipse.type = \"confidence\",\n",
    "             legend.title = \"Treatment\",\n",
    "             pointsize = 3,\n",
    "             label = \"none\"\n",
    "             )\n",
    "fviz_pca_ind(proteome_norm_batch_pca,\n",
    "             col.ind = metadata$Time,\n",
    "             addEllipses = TRUE,\n",
    "             repel = TRUE,\n",
    "             ellipse.type = \"confidence\",\n",
    "             legend.title = \"Time\",\n",
    "             pointsize = 3,\n",
    "             label = \"none\"\n",
    "             )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92393218-2f29-44d8-aa8f-5f8d3461c041",
   "metadata": {},
   "source": [
    "Now we see that the batch effect is gone, but our data also doesn't cluster as strongly, though there does seem to be some separation in later time points.  Let's clean these plots up by limiting the PCA to only the 100 most highly variable proteins.  This code may look a bit complicated, but what it is doing is calculating the variance of each row, sorting the rows, and extracting the top 100 most variable rows, and then run the PCA on that subset.  In this way, we might see stronger separation."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74eb49c9-9679-4de9-a95f-10d7d8b61d1c",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-danger\">\n",
    "<b>&#128721; Caution:</b> For exploratory analysis, limiting the analysis to the most variable rows is fine.  When we do differential analysis however, we will need to include most or all data points in our regression model.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3c8da43d-ae21-4ca9-b84b-257f96e64030",
   "metadata": {},
   "outputs": [],
   "source": [
    "#scree plot with the top 100 variable proteins\n",
    "high_var = order(apply(proteome_norm_batch, 1, var), decreasing=TRUE)[1:100]\n",
    "proteome_norm_batch_pca_sub <- prcomp(t(na.omit(proteome_norm_batch[high_var,])), center = TRUE, scale. = TRUE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11573cb2-3f25-43b9-b451-fe6b001fc324",
   "metadata": {},
   "outputs": [],
   "source": [
    "fviz_eig(proteome_norm_batch_pca_sub)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "54c496c5-3dd8-4693-9848-9ef513f270c1",
   "metadata": {},
   "source": [
    "Our new scree plot suggests a fair amount of variability in the third PC, so we can plot that as well. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b821e94-e225-434f-84f1-1631af07deef",
   "metadata": {},
   "outputs": [],
   "source": [
    "fviz_pca_ind(proteome_norm_batch_pca_sub,\n",
    "             col.ind = metadata$Treatment,\n",
    "             addEllipses = TRUE,\n",
    "             repel = TRUE,\n",
    "             ellipse.type = \"confidence\",\n",
    "             legend.title = \"Treatment\",\n",
    "             pointsize = 3,\n",
    "             label = \"none\"\n",
    "             )\n",
    "#PCA plot looking at the 3 PC or dimension based on the Treatment variable\n",
    "fviz_pca_ind(proteome_norm_batch_pca_sub,\n",
    "             axes = c(1,3),\n",
    "             col.ind = metadata$Treatment,\n",
    "             addEllipses = TRUE,\n",
    "             repel = TRUE,\n",
    "             ellipse.type = \"confidence\",\n",
    "             legend.title = \"Treatment\",\n",
    "             pointsize = 3,\n",
    "             label = \"none\"\n",
    "             )\n",
    "fviz_pca_ind(proteome_norm_batch_pca_sub,\n",
    "             col.ind = metadata$Time,\n",
    "             addEllipses = TRUE,\n",
    "             repel = TRUE,\n",
    "             ellipse.type = \"confidence\",\n",
    "             legend.title = \"Time\",\n",
    "             pointsize = 3,\n",
    "             label = \"none\"\n",
    "             )\n",
    "#PCA plot looking at the 3 PC or dimension based on the Time variable\n",
    "fviz_pca_ind(proteome_norm_batch_pca_sub,\n",
    "             axes = c(1,3),\n",
    "             col.ind = metadata$Time,\n",
    "             addEllipses = TRUE,\n",
    "             repel = TRUE,\n",
    "             ellipse.type = \"confidence\",\n",
    "             legend.title = \"Time\",\n",
    "             pointsize = 3,\n",
    "             label = \"none\"\n",
    "             )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "441984dc-e5e9-45b6-90cd-59ca8ad93b96",
   "metadata": {},
   "source": [
    "Now we can start to see some separation between the control/sham and the placebo/treprostinil samples, and maybe some separation between placebo and treprostinil samples on the third PC.  This is consistent with what we saw on the scatter plots in the previous chapter.  We want to be careful not to read too much into the PCA plots, but this suggests that when we do the differential analysis, we will probably need to account for both covariates in some fashion.  Let's see which proteins are driving this separation.  Since we are only looking at 100 proteins, let's see what the variable plot looks like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6c136d11-7940-4480-a150-171792ab6205",
   "metadata": {},
   "outputs": [],
   "source": [
    "fviz_pca_var(proteome_norm_batch_pca_sub,\n",
    "             col.ind = metadata$Treatment,\n",
    "             addEllipses = FALSE,\n",
    "             repel = TRUE,\n",
    "             )\n",
    "fviz_pca_var(proteome_norm_batch_pca_sub,\n",
    "             axes = c(1,3),\n",
    "             col.ind = metadata$Treatment,\n",
    "             addEllipses = FALSE,\n",
    "             repel = TRUE,\n",
    "             )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32a12aca-82fc-4a04-9809-3833e5dcd103",
   "metadata": {},
   "source": [
    "The variable plot suggests that there are clusters of correlated proteins driving separation of the groups along PCs 1 and 2 and maybe 3.  The length informs contribution strength while direction determines effect on PC."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c41a598a-460f-4a85-b8f0-dab43aab694e",
   "metadata": {},
   "source": [
    "We can also combine these two views to make a biplot which shows all of the data points and associations on the same plot.  This makes it easier to see how different features (in this case, protein expressions) are driving the variability in the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "754947eb-1a23-4595-879b-73a691d27d05",
   "metadata": {},
   "outputs": [],
   "source": [
    "#compares the 1st vs 2nd PCs based on the Treatment variable\n",
    "fviz_pca_biplot(proteome_norm_batch_pca_sub,\n",
    "             col.ind = metadata$Treatment,\n",
    "             addEllipses = TRUE,\n",
    "             repel = TRUE,\n",
    "             ellipse.type = \"confidence\",\n",
    "             legend.title = \"Treatment\",\n",
    "             pointsize = 3,\n",
    "             label = \"none\"\n",
    "             )\n",
    "#compares the 1st vs 3nd PCs based on the Treatment variable\n",
    "fviz_pca_biplot(proteome_norm_batch_pca_sub,\n",
    "             axes = c(1,3),\n",
    "             col.ind = metadata$Treatment,\n",
    "             addEllipses = TRUE,\n",
    "             repel = TRUE,\n",
    "             ellipse.type = \"confidence\",\n",
    "             legend.title = \"Treatment\",\n",
    "             pointsize = 3,\n",
    "             label = \"none\"\n",
    "             )\n",
    "#compares the 1st vs 2nd PCs based on the Time variable\n",
    "fviz_pca_biplot(proteome_norm_batch_pca_sub,\n",
    "             col.ind = metadata$Time,\n",
    "             addEllipses = TRUE,\n",
    "             repel = TRUE,\n",
    "             ellipse.type = \"confidence\",\n",
    "             legend.title = \"Time\",\n",
    "             pointsize = 3,\n",
    "             label = \"none\"\n",
    "             )\n",
    "#compares the 1st vs 3nd PCs based on the Time variable\n",
    "fviz_pca_biplot(proteome_norm_batch_pca_sub,\n",
    "             axes = c(1,3),\n",
    "             col.ind = metadata$Time,\n",
    "             addEllipses = TRUE,\n",
    "             repel = TRUE,\n",
    "             ellipse.type = \"confidence\",\n",
    "             legend.title = \"Time\",\n",
    "             pointsize = 3,\n",
    "             label = \"none\"\n",
    "             )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31412538-77af-4e83-88f3-7b0d9f4ccf83",
   "metadata": {},
   "source": [
    "When we do differential analysis in the next sub-module, we should expect many of these same proteins to come out in the analysis.  Just as a second check, let's look at the data using classical multidimensional scaling (MDS).  MDS is a dimensionality reduction technique that uses a pairwise distance matrix generated from the data to map the data points into a new coordinate frame."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e05f067-a078-4b79-80d2-78a0a1dcba72",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\">\n",
    "<b>&#127891; Note:</b> PCA is a special case of MDS (MDS applied to Euclidean distances).\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2fb7aaa8-2a02-4005-a72e-92281e3d33f8",
   "metadata": {},
   "outputs": [],
   "source": [
    "#MDS plot based on Treatment\n",
    "col.cell <- c(\"magenta\",\"orange\", \"blue\", \"cyan\")[metadata$Treatment]\n",
    "plotMDS(na.omit(proteome_norm_batch), col = col.cell)\n",
    "legend(\"topleft\",fill=c(\"magenta\",\"orange\", \"blue\", \"cyan\"),legend=levels(metadata$Treatment))\n",
    "\n",
    "#MDS plot based on Time\n",
    "col.cell <- c(\"magenta\",\"orange\", \"blue\", \"cyan\", \"black\", \"green\")[metadata$Time]\n",
    "plotMDS(na.omit(proteome_norm_batch), col = col.cell)\n",
    "legend(\"topleft\",fill=c(\"magenta\",\"orange\", \"blue\", \"cyan\", \"black\", \"green\"),legend=levels(metadata$Time))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2b90972-21b6-4c09-adeb-b4e3b3497a36",
   "metadata": {},
   "source": [
    "These plots are consistent with the PCA plots.  While the groupings are not as strongly separated as we would like, this is common in real data which is why the differential analysis is so important.  Now let's look at the data as a heat map to see if any patterns emerge.  First, we will <i>z</i>-scale the data which centers the data on 0 and makes it easier to distinguish differences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7beb388e-bfee-470c-bfb9-3075ee7eba05",
   "metadata": {},
   "outputs": [],
   "source": [
    "scaled_data <- t(scale(t(proteome_norm_batch)))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "135e5030-e056-438c-9f7f-0ddc956215a1",
   "metadata": {},
   "source": [
    "Now we'll build a heatmap using BioConductor's <b>ComplexHeatmap</b> package.  First we'll define the annotations using the <i>HeatmapAnnotation</i> and then send those annotations to the <i>Heatmap</i> function.  Our annotations will include colored column labels for our two covariates (Treatment and Time) with accompanying legends."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1112eed2-0550-41e0-9755-e8400e777528",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "    <b>&#9995; Tip:</b> There are many ways to generate heatmaps in R.  We use <b>ComplexHeatmap</b> because it's part of BioConductor and it easily makes customizable publication quality figures.  Others options inclued the <b>heatmaps</b> package in BioConductor, the <i>heatmap.2</i> function in the <b>gplots</b> package, the <i>geom_tile</i> geometry in <b>ggplot2</b></div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a6e24a77-6511-4596-b306-df4be74428fb",
   "metadata": {},
   "outputs": [],
   "source": [
    "ht_opt$message = FALSE\n",
    "\n",
    "# HeatmapAnnotation object\n",
    "column_ha <- HeatmapAnnotation(\n",
    "    Treatment = metadata$Treatment, \n",
    "    Time = metadata$Time,\n",
    "  col = list(\n",
    "    Treatment = c(\"CTRL\" = \"blue\", \"PLB\" = \"cyan\", \"SHAM\" = \"yellow\", \"TRE\" = \"magenta\"),\n",
    "    Time = c(\"0\" = \"violet\", \"1\" = \"blue\", \"3\" = \"green\", \"6\" = \"yellow\", \"24\" = \"orange\", \"48\" = \"red\")\n",
    "  ),\n",
    "  gp = gpar(col = \"black\")\n",
    ")\n",
    "\n",
    "# Generate heatmap\n",
    "Heatmap(\n",
    "  scaled_data,\n",
    "  name = \"Z\",\n",
    "  top_annotation = column_ha,\n",
    "  show_row_names = FALSE,\n",
    "  show_column_names = FALSE\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73e58d5e-2d37-4b85-91a7-56e7d8b50859",
   "metadata": {},
   "source": [
    "While we do see some patterns with the full data, it might be useful to pare this down to the most variable proteins.  Let's start with 50 for clarity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ca21c600-d8d9-4bbd-8bee-d1c447e0cb40",
   "metadata": {},
   "outputs": [],
   "source": [
    "#narrown data set to top 50 proteins\n",
    "high_var_50 = order(apply(proteome_norm_batch, 1, var), decreasing=TRUE)[1:50]\n",
    "scaled_data_50 <- t(scale(t(proteome_norm_batch[high_var_50,])))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "933c9ca7-af48-4fe8-a2e6-50d9fbee8901",
   "metadata": {},
   "outputs": [],
   "source": [
    "Heatmap(\n",
    "  scaled_data_50,\n",
    "  name = \"Z\",\n",
    "  top_annotation = column_ha,\n",
    "  show_row_names = TRUE,\n",
    "  show_column_names = FALSE\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f55595fa-d61e-48ce-89d6-2e873df4f549",
   "metadata": {},
   "source": [
    "In this view, we see many of the same genes that we saw appear in our biplot above.  There is some grouping based on Treatment and Time so we will try to account for this in the differential analysis in the next chapter."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22e7848e-55fa-4d8b-ac54-4a82e5368e91",
   "metadata": {},
   "source": [
    "Finally, we'll save our normalized data to our experimental object so that we don't have to regenerate it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b0a3eb81-2af9-4df3-b982-0afd67a698db",
   "metadata": {},
   "outputs": [],
   "source": [
    "exp_obj$data$proteomics$norm <- proteome_norm\n",
    "saveRDS(exp_obj, file = \"data/Saved_Data/exp_obj.rds\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10aa1974-b6bc-4e9a-8a52-86d395cb6f77",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "360c119d-44f0-441e-b5e6-17324dde8ebb",
   "metadata": {},
   "source": [
    "### Results"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2a0d4d0-5511-41cc-ba66-22942f3767ab",
   "metadata": {},
   "source": [
    "Our exploratory analysis of the proteomic data suggests possible effects from both treatment state and time points affecting protein concentrations in the IRI samples.  We have identified proteins most variable between the sample groups and which may be potential biomarkers for different IRI states.  In the next chapter, we will conduct a proper differential analysis of the proteome data to determine which proteins are significantly changed between states."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "54066b0a-5d5b-4d04-855e-0b2ac7d5a7f6",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2d998717-e13e-4176-a86b-081d52c88636",
   "metadata": {},
   "source": [
    "<p><span style=\"font-size: 30px\"><b>Quizzes</b></span> <span style=\"float : inline;\">(run the command below to display the quizzes)</span> </p>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "534a0802-1bf4-41e8-b1d8-fb002582aaad",
   "metadata": {},
   "outputs": [],
   "source": [
    "IRdisplay::display_html('<iframe src=\"quizes/Chapter7_Quizes.html\" width=100% height=450></iframe>')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dcf88f65",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fdccd26d",
   "metadata": {},
   "outputs": [],
   "source": [
    "sessionInfo()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fb527456",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9129babf",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "This exploratory analysis provided valuable insights into the rat renal IRI proteomic dataset. Dimensionality reduction techniques like PCA and MDS, coupled with visualization methods such as heatmaps and biplots, revealed potential protein biomarkers associated with treatment and time post-IRI. While the observed clustering patterns require further investigation through rigorous differential analysis, the identified highly variable proteins offer promising leads for understanding the biological mechanisms underlying IRI and the potential therapeutic effects of Treprostinil. The normalized data has been saved for subsequent analysis in the next sub-module, where statistical testing will determine the significance of these initial findings."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50a1adae",
   "metadata": {},
   "source": [
    "## Clean up\n",
    "\n",
    "Remember to move to the next notebook or shut down your instance if you are finished."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f5a79b9-6625-4748-824c-64c1ce347447",
   "metadata": {},
   "source": [
    "<div style=\"display: flex; justify-content: center; margin-top: 20px; width: 100%;\"> \n",
    "    <div style=\"display: flex; justify-content: space-between; width: 50%;\"> \n",
    "        <div> \n",
    "            <a href=https://github.com/NIGMS/Analysis-of-Biomedical-Data-for-Biomarker-Discovery/blob/master/GoogleCloud/Submodule06_Linear_and_Logistic_Regression_Biomarkers.ipynb#overview>Previous section</a>                                            \n",
    "        </div> \n",
    "        <div> \n",
    "            <a href=\"#top7\">Top of this page</a>                                                      \n",
    "        </div> \n",
    "        <div> \n",
    "            <a href=https://github.com/NIGMS/Analysis-of-Biomedical-Data-for-Biomarker-Discovery/blob/master/GoogleCloud/Submodule08_Differential_Analysis_Proteomics.ipynb#overview>Next section</a>\n",
    "        </div> \n",
    "    </div>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99f80099-1044-4d04-a212-5d84b4a22d15",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e8b4a3ad-acf8-4c8b-94f2-11f2c0c30284",
   "metadata": {},
   "source": [
    "[Hou J, Tolbert E, Birkenbach M, Ghonem NS. Treprostinil alleviates hepatic mitochondrial injury during rat renal ischemia-reperfusion injury. Biomed Pharmacother. 2021 Nov;143:112172. doi: 10.1016/j.biopha.2021.112172. Epub 2021 Sep 21. PMID: 34560548; PMCID: PMC8550798.][hou]<br>\n",
    "[Ding M, Tolbert E, Birkenbach M, Gohh R, Akhlaghi F, Ghonem NS. Treprostinil reduces mitochondrial injury during rat renal ischemia-reperfusion injury. Biomed Pharmacother. 2021 Sep;141:111912. doi: 10.1016/j.biopha.2021.111912. Epub 2021 Jul 15. PMID: 34328097; PMCID: PMC8429269.][ding]<br>\n",
    "\n",
    "[ding]: https://pubmed.ncbi.nlm.nih.gov/34328097/ \"Ding M, Tolbert E, Birkenbach M, Gohh R, Akhlaghi F, Ghonem NS. Treprostinil reduces mitochondrial injury during rat renal ischemia-reperfusion injury. Biomed Pharmacother. 2021 Sep;141:111912. doi: 10.1016/j.biopha.2021.111912. Epub 2021 Jul 15. PMID: 34328097; PMCID: PMC8429269.\"\n",
    "[hou]: https://pubmed.ncbi.nlm.nih.gov/34560548/ \"Hou J, Tolbert E, Birkenbach M, Ghonem NS. Treprostinil alleviates hepatic mitochondrial injury during rat renal ischemia-reperfusion injury. Biomed Pharmacother. 2021 Nov;143:112172. doi: 10.1016/j.biopha.2021.112172. Epub 2021 Sep 21. PMID: 34560548; PMCID: PMC8550798.\"\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R (Local)",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "4.4.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}