{
"cells": [
{
"cell_type": "markdown",
"id": "dd89e273-6760-4654-8014-71ac66b08a38",
"metadata": {},
"source": [
"<img src=\"images/RIINBRE-Logo.jpg\" width=\"400\" height=\"400\"><img src=\"images/MIC_Logo.png\" width=\"600\" height=\"600\">"
]
},
{
"cell_type": "markdown",
"id": "e5a71e1d-cc9e-4527-bff0-27cf4eb5ed48",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "43c08b43-5eb3-4fc2-a2f9-0ace910cfef3",
"metadata": {},
"source": [
"# Analysis of Biomedical Data for Biomarker Discovery\n",
"<a id=\"top3\"></a>\n",
"## Submodule 3: Introduction to Linear Models\n",
"### Dr. Christopher L. Hemme\n",
"### Director, [RI-INBRE Molecular Informatics Core](https://web.uri.edu/riinbre/mic/)\n",
"### The University of Rhode Island College of Pharmacy"
]
},
{
"cell_type": "markdown",
"id": "2c80b7b6-f550-406f-ab15-3b52ccf950f1",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "6972c05f",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"This notebook provides an introduction to linear models, focusing on their application in biomarker discovery from biomedical data. It begins with a brief overview of vectors and matrices, including basic operations like addition, multiplication, and the inner product, illustrated with examples and R code visualizations. The notebook then introduces the generalized linear model (GLM) framework, explaining its components: the probability distribution function, linear predictor, and link function. It subsequently delves into linear regression, ANOVA, and t-tests as special cases of the GLM, explaining the concept of least squares and demonstrating simple and multiple linear regression using the `iris` dataset in R. Logistic regression is then introduced for binary classification problems, demonstrating its use with the `mtcars` dataset and explaining the logit function, odds ratio, sensitivity, specificity, and ROC curves. The notebook concludes with a discussion of experimental design considerations and more advanced linear models like multi-factorial models, repeated-measures models, multivariate models, linear mixed models, and non-linear models. Quizzes and references for further reading are also included. The notebook uses R code throughout for calculations and visualizations, leveraging packages like `matlib`, `pROC`, and `glm2`."
]
},
{
"cell_type": "markdown",
"id": "2b14f02e",
"metadata": {},
"source": [
"## Learning Objectives\n",
"\n",
"+ **Introduce fundamental concepts of linear algebra:** This includes understanding vectors and matrices, performing basic vector and matrix operations (addition, scalar multiplication, transposition, inner product), and understanding their geometric interpretations.\n",
"\n",
"+ **Explain the Generalized Linear Model (GLM):** The notebook introduces the GLM framework, emphasizing its components (probability distribution, linear predictor, link function) and how specific models like linear regression, ANOVA, and t-tests are special cases of the GLM. It also covers the concept of a design matrix and its role in the GLM.\n",
"\n",
"+ **Teach practical application of linear regression:** Learners are shown how to perform simple and multiple linear regression in R using the `lm()` function, interpret the results (coefficients, R-squared), and visualize the relationships between variables. The importance of considering different species in the analysis and how this influences the interpretation of results are discussed.\n",
"\n",
"+ **Introduce logistic regression for binary classification:** The notebook explains the limitations of linear regression for binary data and introduces logistic regression as a more appropriate model. It covers the logit function, interpretation of coefficients (log odds, odds ratio), and evaluation of model performance using ROC curves and AUC. It demonstrates how to perform logistic regression in R using the `glm()` function with the binomial family.\n",
"\n",
"+ **Provide a foundation for further exploration:** The notebook encourages learners to explore more advanced topics like multi-factorial models, repeated-measures models, multivariate models, linear mixed models, and non-linear models, providing relevant resources for continued learning. It emphasizes the importance of experimental design and its connection to data analysis."
]
},
{
"cell_type": "markdown",
"id": "11fae0a4",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"1. **Create a Vertex AI Notebooks instance:** Choose your machine type and other configurations.\n",
"2. **Select the R kernel:** Make sure you choose the appropriate R environment when creating or opening the notebook.\n",
"3. **Install R Packages:** The notebook itself handles the installation of the necessary R packages (`matlib`, `pROC`, `glm2`, `tidyverse`).\n",
"4. **Data Storage:** Upload your data to a Cloud Storage bucket and ensure your service account has access to read the data."
]
},
{
"cell_type": "markdown",
"id": "36d8f98c-8593-4359-9bc7-4b9b2b80d95a",
"metadata": {},
"source": [
"## Introduction"
]
},
{
"cell_type": "markdown",
"id": "d0832c47-fbe1-4e3f-87b6-0d26f4a02688",
"metadata": {},
"source": [
"At some point in high school algebra, you probably asked yourself \"When am I ever going to have to use this?\" The answer is \"Today.\" What you learned in algebra - the equation of a line, solving systems of linear equations, etc. - were special cases of a broader field of mathematics called <b>linear algebra</b>. In the real world, we rarely see the clean ideal data sets that we worked with in high school algebra. Instead, we are often working with messy datasets that are influenced by many types of variability. Even though the data is messy, it can often still be approximated by linear models, and these models form the basis of a wide range of analytical techniques in STEM fields. Linear algebra is the study of linear equations and their representations in vector space. This means that instead of the simple data sets we worked with in algebra, we're usually working with datasets represented as vectors and matrices, and the R programming language that we use for this learning module is designed specifically to work with these data structures (see <b>Submodule 6: Introduction to R Data Structures</b>). Linear models are particularly important in biomedical and bioinformatics data analysis as they form the basis of linear regression, ANOVA, the <i>t</i>-test, principal components analysis, and many other techniques that are commonly used in omics data analysis.\n",
"\n",
"In this submodule we will cover basic vector and matrix operations, the generalized linear model and its applications, and common applications of the GLM (specifically linear and logistic regression). This will provide you the foundation for understanding the proteomics analyses covered in <b>Submodule 8 - Identification of IRI Biomarkers from Proteomics Data</b>. For students interested in a career in bioinformatics, we strongly recommend taking a formal course in linear algebra which will cover these topics and more in much greater detail. We also provide references at the end that users might find useful for further study."
]
},
{
"cell_type": "markdown",
"id": "4b9e6dc3-2ea0-48be-930f-27cc49223379",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-info\">\n",
"<b>✋ Tip:</b> Blue boxes will indicate helpful tips.</div>"
]
},
{
"cell_type": "markdown",
"id": "22eada68-df92-4f45-802a-92a3e6615633",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
"<b>🎓 Note:</b> Used for interesting asides or notes.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "50348e42-0b42-44c7-aa4f-c47230682cea",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-success\">\n",
"<b>✍ Reference:</b> This box indicates a reference for an attached figure or table.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "1504b86e-cccd-4a38-b1c6-c56f0af2e46a",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-danger\">\n",
"<b>🛑 Caution:</b> A red box indicates potential hazards or pitfalls you may encounter.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "e8eaff7b-f00e-43a1-83ae-229e9a931f61",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "f1d8666c-d158-4996-a8c9-89d585b0a165",
"metadata": {},
"source": [
"## Get Started\n",
"### Load R Modules"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5d15e119-9be0-477e-a6a2-929fa846eb27",
"metadata": {},
"outputs": [],
"source": [
"packages <- c(\"matlib\", \"pROC\", \"glm2\")\n",
"installed_packages <- packages %in% rownames(installed.packages())\n",
"if (any(installed_packages == FALSE)) {install.packages(packages[!installed_packages])}\n",
"\n",
"#matlib is a package for teaching linear algebra concepts\n",
"#glm2 is a package for fitting generalized linear models\n",
"#pROC is a package for plotting ROC curves"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "98fd3753-3e6e-44ab-a77e-31784e1716a7",
"metadata": {},
"outputs": [],
"source": [
"require('tidyverse')\n",
"require('glm2')\n",
"require('pROC')\n",
"require('matlib')"
]
},
{
"cell_type": "markdown",
"id": "65f4d5af-6b97-4bca-a5bc-b0ff2828417b",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "cfd8455a-3f95-4d5d-8237-9efab81b13c9",
"metadata": {},
"source": [
"### Vectors and Vector Operations"
]
},
{
"cell_type": "markdown",
"id": "a2eb2f3c-1c49-4723-ba98-352baacac2af",
"metadata": {},
"source": [
"A vector is an ordered list of elements, by convention ordered vertically. We can refer to a vector of length <i>n</i> as an <b><i>n</i>-vector</b>."
]
},
{
"cell_type": "markdown",
"id": "a475b420-30ff-4aab-8d57-a55cc626511a",
"metadata": {},
"source": [
"2-Vectors: $$a = \\begin{bmatrix} 6 \\\\ 4\\end{bmatrix}$$ $$b = \\begin{bmatrix} 5 \\\\ 2\\end{bmatrix}$$"
]
},
{
"cell_type": "markdown",
"id": "61a67bff-7e26-4805-97bd-5c5548b69346",
"metadata": {},
"source": [
"A vector can be represented as a geometric object. The vector <i>a</i> above can be represented in two-dimensional Cartesian space as an arrow originating at the origin and ending at the point (6,4). This means that the vector has both a direction and a magnitude (length). The magnitude is calculated by:"
]
},
{
"cell_type": "markdown",
"id": "68325eda-5b17-4e45-9fa2-a8a118989601",
"metadata": {},
"source": [
"$$\\lvert\\lvert x \\rvert\\rvert = \\sqrt{x_1^2+x_2^2+...x_n^2}$$\n",
"\n",
"$$\\lvert\\lvert a \\rvert\\rvert = \\sqrt{a_1^2+a_2^2}$$\n",
"$$= \\sqrt{6^2 + 4^2}$$\n",
"$$= \\sqrt{52}$$\n",
"$$= 7.2$$\n",
"\n",
"$$\\lvert\\lvert b \\rvert\\rvert = \\sqrt{b_1^2+b_2^2}$$\n",
"$$= \\sqrt{5^2 + 2^2}$$\n",
"$$= \\sqrt{29}$$\n",
"$$= 5.4$$"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9feb1473-b47c-454f-96a6-f104c11b19a3",
"metadata": {},
"outputs": [],
"source": [
"xlim <- c(0,6)\n",
"ylim <- c(0,6)\n",
"plot(xlim, ylim, type=\"n\")\n",
"grid()\n",
"a=c(6,4)\n",
"vectors(a, labels=\"||a|| = 7.2\", pos.lab=2, frac.lab=.5, col = \"blue\")\n",
"\n",
"#This code uses base R and the matlib package for plotting a 2-vector in 2D space. We first plot an empty 6x6 grid, then overlay the vector a = [6,4]."
]
},
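{
"cell_type": "markdown",
"id": "b3e7a1c2-91d4-4f0a-8c5e-101010101001",
"metadata": {},
"source": [
"We can verify these magnitudes numerically. This is a minimal sketch using only base R; the helper function vec_norm below is defined here for convenience and is not part of any package."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b3e7a1c2-91d4-4f0a-8c5e-101010101002",
"metadata": {},
"outputs": [],
"source": [
"# Euclidean norm (magnitude): square root of the sum of squared elements\n",
"vec_norm <- function(v) sqrt(sum(v^2))\n",
"a <- c(6,4)\n",
"b <- c(5,2)\n",
"vec_norm(a) # sqrt(52), approximately 7.21\n",
"vec_norm(b) # sqrt(29), approximately 5.39"
]
},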
{
"cell_type": "markdown",
"id": "439c78b9-9c0c-4aef-870a-cc33eb611ddc",
"metadata": {},
"source": [
"Vectors of the same size can be added together."
]
},
{
"cell_type": "markdown",
"id": "78a69e73-08de-489b-ab18-21c240fb237b",
"metadata": {},
"source": [
"$$a = \\begin{bmatrix} a_1 \\\\ a_2\\end{bmatrix}, b = \\begin{bmatrix} b_1 \\\\ b_2\\end{bmatrix}$$\n",
"$$a + b = \\begin{bmatrix} a_1 + b_1 \\\\ a_2 + b_2\\end{bmatrix}$$"
]
},
{
"cell_type": "markdown",
"id": "d19d739a-20f7-4d44-888f-d75be122b2d7",
"metadata": {},
"source": [
"$$a = \\begin{bmatrix} 4 \\\\ 6\\end{bmatrix} b = \\begin{bmatrix} 5 \\\\ 2\\end{bmatrix}$$\n",
"$$a + b = \\begin{bmatrix} 9 \\\\ 8\\end{bmatrix}$$"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "200f14b9-465a-4a9e-90ab-22469d8546d8",
"metadata": {},
"outputs": [],
"source": [
"xlim <- c(0,9)\n",
"ylim <- c(0,9)\n",
"plot(xlim, ylim, type=\"n\", xlab = \"x\", ylab = \"y\")\n",
"grid()\n",
"a <- c(4,6)\n",
"b <- c(5,2)\n",
"vectors(a, labels=\"a\", pos.lab=4, frac.lab=.5, col = \"blue\")\n",
"vectors(b, labels=\"b\", pos.lab=4, frac.lab=.5, col = \"cyan\")\n",
"vectors(a+b, labels=\"a+b\", pos.lab=4, frac.lab=.5, col=\"magenta\")\n",
"vectors(a+b, labels=\"b\", pos.lab=4, frac.lab=.5, origin=a, col=\"cyan\")\n",
"\n",
"# We're now plotting vectors a, b, and a+b. When we plot b alone, it starts from the origin.\n",
"# The last command plots b starting from the end of a to better show the additive effect of the two vectors"
]
},
{
"cell_type": "markdown",
"id": "8cfd2388-cabe-4ad1-aefc-cdb5fd58d98d",
"metadata": {},
"source": [
"Finally, a vector can be multiplied by a scalar, that is, a constant value."
]
},
{
"cell_type": "markdown",
"id": "a92b856b-0122-463d-bca0-694fd07fedfa",
"metadata": {},
"source": [
"$$a = \\begin{bmatrix} a_1 \\\\ a_2\\end{bmatrix}$$\n",
"$$x * a = \\begin{bmatrix} x * a_1 \\\\ x * a_2\\end{bmatrix}$$"
]
},
{
"cell_type": "markdown",
"id": "8b42966d-5b67-4b61-8d1e-d8889b3e7e76",
"metadata": {},
"source": [
"$$a = \\begin{bmatrix} 4 \\\\ 6\\end{bmatrix}$$\n",
"$$2 * a = \\begin{bmatrix} 8 \\\\ 12\\end{bmatrix}$$"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed904851-9a78-4ab1-a3c8-f08cefc2445b",
"metadata": {},
"outputs": [],
"source": [
"xlim <- c(0,12)\n",
"ylim <- c(0,12)\n",
"plot(xlim, ylim, type=\"n\", xlab = \"x\", ylab = \"y\")\n",
"grid()\n",
"a <- c(4,6)\n",
"vectors(2*a, labels=\"2a\", pos.lab=4, frac.lab=.5, col=\"magenta\")\n",
"vectors(a, labels=\"a\", pos.lab=4, frac.lab=.5, col = \"blue\")\n",
"\n",
"# Using matlib to demonstrate the multiplicative effect of a scalar"
]
},
{
"cell_type": "markdown",
"id": "200ebc48-1405-4da5-a3ab-3cb37492ada3",
"metadata": {},
"source": [
"A vector can be transposed, that is, flipped on its side."
]
},
{
"cell_type": "markdown",
"id": "dcfb40bf-055a-42d1-902f-5a4dffab8551",
"metadata": {},
"source": [
"$$a = \\begin{bmatrix} a_1 \\\\ a_2\\end{bmatrix}$$\n",
"$$a^T = \\begin{bmatrix} a_1 & a_2\\end{bmatrix}$$"
]
},
{
"cell_type": "markdown",
"id": "cb21c3ac-5a01-4fa5-bfd2-959707dae60b",
"metadata": {},
"source": [
"$$a = \\begin{bmatrix} 4 \\\\ 6\\end{bmatrix}$$\n",
"$$a^T = \\begin{bmatrix} 4 & 6\\end{bmatrix}$$"
]
},
{
"cell_type": "markdown",
"id": "c42d98a4-7bf1-4989-9e12-2fa6ad363139",
"metadata": {},
"source": [
"We can now calculate one of the most important vector operations, the <b>inner product</b> (also called the <b>dot product</b>)."
]
},
{
"cell_type": "markdown",
"id": "59c6b377-18e8-4dc0-9c7d-15a1657f7c46",
"metadata": {},
"source": [
"$$a = \\begin{bmatrix} a_1 \\\\ a_2\\end{bmatrix}$$\n",
"$$b = \\begin{bmatrix} b_1 \\\\ b_2\\end{bmatrix}$$\n",
"$$a \\cdot b = a^Tb = a_1 * b_1 + a_2 * b_2$$"
]
},
{
"cell_type": "markdown",
"id": "646ec590-ad6d-47c8-a0c9-f9f857575c6b",
"metadata": {},
"source": [
"$$a = \\begin{bmatrix} 4 \\\\ 6\\end{bmatrix}$$\n",
"$$b = \\begin{bmatrix} 5 \\\\ 2\\end{bmatrix}$$\n",
"$$a \\cdot b= 4 * 5 + 6 * 2$$\n",
"$$a \\cdot b= 20 + 12$$\n",
"$$a \\cdot b= 32$$"
]
},
{
"cell_type": "markdown",
"id": "58d097f8-183e-497d-ac32-9518de592f02",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
"<b>🎓 Note:</b> Geometrically, the inner product is used to define the angle between two vectors by the equation:\n",
" $$cos\\;\\theta = \\frac{x \\cdot y}{||x||\\;||y||}$$\n",
" In the above examples, the angle $\\theta$ between <i>a</i> and <i>b</i> would be:\n",
" $$\\theta = arccos\\frac{a \\cdot b}{||a||\\;||b||}$$\n",
" $$\\theta = arccos\\frac{32}{7.2*5.4}$$\n",
" $$\\theta \\approx 0.602\\;radians \\approx 34.5^\\circ$$\n",
"</div>"
]
},
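{
"cell_type": "markdown",
"id": "c4f8b2d3-a2e5-4b1c-9d6f-202020202001",
"metadata": {},
"source": [
"We can check the inner product and the angle between the two vectors numerically. A quick sketch in base R: sum(a * b) gives the inner product as a plain number (the %*% operator would return a 1x1 matrix), and acos() recovers the angle."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c4f8b2d3-a2e5-4b1c-9d6f-202020202002",
"metadata": {},
"outputs": [],
"source": [
"a <- c(4,6)\n",
"b <- c(5,2)\n",
"inner <- sum(a * b) # 4*5 + 6*2 = 32\n",
"inner\n",
"theta <- acos(inner / (sqrt(sum(a^2)) * sqrt(sum(b^2)))) # angle in radians\n",
"theta # approximately 0.602 radians\n",
"theta * 180 / pi # approximately 34.5 degrees"
]
},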
{
"cell_type": "markdown",
"id": "f7677ff7-97ab-4c3f-83de-3491be62992f",
"metadata": {},
"source": [
"The inner product is the weighted sum of the elements of the vectors. Consider the following example using simulated real estate data. In which market can you expect to make the most profit?"
]
},
{
"cell_type": "markdown",
"id": "ea6d78b5-46eb-40fa-a93d-c376cfdd9f35",
"metadata": {},
"source": [
"<table>\n",
"<thead>\n",
" <tr><th>Market</th><th>Units for Sale</th><th>Average Price (in $1000)</th></tr>\n",
"</thead>\n",
"<tbody>\n",
" <tr><td>Texas</td><td>1000</td><td>750</td></tr>\n",
" <tr><td>Ohio</td><td>2500</td><td>300</td></tr>\n",
" <tr><td>Arizona</td><td>750</td><td>600</td></tr>\n",
" <tr><td>Alaska</td><td>500</td><td>300</td></tr>\n",
" <tr><td>Hawaii</td><td>250</td><td>1000</td></tr>\n",
"</tbody>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"id": "5d20baf2-3e0f-4502-81c7-dd6f7c900819",
"metadata": {},
"source": [
"You might think based on units for sale that Ohio would be the most valuable market. However, the Hawaii market has the highest average price. The inner product of units and average price can tell us not only the total potential profit we can expect but also the relative contribution of each market to the total profit."
]
},
{
"cell_type": "markdown",
"id": "bab31ad8-e78c-4f39-a736-ded5b7a5e37c",
"metadata": {},
"source": [
"$$Units = \\begin{bmatrix} 1000 \\\\ 2500 \\\\ 750 \\\\ 500 \\\\ 250\\end{bmatrix} Price = \\begin{bmatrix} 750 \\\\ 300 \\\\ 600 \\\\ 300 \\\\ 1000\\end{bmatrix}$$\n",
"$$Potential\\;Profit = Units \\cdot Price = Units^TPrice = 1000 * 750 + 2500 * 300 + 750 * 600 + 500 * 300 + 250 * 1000$$\n",
"$$Potential\\;Profit = 750000 + 750000 + 450000 + 150000 + 250000$$\n",
"$$Potential\\;Profit = 2,350,000$$"
]
},
{
"cell_type": "markdown",
"id": "cf473183-9613-486e-bfc0-09a922336176",
"metadata": {},
"source": [
"Let's illustrate this calculation graphically using R. We'll verify the result by calculating the inner product using the R <b>%*%</b> operator and then plot the results by market."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ca5d3ce7-8890-4266-b0f2-2acda5be055f",
"metadata": {},
"outputs": [],
"source": [
"market <- c(\"Texas\", \"Ohio\", \"Arizona\", \"Alaska\", \"Hawaii\")\n",
"units <- c(1000,2500,750,500,250)\n",
"price <- c(750,300,600,300,1000)\n",
"totalProfit <- units %*% price # %*% calculates the inner product of units and price\n",
"totalProfit"
]
},
{
"cell_type": "markdown",
"id": "7d5aec6d-54d3-4373-b575-4a07aea9995d",
"metadata": {},
"source": [
"Next, we create a bar plot of the components of the inner product. We first build a data frame containing the markets and their weighted profits.\n",
"\n",
"We then pipe marketProfit into ggplot using the __%>%__ operator, mapping Market and Profits to the x and y aesthetics. geom_bar draws the bar plot, ylab renames the y axis, and finally we set the theme to black and white and increase the font size."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c7d667c4-0e8f-4f6d-a889-31e822528fe1",
"metadata": {},
"outputs": [],
"source": [
"marketProfit <- data.frame(\n",
" Market = market,\n",
" Profits = c(750000, 750000, 450000, 150000, 250000)\n",
")\n",
"marketProfit %>%\n",
" ggplot(aes(x = Market, y = Profits)) +\n",
" geom_bar(stat = \"identity\") +\n",
" ylab(\"Potential Profits ($1000)\") +\n",
" theme_bw() +\n",
" theme(text=element_text(size = 24))"
]
},
{
"cell_type": "markdown",
"id": "e04c9f90-4698-42e9-bed3-a1955676ef3c",
"metadata": {},
"source": [
"The Texas market is highly profitable because of the high average price, but the Ohio market is equally valuable because it has more units to sell. The Hawaii market is not as valuable as we thought because of the low number of units available for sale. We can think of this as the unit data weighted by average housing prices (or vice versa). Of course, in a real market, you would expect other factors to be important as well, such as market demand. Fortunately, there is a way to account for these additional factors... but we're getting ahead of ourselves. Let's talk about matrices first."
]
},
{
"cell_type": "markdown",
"id": "aaf7ab02-b2e6-4f0a-81a0-4f725d18ab6a",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "59efe283-2de5-4387-b09e-29d3c4b1e224",
"metadata": {},
"source": [
"### Matrices and Matrix Operations"
]
},
{
"cell_type": "markdown",
"id": "84f23dc3-ff58-497f-97c5-204959d0e713",
"metadata": {},
"source": [
"A matrix is a two-dimensional array with dimensions <i>m</i> x <i>n</i>, with <i>m</i> being the number of rows and <i>n</i> being the number of columns. If <i>m</i> = <i>n</i>, then we have a <b>square matrix</b>. If the entries on the main diagonal of a square matrix are all 1 and all other entries are 0, we have an <b>identity matrix</b>."
]
},
{
"cell_type": "markdown",
"id": "e1b24063-2fa7-49d3-b803-2e593868f0a5",
"metadata": {},
"source": [
"3x5 Matrix: $$\\begin{bmatrix}1 & 4 & 7 & 2 & 6 \\\\ 2 & 7 & 2 & 2 & 8 \\\\ 4 & 9 & 7 & 7 & 1 \\end{bmatrix}$$\n",
"4x4 Square Matrix: $$\\begin{bmatrix} 6 & 4 & 3 & 7 \\\\ 2 & 1 & 1 & 4 \\\\ 9 & 1 & 3 & 8 \\\\ 6 & 3 & 2 & 1 \\end{bmatrix}$$ \n",
"4x4 Identity Matrix: $$\\begin{bmatrix} 1 & 0 & 0 & 0 \\\\ 0 & 1 & 0 & 0 \\\\0 & 0 & 1 & 0 \\\\ 0 & 0 & 0 & 1 \\end{bmatrix}$$ "
]
},
{
"cell_type": "markdown",
"id": "262b3313-301a-415c-a93d-38d453abb135",
"metadata": {},
"source": [
"A <b>sparse matrix</b> is a matrix in which a significant number of the elements are 0. Omics data sets often produce sparse matrices."
]
},
{
"cell_type": "markdown",
"id": "c4cb8935-ac11-49c3-a1b9-0e7cb4875dc1",
"metadata": {},
"source": [
"Transposing an <i>m</i> x <i>n</i> matrix results in an <i>n</i> x <i>m</i> matrix in which the matrix is \"flipped\" on its side."
]
},
{
"cell_type": "markdown",
"id": "79edeec9-3926-4cb1-afef-3677ee7eda29",
"metadata": {},
"source": [
"$$A = \\begin{bmatrix}1 & 4 & 7 & 2 & 6 \\\\ 2 & 7 & 2 & 2 & 8 \\\\ 4 & 9 & 7 & 7 & 1 \\end{bmatrix}$$\n",
"\n",
"$$A^T = \\begin{bmatrix}1 & 2 & 4 \\\\ 4 & 7 & 9 \\\\ 7 & 2 & 7 \\\\ 2 & 2 & 7 \\\\ 6 & 8 & 1 \\end{bmatrix}$$"
]
},
{
"cell_type": "markdown",
"id": "be43acf4-5044-409d-bd36-8515b5f2fb10",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
"<b>🎓 Note:</b> By convention, we use lower case letters to denote a vector and upper case letters to denote a matrix. The rows of an <i>m</i> x <i>n</i> matrix are <i>n</i>-vectors (<b>row vectors</b>) and the columns are <i>m</i>-vectors (<b>column vectors</b>).\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "a9c1dc50-f3e5-4c5c-8be2-f6bbb97cfd3a",
"metadata": {},
"source": [
"Matrices of the same size can be added together and a matrix can be multiplied by a scalar value."
]
},
{
"cell_type": "markdown",
"id": "4aefd0e5-7b43-410f-8261-7b3bfedeeb29",
"metadata": {},
"source": [
"$$A = \\begin{bmatrix}1 & 4 & 7 & 2 & 6 \\\\ 2 & 7 & 2 & 2 & 8 \\\\ 4 & 9 & 7 & 7 & 1 \\end{bmatrix}B = \\begin{bmatrix}4 & 6 & 4 & 1 & 0 \\\\ 0 & 1 & 0 & 3 & 1 \\\\ 2 & 5 & 3 & 4 & 1 \\end{bmatrix}$$\n",
"\n",
"$$A + B = \\begin{bmatrix}5 & 10 & 11 & 3 & 6 \\\\ 2 & 8 & 2 & 5 & 9 \\\\ 6 & 14 & 10 & 11 & 2 \\end{bmatrix}$$\n",
"\n",
"$$2 * A = \\begin{bmatrix}2 & 8 & 14 & 4 & 12 \\\\ 4 & 14 & 4 & 4 & 16 \\\\ 8 & 18 & 14 & 14 & 2 \\end{bmatrix}$$"
]
},
{
"cell_type": "markdown",
"id": "d4a67626-a04c-40ef-9a6e-0cc2fd2e72bd",
"metadata": {},
"source": [
"An <i>m</i> x <i>n</i> matrix can be multiplied by an <i>n</i>-vector resulting in an <i>m</i>-vector."
]
},
{
"cell_type": "markdown",
"id": "01a186b4-f0b2-4293-a83c-b2ebdf678645",
"metadata": {},
"source": [
"$$A = \\begin{bmatrix}a_{11} & a_{12} & \\cdots & a_{1n} \\\\ a_{21} & a_{22} & \\cdots & a_{2n} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ a_{m1} & a_{m2} & \\cdots & a_{mn}\\end{bmatrix} x = \\begin{bmatrix} x_1 \\\\ x_2 \\\\ \\vdots \\\\ x_n \\end{bmatrix}$$\n",
"\n",
"$$y = A * x = \\begin{bmatrix} x_1 * a_{11} + x_2 * a_{12} + \\cdots + x_n * a_{1n} \\\\ x_1 * a_{21} + x_2 * a_{22} + \\cdots + x_n * a_{2n} \\\\ \\vdots \\\\ x_1 * a_{m1} + x_2 * a_{m2} + \\cdots + x_n * a_{mn} \\end{bmatrix}$$"
]
},
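{
"cell_type": "markdown",
"id": "d5a9c3e4-b3f6-4c2d-8e7a-303030303001",
"metadata": {},
"source": [
"To make this concrete, here is a small sketch in R using the matrix A from above and an example 5-vector x of our own choosing. The %*% operator performs the matrix-vector product."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d5a9c3e4-b3f6-4c2d-8e7a-303030303002",
"metadata": {},
"outputs": [],
"source": [
"A <- rbind(c(1,4,7,2,6), c(2,7,2,2,8), c(4,9,7,7,1)) # the 3x5 matrix A from above\n",
"x <- c(1,0,2,0,1) # an example 5-vector\n",
"A %*% x # each element is the inner product of a row of A with x: 21, 14, 19"
]
},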
{
"cell_type": "markdown",
"id": "3af01a2e-5907-456c-ad58-b903cea9ef1f",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
"<b>🎓 Note:</b> Each element of the vector y is the inner product of the corresponding row vector from A and x.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "7d07b434-4c08-42d4-a6e1-05300a809cc9",
"metadata": {},
"source": [
"Finally, matrices can be multiplied by each other if their dimensions are compatible. An <i>m</i> x <i>p</i> matrix multiplied by a <i>p</i> x <i>n</i> matrix results in an <i>m</i> x <i>n</i> matrix as follows:"
]
},
{
"cell_type": "markdown",
"id": "af27030a-94b4-4098-a322-ee3a21ac0ef1",
"metadata": {},
"source": [
"$$A = \\begin{bmatrix}a_{11} & a_{12} & \\cdots & a_{1p} \\\\ a_{21} & a_{22} & \\cdots & a_{2p} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ a_{m1} & a_{m2} & \\cdots & a_{mp}\\end{bmatrix} B = \\begin{bmatrix}b_{11} & b_{12} & \\cdots & b_{1n} \\\\ b_{21} & b_{22} & \\cdots & b_{2n} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ b_{p1} & b_{p2} & \\cdots & b_{pn}\\end{bmatrix}$$\n",
"\n",
"$$A * B = \\begin{bmatrix} a_{11} * b_{11} + a_{12} * b_{21} + \\cdots + a_{1p} * b_{p1} & a_{11} * b_{12} + a_{12} * b_{22} + \\cdots + a_{1p} * b_{p2} & \\cdots & a_{11} * b_{1n} + a_{12} * b_{2n} + \\cdots + a_{1p} * b_{pn} \\\\ a_{21} * b_{11} + a_{22} * b_{21} + \\cdots + a_{2p} * b_{p1} & a_{21} * b_{12} + a_{22} * b_{22} + \\cdots + a_{2p} * b_{p2} & \\cdots & a_{21} * b_{1n} + a_{22} * b_{2n} + \\cdots + a_{2p} * b_{pn} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ a_{m1} * b_{11} + a_{m2} * b_{21} + \\cdots + a_{mp} * b_{p1} & a_{m1} * b_{12} + a_{m2} * b_{22} + \\cdots + a_{mp} * b_{p2} & \\cdots & a_{m1} * b_{1n} + a_{m2} * b_{2n} + \\cdots + a_{mp} * b_{pn} \\\\ \\end{bmatrix}$$\n"
]
},
{
"cell_type": "markdown",
"id": "1ee0afe0-bfaf-4448-af5a-6f1268bbd002",
"metadata": {},
"source": [
"$$A = \\begin{bmatrix}1 & 4 & 7 & 2 & 6 \\\\ 2 & 7 & 2 & 2 & 8 \\\\ 4 & 9 & 7 & 7 & 1 \\end{bmatrix} B = \\begin{bmatrix}0 & 0 & 0 & 0 & 1 \\\\ 0 & 0 & 0 & 1 & 0 \\\\ 0 & 0 & 1 & 0 & 0\\\\ 0 & 1 & 0 & 0 & 0 \\\\ 1 & 0 & 0 & 0 & 0 \\end{bmatrix}$$\n",
"\n",
"$$A * B = \\begin{bmatrix} 6 & 2 & 7 & 4 & 1\\\\ 8 & 2 & 2 & 7 & 2 \\\\ 1 & 7 & 7 & 9 & 4\\end{bmatrix}$$"
]
},
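{
"cell_type": "markdown",
"id": "e6b0d4f5-c4a7-4d3e-9f8b-404040404001",
"metadata": {},
"source": [
"We can verify this product in R. Note that B is a permutation matrix, which is why multiplying by it simply reverses the column order of A."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e6b0d4f5-c4a7-4d3e-9f8b-404040404002",
"metadata": {},
"outputs": [],
"source": [
"A <- rbind(c(1,4,7,2,6), c(2,7,2,2,8), c(4,9,7,7,1))\n",
"B <- rbind(c(0,0,0,0,1), c(0,0,0,1,0), c(0,0,1,0,0), c(0,1,0,0,0), c(1,0,0,0,0))\n",
"A %*% B # the columns of A in reverse order\n",
"t(A) # t() returns the transpose of A"
]
},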
{
"cell_type": "markdown",
"id": "dad45c0a-3f92-4b0c-8e05-986d87a58b0d",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
"<b>🎓 Note:</b> Each element of the matrix A * B is the inner product of the corresponding row vector from A and column vector from B.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "19dacf4b-cc36-4392-94e0-dca5fe30d267",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-info\">\n",
"<b>✋ Tip:</b> Did you see what we did there with B?</div>"
]
},
{
"cell_type": "markdown",
"id": "dc89f325-fcc0-400a-b39d-6e47adc938fa",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "fe3d2641-d2b0-4d6b-b44d-1e936e976634",
"metadata": {},
"source": [
"### Principles of Linear Models"
]
},
{
"cell_type": "markdown",
"id": "4641480a-7c77-4122-9bc8-d8a5101921d9",
"metadata": {},
"source": [
"If you think back to high school algebra, you might remember the standard equation of a line."
]
},
{
"cell_type": "markdown",
"id": "ec2bb8ba-316c-480f-8815-84db19563e8c",
"metadata": {},
"source": [
"$$y = bx + a$$"
]
},
{
"cell_type": "markdown",
"id": "5bce4007-17f2-4467-81a3-ea1543bf3f3a",
"metadata": {},
"source": [
"In this equation, y is the <b>dependent variable</b> (also called the <b>response variable</b>) and x is the <b>independent variable</b> (also called the <b>explanatory variable</b> or <b>covariate</b>). In other words, the value of y depends on the value of x. In practical terms, x would usually be a measurement of some kind with y calculated from that measurement. If we were to plot x vs y, the value a would be the <b>intercept</b> (i.e. the value of y when x = 0) and b the <b>slope</b> (i.e. the change in y associated with a one-unit change in x). Let's plot an example line."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9c6e0797-db51-472f-905e-d07f33526b7d",
"metadata": {},
"outputs": [],
"source": [
"x <- 0:10\n",
"y <- 2 * x + 3\n",
"data.frame(x = x, y = y) %>%\n",
" ggplot(aes(x = x, y = y)) + # sets basic scatter plot parameters\n",
" geom_point(size = 5, color = \"blue\") + # plots points\n",
" geom_line() + # plots line\n",
" coord_fixed() + # sets axis to equal units\n",
" theme_bw() + # sets theme to black and white\n",
" theme(text=element_text(size = 24)) # increase font size\n",
"\n",
"# similar to the code above, except we're plotting both points (geom_point) and a line (geom_line) on the same graph.\n",
"# we fix the coordinates so that the axes are equally spaced"
]
},
{
"cell_type": "markdown",
"id": "a7128e9c-4be8-4f3a-80fb-968194a887db",
"metadata": {},
"source": [
"The line intersects the y axis at y = 3, and y increases by 2 units for every one-unit increase in x."
]
},
{
"cell_type": "markdown",
"id": "f8f16dd6-8057-451c-95ec-7f7de067af1f",
"metadata": {},
"source": [
"When dealing with multiple independent variables, we often write linear equations in the following form:"
]
},
{
"cell_type": "markdown",
"id": "cfc0bf6e-ce13-4174-848f-c586d092995b",
"metadata": {},
"source": [
"$$ax + by + cz = d$$"
]
},
{
"cell_type": "markdown",
"id": "47cca4f9-88a9-4c5b-a117-42931d18cf87",
"metadata": {},
"source": [
"In this form, x, y, and z are the variables and a, b, and c are their coefficients (d is a constant). We call this form a <b>linear combination</b>. You might recognize this form from algebra because it is the form we use when solving systems of linear equations."
]
},
{
"cell_type": "markdown",
"id": "876d8e1c-170c-4eb4-a5a4-51f3b946afd0",
"metadata": {},
"source": [
"$$ 3x - y + z = 7$$\n",
"$$ x + 2y -2z = -7$$\n",
"$$ x + y + z = 3$$\n",
"$$ x = ? \\; y = ? \\;z = ?$$"
]
},
{
"cell_type": "markdown",
"id": "4dcdb2ca-554f-492a-8aae-55f403a54d9d",
"metadata": {},
"source": [
"A property of linear equations is that we can transform them into other linear equations by multiplying them by a scalar value, adding values to both sides of the equation, or by combining two equations. In this way, we can isolate the individual variables and, if a unique solution exists, find the values. See if you can solve the above equations and then use the following R code to check your answers."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1e9d4cad-b7b7-400a-88c0-a05a86e1b82c",
"metadata": {},
"outputs": [],
"source": [
"A <- rbind(c(3,-1,1), c(1,2,-2), c(1,1,1))\n",
"# We weave the three vectors together into a matrix using rbind (i.e. row bind); cbind would bind by column\n",
"B <- c(7,-7,3)\n",
"# Solve requires a matrix of coefficients and a vector of constants, so we make A and B separate\n",
"solve(A,B)\n",
"# Solve solves the system of linear equations"
]
},
{
"cell_type": "markdown",
"id": "b38f7b8e-f350-4a52-bc80-d1087766ecc9",
"metadata": {},
"source": [
"You might notice from the R code that we're inputting the data as a matrix (via <b>rbind</b>) and a vector before sending it to <b>solve</b>. The matrix notation is very useful for understanding what's happening."
]
},
{
"cell_type": "markdown",
"id": "afc794cd-d615-407c-9b43-5260a3567193",
"metadata": {},
"source": [
"$$\\begin{bmatrix}3 & -1 & 1 & 7\\\\ 1 & 2 & -2 & -7\\\\ 1 & 1 & 1 & 3\\end{bmatrix}$$"
]
},
{
"cell_type": "markdown",
"id": "6c858690-7a0b-4d20-900a-5ac58579d2d1",
"metadata": {},
"source": [
"Each row of the matrix represents one of our equations. The first three columns represent each of our variables, and the fourth column represents our constants. Using the same rules for transforming linear equations as before, we can rewrite the matrix like this:"
]
},
{
"cell_type": "markdown",
"id": "7f814685-118f-41a1-8b19-31d82d57cb06",
"metadata": {},
"source": [
"$$\\begin{bmatrix}1 & 0 & 0 & 1\\\\ 0 & 1 & 0 & -1\\\\ 0 & 0 & 1 & 3\\end{bmatrix}$$"
]
},
{
"cell_type": "markdown",
"id": "00d89484-4cc6-4671-a05e-a0bca9942188",
"metadata": {},
"source": [
"Each of our first three columns are now unit vectors isolating an individual variable. The fourth column provides the solution to those variables."
]
},
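{
"cell_type": "markdown",
"id": "3c1f09a2-5b7e-4c2d-9a41-1f2e3d4c5b6a",
"metadata": {},
"source": [
"As a quick check, we can build the augmented matrix in R and confirm that <b>solve</b> recovers the values in its fourth column. This is a minimal sketch using only base R.\n",
"\n",
"```r\n",
"A <- rbind(c(3,-1,1), c(1,2,-2), c(1,1,1))\n",
"B <- c(7,-7,3)\n",
"cbind(A, B)   # the augmented matrix [A | B]\n",
"solve(A, B)   # the reduced fourth column: x = 1, y = -1, z = 3\n",
"```"
]
},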
{
"cell_type": "markdown",
"id": "bf8efa3a-bc21-4065-8d53-38a3cb7a5617",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "f5c965d4-166c-4a17-9e43-b1f0d7105353",
"metadata": {},
"source": [
"### The Generalized Linear Model"
]
},
{
"cell_type": "markdown",
"id": "8ba3cbe4-f7dd-4000-8a08-7d3b499f25c3",
"metadata": {},
"source": [
"When studying these concepts in algebra, we often focus on idealized examples that provide unique solutions. For example, to find an exact solution for a system of linear equations, we need a number of equations equal to the number of variables. In practice, we rarely have such clean data sets. Measured data points rarely fall on a straight line and instead include a great deal of experimental and random noise. It's also very common to have more observations (equations) than unknowns, so that no exact solution exists. In these cases, we have to approximate solutions using <b>linear models</b>. A linear model describes a dependent variable as a linear combination of one or more independent variables. We'll keep things simple in this submodule and assume a single dependent variable."
]
},
{
"cell_type": "markdown",
"id": "207a2464-4104-43d6-8605-38c23de36cb9",
"metadata": {},
"source": [
"The Generalized Linear Model (GLM) is a statistical framework for describing a variety of common models including linear regression, analysis of variance (ANOVA), the <i>t</i>-test, logistic regression, linear mixed models, and many more. The different models under the umbrella of the GLM are defined by three attributes:\n",
"\n",
"- The probability distribution function of the response variable <i><b>$Y$</b></i>\n",
"- The linear predictor <i><b>$\\eta = X\\beta$</b></i>\n",
"- The <b>link function</b> relating $\\eta$ to the mean $\\mu$ of the distribution function\n",
"\n",
"For the (unfortunately named) <b>general linear model</b> that includes linear regression, ANOVA, and the <i>t</i>-test, the link function is the identity <i><b>$\\mu = X\\beta$</b></i> and we assume a normal distribution of $Y$, which gives us the general form:"
]
},
{
"cell_type": "markdown",
"id": "909afad4-9477-4ace-9204-1f9d5d7898e2",
"metadata": {},
"source": [
"$$Y = XB + U$$\n",
"\n",
"where\n",
"\n",
"Y = Matrix of dependent variables with rows as features and columns as samples (often called a <b>count matrix</b>)<br>\n",
"X = Matrix of independent variables (called a <b>design matrix</b> when using categorical data)<br>\n",
"B = Estimated coefficients<br>\n",
"U = Matrix of errors"
]
},
{
"cell_type": "markdown",
"id": "01aae267-757a-40f8-9f7e-94824fb88071",
"metadata": {},
"source": [
"<img src=\"images/GLM.png\" width=\"600\" height=\"600\">"
]
},
{
"cell_type": "markdown",
"id": "aad59f9d-9810-42b2-857e-a75e393f0d14",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "db402013-73cb-4a0b-a8d2-f0cf245e5bc3",
"metadata": {},
"source": [
"### Linear Regression"
]
},
{
"cell_type": "markdown",
"id": "2e60e31a-937c-449d-bd58-28c761c9c855",
"metadata": {},
"source": [
"<b>Linear regression</b>, <b>Analysis of Variance (ANOVA)</b> and <b><i>t</i>-test</b> are all special cases of the general linear model.\n",
"\n",
"- Linear regression models the relationship between one dependent variable and one or more independent variables. <b>Simple linear regression</b> is the case of one independent variable, <b>multiple linear regression</b> the case of multiple independent variables. The dependent variable is a continuous variable; the independent variables are either continuous or categorical.\n",
"\n",
"- ANOVA is a special case of linear regression where the independent variable is categorical representing two or more populations. ANOVA is designed to determine if the means of the populations significantly differ.\n",
"\n",
"- <i>t</i>-test is the special case of ANOVA with only two populations."
]
},
{
"cell_type": "markdown",
"id": "4c0337e1-6e31-4a6e-9fee-63d4365f04ef",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
" <b>🎓 Note:</b> <b>Multivariate linear regression</b> is the model used when considering multiple dependent variables.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "af3f2579-70a8-4565-87d4-51647f4d2997",
"metadata": {},
"source": [
"<img src=\"images/Regression-ANOVA-ttest.png\" width=\"600\" height=\"600\">"
]
},
{
"cell_type": "markdown",
"id": "04c04ae2-31f9-4511-8908-82d9b2ca7116",
"metadata": {},
"source": [
"The rationale behind regression models is as follows. Consider a dataset modeled with two variables. "
]
},
{
"cell_type": "markdown",
"id": "78be710f-137b-4b5a-89ef-f42a1757b4b7",
"metadata": {},
"source": [
"<img src=\"images/Regression 1.png\" width=\"400\" height=\"400\">"
]
},
{
"cell_type": "markdown",
"id": "6f6df3ea-f196-426f-92bf-2c228281e5ce",
"metadata": {},
"source": [
"We can see from the scatterplot that the data points roughly follow a line. However, they do not exactly fit a line because of the introduction of noise. The source of the noise could be some property of the data or it could just be random variability. We can pick any arbitrary line to fit the data, but how do we know which line is best?"
]
},
{
"cell_type": "markdown",
"id": "80edf372-1dd2-40df-b148-1f9981040783",
"metadata": {},
"source": [
"<img src=\"images/Regression 2.png\" width=\"400\" height=\"400\">"
]
},
{
"cell_type": "markdown",
"id": "c849fc0c-ec32-4360-ae9b-5b6d458381d1",
"metadata": {},
"source": [
"Let's start by looking at our regression model."
]
},
{
"cell_type": "markdown",
"id": "03a41456-ef47-422c-b22d-aaed63352e76",
"metadata": {},
"source": [
"$$y_i = x_i^T\\beta + \\epsilon_i$$\n",
"\n",
"where\n",
"\n",
"$$y = \\begin{bmatrix}y_1 \\\\ y_2 \\\\ \\vdots \\\\ y_n\\end{bmatrix}$$\n",
"\n",
"$$X = \\begin{bmatrix}x_1^T \\\\ x_2^T \\\\ \\vdots \\\\ x_n^T\\end{bmatrix} = \\begin{bmatrix}x_{11} & x_{12} & \\cdots & x_{1p} \\\\ x_{21} & x_{22} & \\cdots & x_{2p} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ x_{n1} & x_{n2} & \\cdots & x_{np}\\end{bmatrix}$$\n",
"\n",
"$$\\beta = \\begin{bmatrix}\\beta_1 \\\\ \\beta_2 \\\\ \\vdots \\\\ \\beta_p\\end{bmatrix}$$\n",
"\n",
"$$\\epsilon = \\begin{bmatrix}\\epsilon_1 \\\\ \\epsilon_2 \\\\ \\vdots \\\\ \\epsilon_n\\end{bmatrix}$$"
]
},
{
"cell_type": "markdown",
"id": "ff83346e-8d7b-4a6a-9757-096e954aa1df",
"metadata": {},
"source": [
"Let's look at the terms in detail. __y__ is our dependent variable. In omics, this is typically a count matrix of features (rows) and samples (columns), with the features being genes, transcripts, proteins, or whatever your particular omics experiment is measuring. __X__ is the matrix (also called the <b>design matrix</b>) of our independent variables (we'll call them covariates from this point forward) with each column vector representing one of our covariates. For categorical covariates, these values are replaced with dummy variables (0 or 1). We'll look at the design matrix in more detail below. $\\beta$ is the matrix of our coefficients that are calculated for each covariate. $\\epsilon$ is the matrix of our errors. "
]
},
{
"cell_type": "markdown",
"id": "9e9139a1-f58e-4276-af78-8a386a107dfd",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
"<b>🎓 Note:</b> If you think the term $x_i^T\\beta$ looks like an inner product, you're right: each predicted value is the inner product of a row of the design matrix with the coefficient vector.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "974034b3-45c8-4b1d-93ce-de676d2805c0",
"metadata": {},
"source": [
"Now we need a way to evaluate how well a given line fits our data. We do this using the <b>method of least squares</b>."
]
},
{
"cell_type": "markdown",
"id": "c7d0132c-adfb-451d-b22f-1f36a58fb93a",
"metadata": {},
"source": [
"<img src=\"images/Regression 3.png\" width=\"400\" height=\"400\">"
]
},
{
"cell_type": "markdown",
"id": "d7bf5473-0a15-440f-b1fd-47f2a6608d03",
"metadata": {},
"source": [
"For each data point, we calculate the y-axis offset from the predicted line, square the results (to remove negative values), and then add them together to get the <b>sum of squares</b>. The line that gives us the minimal sum of squares is our best fit line and ultimately our values of $\\beta$."
]
},
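{
"cell_type": "markdown",
"id": "8e2d41b7-9c3a-4f5e-b6d7-2a3b4c5d6e7f",
"metadata": {},
"source": [
"To make the least-squares idea concrete, here is a small sketch on simulated data (the numbers below are made up for illustration): we compute $\\hat{\\beta} = (X^TX)^{-1}X^Ty$ directly from the normal equations and confirm that <b>lm</b> returns the same estimates.\n",
"\n",
"```r\n",
"set.seed(1)\n",
"x <- 1:20\n",
"y <- 2 + 0.5 * x + rnorm(20)               # simulated data around a known line\n",
"X <- cbind(1, x)                           # design matrix: intercept column plus x\n",
"beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # solve the normal equations\n",
"beta_hat\n",
"coef(lm(y ~ x))                            # lm minimizes the same sum of squares\n",
"```"
]
},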
{
"cell_type": "markdown",
"id": "aabb4958-0a9b-4fa2-9f2c-adefc534b09c",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
"<b>🎓 Note:</b> Without going into the math, our regression model provides a function that allows us to do this calculation easily. Basically, we're looking for the case where the derivative of that function equals 0, and from that function we calculate our values of $\\beta$.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "9abf67c4-0754-4090-a0d1-0cc078e4772a",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "c37d4f8b-d5c5-4ba4-8096-02b8b08e0477",
"metadata": {},
"source": [
"### The Design Matrix"
]
},
{
"cell_type": "markdown",
"id": "201772c7-85e8-4d59-ba5a-beda91d8298f",
"metadata": {},
"source": [
"Let's consider the following experiment that has one covariate with four levels."
]
},
{
"cell_type": "markdown",
"id": "0b1cc513-39c1-43b3-803d-d38bb89f49f1",
"metadata": {},
"source": [
"$$\n",
"\\begin{aligned}\n",
"& \\text {Experiment 1. Single treatment vs. control, sham and placebo }\\\\\n",
"&\\begin{array}{cccc}\n",
"\\hline \\hline \\text { Subject } & \\text { Treatment State } \\\\\n",
"\\hline 1 & Control \\\\\n",
"2 & Control \\\\\n",
"3 & Sham \\\\\n",
"4 & Sham \\\\\n",
"5 & Placebo \\\\\n",
"6 & Placebo \\\\\n",
"7 & Treatment \\\\\n",
"8 & Treatment \\\\\n",
"\\hline\n",
"\\end{array}\n",
"\\end{aligned}\n",
"$$"
]
},
{
"cell_type": "markdown",
"id": "59337d9d-504c-4d26-8362-5445df22b90b",
"metadata": {},
"source": [
"Our model will look like this:"
]
},
{
"cell_type": "markdown",
"id": "955ae695-47ba-4e0c-9775-bdd2df52aecb",
"metadata": {},
"source": [
"$$y = \\beta_0 + \\beta_1 x_{Sham} + \\beta_2 x_{Placebo} + \\beta_3 x_{Treatment} + \\epsilon$$"
]
},
{
"cell_type": "markdown",
"id": "860da6cf-0f10-4e5d-b59c-6ff20db9b55f",
"metadata": {},
"source": [
"Since we can't put the category names into our regression model, we replace them with dummy variables of 0's and 1's and our design matrix looks like this:"
]
},
{
"cell_type": "markdown",
"id": "94fd351e-770f-4a34-80fa-da12270dfdf1",
"metadata": {},
"source": [
"$$X = \\begin{bmatrix}1 & 0 & 0 & 0 \\\\ 1 & 0 & 0 & 0 \\\\ 1 & 1 & 0 & 0 \\\\ 1 & 1 & 0 & 0 \\\\ 1 & 0 & 1 & 0 \\\\ 1 & 0 & 1 & 0 \\\\ 1 & 0 & 0 & 1 \\\\ 1 & 0 & 0 & 1 \\end{bmatrix}$$"
]
},
{
"cell_type": "markdown",
"id": "4d6045ef-7b94-46ea-a366-4d7944976bea",
"metadata": {},
"source": [
"The four columns represent the intercept and our covariate levels in the order shown above. The first column (Control) is set as the intercept and is all 1's. If you look across the third row (representing our first Sham sample), the 1's tell the model that this sample's expected value is $\\beta_0 + \\beta_1$. When we run this model, we calculate our four $\\beta$'s. $\\beta_0$ represents the <b>group mean</b> for the control samples. The other $\\beta$'s are the offsets of each level from the Control mean. So the group mean of Treatment would be $\\beta_0 + \\beta_3$."
]
},
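{
"cell_type": "markdown",
"id": "5a6b7c8d-1e2f-4a3b-8c9d-3e4f5a6b7c8d",
"metadata": {},
"source": [
"We can verify this interpretation with a small simulated version of Experiment 1 (the values below are made up): the intercept returned by <b>lm</b> matches the Control group mean, and the other coefficients are offsets from it.\n",
"\n",
"```r\n",
"set.seed(42)\n",
"d <- data.frame(\n",
"  group = factor(rep(c(\"Control\", \"Sham\", \"Placebo\", \"Treatment\"), each = 2),\n",
"                 levels = c(\"Control\", \"Sham\", \"Placebo\", \"Treatment\")),\n",
"  y = rnorm(8, mean = rep(c(10, 11, 10.5, 14), each = 2))\n",
")\n",
"fit <- lm(y ~ group, data = d)\n",
"coef(fit)                  # intercept = Control group mean; the rest are offsets\n",
"tapply(d$y, d$group, mean) # compare against the raw group means\n",
"```"
]
},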
{
"cell_type": "markdown",
"id": "bbd49fee-a73f-466d-a45a-80842b050c01",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
"<b>🎓 Note:</b> You might sometimes see $\\beta_0$ called $\\mu$ and the other $\\beta$'s as $\\tau$.\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "e2a4631f-3756-4620-b58b-d1e6e3381206",
"metadata": {},
"source": [
"A more common model used in omics data analysis (and in ANOVA) is the <b>cell means model</b>. In this case, we set the intercept to zero and each $\\beta$ is the group mean of the respective covariate. The design matrix in this case would look like this:"
]
},
{
"cell_type": "markdown",
"id": "86eb2a4d-1440-4a69-941d-155b0f9becca",
"metadata": {},
"source": [
"$$X = \\begin{bmatrix}1 & 0 & 0 & 0 \\\\ 1 & 0 & 0 & 0 \\\\ 0 & 1 & 0 & 0 \\\\ 0 & 1 & 0 & 0 \\\\ 0 & 0 & 1 & 0 \\\\ 0 & 0 & 1 & 0 \\\\ 0 & 0 & 0 & 1 \\\\ 0 & 0 & 0 & 1 \\end{bmatrix}$$"
]
},
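{
"cell_type": "markdown",
"id": "9f8e7d6c-5b4a-4c3d-a2b1-4c5d6e7f8a9b",
"metadata": {},
"source": [
"R builds these design matrices automatically with <b>model.matrix</b>; a quick sketch reproduces both codings above for a treatment factor.\n",
"\n",
"```r\n",
"treat <- factor(rep(c(\"Control\", \"Sham\", \"Placebo\", \"Treatment\"), each = 2),\n",
"                levels = c(\"Control\", \"Sham\", \"Placebo\", \"Treatment\"))\n",
"model.matrix(~ treat)     # intercept coding: the first matrix above\n",
"model.matrix(~ 0 + treat) # cell means coding: the second matrix above\n",
"```"
]
},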
{
"cell_type": "markdown",
"id": "75a764bc-2797-459a-bdc6-ce7490c45fd0",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "65e47a64-b7b3-43c6-879a-5bc3e92dbd9d",
"metadata": {},
"source": [
"### Example"
]
},
{
"cell_type": "markdown",
"id": "024c1e14-397f-4714-827c-e43f87eb7323",
"metadata": {},
"source": [
"R has many built-in datasets that are useful for testing functions. Let's look at the classic <b>iris</b> dataset. This dataset describes three species of iris using four measurements: Sepal Length, Sepal Width, Petal Length and Petal Width."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "45bea49c-5ca0-4d0c-b4e8-b94afa785fe4",
"metadata": {},
"outputs": [],
"source": [
"iris #iris is a built-in R data set and can be referenced by name from anywhere\n",
"summary(iris) # calculates basic summary statistics (counts, mean, median, etc.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eefb7f34-2964-48d6-9b7f-e313eb0ac479",
"metadata": {},
"outputs": [],
"source": [
"iris_scatter <- iris %>%\n",
" ggplot(aes(x = Sepal.Length, y = Petal.Length)) +\n",
" geom_point(size = 3, aes(color = Species, shape = Species)) +\n",
" xlab(\"Sepal Length\") +\n",
" ylab(\"Petal Length\") +\n",
" theme_bw() +\n",
" theme(text=element_text(size = 24))\n",
"\n",
"# basic scatter plot (geom_point) that sets the colors and shape of the data points by Species\n",
"\n",
"iris_scatter + geom_smooth(method=\"lm\")\n",
"\n",
"# adds trend line (geom_smooth) based on linear regression (method=\"lm\") using the default formula y ~ x"
]
},
{
"cell_type": "markdown",
"id": "aaaa6015-49e0-45b4-b07e-ffb7af9bf0f0",
"metadata": {},
"source": [
"We're going to run a simple regression model comparing Petal Length to Sepal Length, both of which are continuous variables. We'll use the <b>lm</b> function of R which is used for basic linear regression models. __lm__ takes a formula of the form <i>Response Variable ~ Covariates</i>. The <b>data</b> argument indicates that __lm__ should look for the Petal.Length and Sepal.Length columns in iris."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ba6d2900-5dea-42c2-913e-27229088b8d2",
"metadata": {},
"outputs": [],
"source": [
"fit1 <- lm(Petal.Length ~ Sepal.Length, data = iris)\n",
"summary(fit1)\n",
"\n",
"# lm carries out basic linear regression using a formula (e.g. Petal.Length ~ Sepal.Length)\n",
"# The formula is of the form y ~ x, where y is the response variable and x is the covariate\n",
"# summary summarizes the regression data"
]
},
{
"cell_type": "markdown",
"id": "16903dde-7449-43a8-b58a-27331eb4c3c7",
"metadata": {},
"source": [
"The coefficient returned for Sepal.Length suggests that for every 1 unit increase in Sepal.Length, there is a 1.858 increase in Petal.Length. This is consistent with what we see in the scatterplot. However, we can also see from the scatterplot that the relationship between the two variables differs between species. versicolor and virginica show strong positive trends, but setosa does not. We can also see this in the <b>Adjusted R-squared</b> value (0.76), which indicates how much of the variance in the response the model explains (1.0 being a perfect fit). Let's extend our model to include species."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4356b620-725c-4e17-b3ff-1511cc8c1d30",
"metadata": {},
"outputs": [],
"source": [
"iris_scatter <- iris %>%\n",
" ggplot(aes(x = Sepal.Length, y = Petal.Length)) +\n",
" geom_point(size = 3, aes(color = Species, shape = Species)) +\n",
" xlab(\"Sepal Length\") +\n",
" ylab(\"Petal Length\") +\n",
" geom_smooth(method=\"lm\", aes(fill=Species)) +\n",
" theme_bw() +\n",
" theme(text=element_text(size = 24))\n",
"iris_scatter\n",
"\n",
"# We modify geom_smooth with an aesthetic (aes(fill=Species)) to fit each Species separately"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "21521504-0018-4649-b82a-5216f450cf10",
"metadata": {},
"outputs": [],
"source": [
"fit2 <- lm(Petal.Length ~ Sepal.Length + Species, data = iris)\n",
"summary(fit2)\n",
"\n",
"# We modify the formula to the form y ~ x1 + x2 to indicate the second covariate assuming no interaction effects"
]
},
{
"cell_type": "markdown",
"id": "dc363f12-9ef4-470e-be56-153aef12bef2",
"metadata": {},
"source": [
"We've added the categorical variable <b>Species</b> to our model and now the coefficient has been split into coefficients describing the effects of Sepal.Length and Species. How do we interpret this model? The Sepal.Length coefficient says that for every 1 unit increase in Sepal.Length, Petal.Length increases by 0.63 ignoring species. For species, versicolor and virginica show a 2.2 and 3.1 increase (respectively), compared to setosa assuming Sepal.Length is held constant. The Adjusted R-squared of our model has improved to 0.97."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4fb4c77b-b533-4cb4-b368-51f81e04ad03",
"metadata": {},
"outputs": [],
"source": [
"fit3 <- lm(Petal.Length ~ 0 + Sepal.Length + Species, data = iris)\n",
"summary(fit3)\n",
"\n",
"# Modifying the formula to the form y ~ 0 + x1 + x2 forces the intercept through the origin"
]
},
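{
"cell_type": "markdown",
"id": "2b3c4d5e-6f7a-4b8c-9d0e-5f6a7b8c9d0e",
"metadata": {},
"source": [
"fit3 is the cell means version of fit2: with the intercept removed, each Species coefficient becomes that species' own intercept. A quick sketch (refitting both models so it stands alone) confirms the two parameterizations describe the same fitted model.\n",
"\n",
"```r\n",
"fit2 <- lm(Petal.Length ~ Sepal.Length + Species, data = iris)\n",
"fit3 <- lm(Petal.Length ~ 0 + Sepal.Length + Species, data = iris)\n",
"coef(fit2)\n",
"coef(fit3)\n",
"# fit3's versicolor intercept equals fit2's intercept plus its versicolor offset\n",
"coef(fit2)[\"(Intercept)\"] + coef(fit2)[\"Speciesversicolor\"]\n",
"```"
]
},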
{
"cell_type": "markdown",
"id": "8ef0ecf7-6119-40ba-b01e-77e9da6c6880",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "1f52019b-77ba-4ce2-9a7b-495fd616fbff",
"metadata": {},
"source": [
"### Logistic Regression"
]
},
{
"cell_type": "markdown",
"id": "d597f048-7786-4cea-8306-018de080ca16",
"metadata": {},
"source": [
"There are circumstances where linear models do not fit the data well. Consider a case where we have a binary variable such as Control vs. Treatment, True vs. False, Healthy vs. Disease, etc. Such variables can be thought of as <b>binary classification schemes</b>. In other words, can we accurately classify the data into the correct category? If we try to fit a straight line to these data, the answer is no. To demonstrate this, let's look at R's <b>mtcars</b> data which gives information about a variety of car models. We will specifically look at the effects of transmission type (<b>am</b>, 0 = automatic, 1 = manual) on mileage (<b>mpg</b>)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e1c76dfa-443e-4d83-add0-0e3684732334",
"metadata": {},
"outputs": [],
"source": [
"mtcars # like iris, mtcars is a built-in R data set and can be referenced by name from anywhere"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9001e6ae-57d2-409b-a42c-296c4ac21ded",
"metadata": {},
"outputs": [],
"source": [
"mtcars %>%\n",
" ggplot(aes(x = mpg, y = am)) +\n",
" geom_point() +\n",
" geom_smooth(method = \"lm\") +\n",
" labs(\n",
" title = \"Linear Regression\", \n",
" x = \"mpg\",\n",
" y = \"Transmission (0 = automatic, 1 = manual)\"\n",
" )\n",
"\n",
"# Nothing new here except changing the input data frame and the x and y values"
]
},
{
"cell_type": "markdown",
"id": "60c06ada-7dbc-4446-87d9-bc8f1cf029cf",
"metadata": {},
"source": [
"A better model is <b>logistic regression</b>. Logistic regression falls under the umbrella of the generalized linear model but differs in two regards:\n",
"\n",
"- The distribution is binomial instead of normal\n",
"- The link function is the logit function instead of the identity function\n",
"\n",
"The logit link maps probabilities to log odds; its inverse, the <b>logistic function</b>, is a sigmoidal curve calculated as follows:"
]
},
{
"cell_type": "markdown",
"id": "b82548b1-fbc0-424c-92c1-145fadf0f1ba",
"metadata": {},
"source": [
"$$p(x) = \\frac{1}{1 + e^{-(\\beta_0 + \\beta_1 x)}}$$\n",
"\n",
"where\n",
"\n",
"$p(x)$ = Probability of success; its <b>log odds</b>, $\\ln\\frac{p(x)}{1-p(x)}$, equal the linear predictor $\\beta_0 + \\beta_1 x$\n",
"\n",
"$\\beta_0$ = The intercept, representing the log odds when $x = 0$\n",
"\n",
"$\\beta_1$ = Coefficient describing how the log odds change for each unit increase in $x$"
]
},
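{
"cell_type": "markdown",
"id": "7c8d9e0f-1a2b-4c3d-8e9f-6a7b8c9d0e1f",
"metadata": {},
"source": [
"The link and its inverse are easy to sketch directly in R; the coefficients below are arbitrary values chosen for illustration.\n",
"\n",
"```r\n",
"logistic <- function(eta) 1 / (1 + exp(-eta)) # inverse link: linear predictor -> probability\n",
"logit <- function(p) log(p / (1 - p))         # link: probability -> log odds\n",
"b0 <- -6; b1 <- 0.3                           # hypothetical coefficients\n",
"p <- logistic(b0 + b1 * 25)                   # probability of success at x = 25\n",
"p\n",
"logit(p)                                      # recovers the linear predictor b0 + b1 * 25\n",
"```"
]
},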
{
"cell_type": "markdown",
"id": "dc8d9162-e4f1-4264-92fc-88da57b753ef",
"metadata": {},
"source": [
"If we now model our data using the logit function we get:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7ab5fba2-879c-48a8-8647-72634c801010",
"metadata": {},
"outputs": [],
"source": [
"mtcars %>%\n",
" ggplot(aes(x = mpg, y = am)) +\n",
" geom_point() +\n",
" geom_smooth(method = \"glm\", method.args = list(family = \"binomial\")) +\n",
" labs(\n",
" title = \"Logistic Regression\", \n",
" x = \"mpg\",\n",
" y = \"Transmission (0 = automatic, 1 = manual)\"\n",
" )\n",
"\n",
"# To get the logistic curve, we change the geom_smooth method argument to \"glm\" and use the method.args argument to set the distribution to \"binomial\"\n",
"# y is our binomial variable of success or failure (e.g. 1 or 0). In this case, \"success\" means manual transmission, \"failure\" means automatic transmission."
]
},
{
"cell_type": "markdown",
"id": "1ad24d13-495f-485f-92fc-b11adc98b83f",
"metadata": {},
"source": [
"Before we perform a formal logistic regression, let's look at this plot. For any given value of x, imagine a vertical line that divides the data into automatic (0) vs. manual (1). At x = 0, all data points are classified as manual. At x = $\\infty$, all data points are classified as automatic. At any point in between, some proportion of the data points will be classified as manual and the rest as automatic. The numbers of correctly classified points (true positives and true negatives) and incorrectly classified points (false positives and false negatives) change with the value of x. At some value of x, we maximize the number of true positives while minimizing the number of false positives. Alternatively, at a given probability threshold, all data points that yield a probability above that threshold are considered successes (i.e. manual transmission). We can see from our graph that the inflection point p = 0.5 corresponds to a fuel efficiency of ~21-22 mpg.\n",
"\n",
"Let's now run the logistic regression in R using the <b>glm</b> function."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b75f1bbd-41c5-42ae-9b4f-ce701587d5de",
"metadata": {},
"outputs": [],
"source": [
"mtcars_logit <- glm(am ~ mpg, data = mtcars, family = \"binomial\")\n",
"summary(mtcars_logit)\n",
"exp(coef(mtcars_logit))\n",
"\n",
"# We use the glm package to perform logistic regression using a formula y ~ x and setting the distribution to \"binomial\"\n",
"# coef extracts the coefficients from the regression model and exp exponentiates (i.e. e^x) the value to give us the odds ratio"
]
},
{
"cell_type": "markdown",
"id": "5c052d08-67ba-48f2-8c63-93f52be1683b",
"metadata": {},
"source": [
"In linear regression, the coefficient represents a linear increase in _y_ for a 1 unit increase in _x_. For logistic regression, the coefficient represents an increase in the log odds for a 1 unit increase in _x_. So a 1 mpg increase in fuel efficiency represents a 0.3 increase in the log odds that the car in question has a manual transmission. In other words, increasing the fuel efficiency by 1 mpg increases the odds that the car has a manual transmission by 1.35."
]
},
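{
"cell_type": "markdown",
"id": "4d5e6f7a-8b9c-4d1e-a2f3-7b8c9d0e1f2a",
"metadata": {},
"source": [
"We can translate the fitted model back to the probability scale with <b>predict</b>; the mpg values below are arbitrary examples, and the model is refit here so the sketch is self-contained.\n",
"\n",
"```r\n",
"mtcars_logit <- glm(am ~ mpg, data = mtcars, family = \"binomial\")\n",
"# predicted probability of a manual transmission at a few example mpg values\n",
"predict(mtcars_logit, newdata = data.frame(mpg = c(15, 21, 30)), type = \"response\")\n",
"```"
]
},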
{
"cell_type": "markdown",
"id": "efc105b5-d4f1-4c36-b260-eba3faf62815",
"metadata": {},
"source": [
"How do we evaluate the quality of the logistic regression? Remember that at different points on the graph, some proportion of the points will be correctly classified and the rest will not. We can break these values down into true positives (correctly classified as manual), true negatives (correctly classified as automatic), false positives (incorrectly classified as manual) and false negatives (incorrectly classified as automatic). We can further define the concepts of <b>specificity</b> and <b>sensitivity</b>."
]
},
{
"cell_type": "markdown",
"id": "6f369d10-c28b-4e98-a01e-9de001ff2621",
"metadata": {},
"source": [
"$$\\text{sensitivity} = \\text{true positive rate} = \\frac{\\text{true positives}}{\\text{total positives}}$$\n",
"\n",
"$$\\text{specificity} = \\text{true negative rate} = \\frac{\\text{true negatives}}{\\text{total negatives}}$$\n",
"\n",
"$$\\text{false positive rate} = 1 - \\text{specificity}$$"
]
},
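{
"cell_type": "markdown",
"id": "0e1f2a3b-4c5d-4e6f-b7a8-8c9d0e1f2a3b",
"metadata": {},
"source": [
"A quick sketch computes these rates for our model at the p = 0.5 cutoff (refitting the model so the example is self-contained).\n",
"\n",
"```r\n",
"mtcars_logit <- glm(am ~ mpg, data = mtcars, family = \"binomial\")\n",
"pred <- as.integer(fitted(mtcars_logit) > 0.5)   # classify at the 0.5 threshold\n",
"tab <- table(predicted = pred, actual = mtcars$am)\n",
"tab\n",
"sensitivity <- tab[\"1\", \"1\"] / sum(tab[, \"1\"])   # true positive rate\n",
"specificity <- tab[\"0\", \"0\"] / sum(tab[, \"0\"])   # true negative rate\n",
"c(sensitivity = sensitivity, specificity = specificity)\n",
"```"
]
},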
{
"cell_type": "markdown",
"id": "fd7ee52f-ee51-49b1-91e0-07e08df7d7be",
"metadata": {},
"source": [
"With this data we can plot a <b>receiver operating characteristic</b> curve to see the trade-off between the true positive and false positive rates in our model."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a4a7b885-d239-4ea5-a4e5-995e8b7e2206",
"metadata": {},
"outputs": [],
"source": [
"roc(am ~ mpg, mtcars)\n",
"\n",
"# The pROC package is used to generate ROC curves, in this case based on a logistic model\n",
"# The roc function in its most basic form returns some basic information including the area under the curve (AUC)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a518083-0851-41c5-a416-d07eb42c3f9b",
"metadata": {},
"outputs": [],
"source": [
"roc(mtcars$am, mtcars$mpg, percent=FALSE,\n",
" ci=TRUE, boot.n=100, ci.alpha=0.9, stratified=FALSE,\n",
" plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE,\n",
" print.auc=TRUE, show.thres=TRUE)\n",
"\n",
"# We can also use roc to plot the curve\n",
"# The first line of arguments indicates the response (am) and predictor (mpg) variables and that the values should not be plotted as percentages\n",
"# The second line calculates confidence intervals and sets the arguments for that calculation\n",
"# The remaining lines indicate the curve should be plotted, set graphical parameters, and print the AUC value on the plot"
]
},
{
"cell_type": "markdown",
"id": "0a239a75-4629-4cc1-8ce4-f465e6efbc3a",
"metadata": {},
"source": [
"What does this curve mean? Each point on the curve corresponds to a classification threshold. At the most lenient threshold, every car is classified as manual: sensitivity is maximal, but so is the false positive rate (specificity = 0). As we tighten the threshold, specificity improves while sensitivity initially stays high. Past the bend in the curve, however, further gains in specificity come at the cost of a drastic drop in sensitivity, as more manual-transmission cars are incorrectly classified as automatic."
]
},
{
"cell_type": "markdown",
"id": "26feaa5a-1713-4065-9e7c-54459ad0eebe",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-info\">\n",
" <b>✋ Tip:</b> The R package <b>pROC</b> that we used to generate the curve directly plots sensitivity vs. specificity, which is why the x axis goes from 100 to 0. You will also see ROC curves plotted as sensitivity vs. 1 - specificity which gives a more traditional 0 to 100 x axis.</div>"
]
},
{
"cell_type": "markdown",
"id": "b57004c2-3daa-41b6-8822-b85869bdcbe0",
"metadata": {},
"source": [
"<b>AUC</b> is the <b>Area Under the Curve</b> and is the number we usually look at to evaluate the curve; it represents the probability that a randomly chosen positive case is ranked above a randomly chosen negative case. At AUC = 0.5, the model cannot distinguish between the classes (i.e. they can't be classified), while AUC = 1 indicates perfect classification of the data. AUC = 0.830 isn't bad but indicates the model could probably be improved."
]
},
{
"cell_type": "markdown",
"id": "844ac3f6-9512-4112-b195-c68a84bfe755",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "76d74ca0-77b0-4324-8134-f79b1d229a6b",
"metadata": {},
"source": [
"### Further Exploration"
]
},
{
"cell_type": "markdown",
"id": "4e272fa7-0f7d-4485-912f-b066f399438b",
"metadata": {},
"source": [
"The material included in this submodule covers only a fraction of linear algebra and linear modeling. We have included useful references below for those who wish to explore the topics in more detail. Specific topics we recommend to users:"
]
},
{
"cell_type": "markdown",
"id": "16740d11-1083-4615-9144-ef0f81dc637a",
"metadata": {},
"source": [
"#### Experimental Design"
]
},
{
"cell_type": "markdown",
"id": "0659b5a0-8a65-4e2b-b6f9-2b1dde258621",
"metadata": {},
"source": [
"We often think of the scientific method as a circular process involving generating a hypothesis, designing experiments, running experiments, analyzing results, and testing the hypothesis. In practice, experimental design and data analysis are intimately linked. When designing your experiments, you are in essence having a conversation with your future self on how best to design the experiment to get the data you will need. Factors to consider include:\n",
"\n",
"- What analytical methods will you need to use?\n",
"- What covariates will you include in your model? Do you expect interaction effects?\n",
"- What confounding variables could affect your results? Will you account for them in the design stage or in the analysis stage?\n",
"- What controls will you need to account for background noise or to make appropriate comparisons?\n",
"- How many replicates will you need? What significance or effect size do you expect? Have you performed a power analysis?\n",
"\n",
"No amount of data analysis will compensate for bad experimental design, so taking time at the beginning to carefully plan your experiments will save you a lot of time and effort during the data analysis step."
]
},
{
"cell_type": "markdown",
"id": "a33085e0-35d3-489c-bc9e-4aec94299337",
"metadata": {},
"source": [
"#### Advanced Models"
]
},
{
"cell_type": "markdown",
"id": "8d131f07-735a-41cb-becb-328128989221",
"metadata": {},
"source": [
"We covered simple linear and logistic regression models with a single independent variable and multiple regression models with a single covariate with multiple levels or with multiple covariates. The more covariates you add to your model, the more complicated interpretation of the model becomes. The type of covariate you introduce will also affect the analysis method you choose and the interpretation of the data. Some such models are:\n",
"\n",
"- __Multi-factorial Models__ - When working with multiple covariates each with multiple levels, factorial designs that compare all combinations of the factors are common. Factorial designs are closely related to multiple regression and are very useful in analyzing interaction effects, that is, effects that arise when covariates do not act independently of each other, so that the effect of one depends on the level of another (e.g. height and weight).\n",
"- __Repeated-Measures (Longitudinal) Models__ - Often we want to take measurements from the same sample at multiple time points to determine the change in the measurement over time (as opposed to cross-sectional studies that collect measurements at a single time point). Longitudinal studies can be used to study cohorts of similar individuals, to study effects in a single individual, or to study short-term vs. long-term effects. Repeated-measures designs are common in ANOVA, but care must be taken to account for inflated false positive rates.\n",
"- __Multivariate Models__ - Sometimes we deal with systems with multiple dependent variables. This situation is common with high-dimensional data sets such as we encounter with omics data sets. Most of the methods we've discussed in this submodule have multivariate counterparts (e.g. multivariate linear/logistic regression, MANOVA, MANCOVA, etc.). Other multivariate methods such as principal component analysis (PCA) will be discussed in <b>Submodule 8: Introduction to Exploratory Analysis</b>.\n",
"- __Linear Mixed Models__ - A variation of linear models, LMMs treat covariates as fixed or random variables. A fixed variable is what we usually think of when talking about covariates (i.e. something we measure), while a random variable represents random properties of a population (e.g. an effect of a group of mice living in the same cage). LMMs are very useful for dealing with batch effects and other confounding variables.\n",
"- __Non-Linear Models__ - Not every set of variables shows a linear relationship. For example, population growth tends to follow a logistic pattern. Non-linear models are useful for dealing with these situations."
]
},
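{
"cell_type": "markdown",
"id": "3f7a2b1c",
"metadata": {},
"source": [
"To make the first of these ideas concrete, the cell below sketches a two-factor factorial model with an interaction term using base R's built-in `CO2` dataset (chosen purely for illustration; it is not one of this module's example datasets). In the formula, `Type * Treatment` expands to both main effects plus the `Type:Treatment` interaction."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9c4e8d2f",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: two-factor factorial model with an interaction term.\n",
"# CO2 is a base R dataset: uptake (CO2 uptake rate) by Type (plant origin)\n",
"# and Treatment (chilled vs. nonchilled).\n",
"fit <- lm(uptake ~ Type * Treatment, data = CO2)\n",
"summary(fit)\n",
"\n",
"# The Type:Treatment row of the ANOVA table tests whether the chilling\n",
"# effect differs between Quebec and Mississippi plants.\n",
"anova(fit)\n",
"\n",
"# For comparison, a linear mixed model with a hypothetical random `cage`\n",
"# intercept would, with the lme4 package installed, look like:\n",
"# lme4::lmer(weight ~ diet + (1 | cage), data = d)"
]
},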
{
"cell_type": "markdown",
"id": "b1c28be0-53fc-4224-a3e5-b46ed241ee7a",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "79adc974",
"metadata": {},
"source": [
"<p><span style=\"font-size: 30px\"><b>Quizzes</b></span> <span style=\"float : inline;\">(run the command below to display the quizzes)</span> </p>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d966505e",
"metadata": {},
"outputs": [],
"source": [
"IRdisplay::display_html('<iframe src=\"quizes/Chapter3_Quizes.html\" width=100% height=450></iframe>')"
]
},
{
"cell_type": "markdown",
"id": "b2fd3e07",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7c1cca4",
"metadata": {},
"outputs": [],
"source": [
"sessionInfo()"
]
},
{
"cell_type": "markdown",
"id": "a8de584c",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"id": "04d28f46-44d1-4ad9-987f-8646f84a0663",
"metadata": {},
"source": [
"## Conclusions\n",
"\n",
"Aside from basic statistics, linear algebra may be the the most useful topic bioinformatians and data sciences can learn. Linear models are fundamental to many processes in math and statistics and a deeper understanding of how models work will aid you at every step of the scientific method. In particular, linear models are fundamental to:\n",
"\n",
"+ Many downstream omics data analyses such as differential analysis, pathway/gene set enrichment analysis, and exploratory analysis.\n",
"+ Exploratory analysis methods such as principle components analysis\n",
"+ Many types of machine learning algorithms including regression models and neural networks.\n",
"\n",
"An understanding of linear models also provides a foundation for understanding non-linear models which are used in methods such as t-SNE. We encourage you to explore these topics in more detail on your own."
]
},
{
"cell_type": "markdown",
"id": "fd878da4",
"metadata": {},
"source": [
"## Clean up\n",
"\n",
"Remember to move to the next notebook or shut down your instance if you are finished."
]
},
{
"cell_type": "markdown",
"id": "92f8c3ad-1bd5-48cf-b620-3fc037912fdc",
"metadata": {},
"source": [
"<div style=\"display: flex; justify-content: center; margin-top: 20px; width: 100%;\"> \n",
" <div style=\"display: flex; justify-content: space-between; width: 50%;\"> \n",
" <div> \n",
" <a href=https://github.com/NIGMS/Analysis-of-Biomedical-Data-for-Biomarker-Discovery/blob/master/GoogleCloud/Submodule02_Intro_to_R_Data_Structures.ipynb#overview>Previous section</a> \n",
" </div> \n",
" <div> \n",
" <a href=\"#top3\">Top of this page</a> \n",
" </div> \n",
" <div> \n",
" <a href=https://github.com/NIGMS/Analysis-of-Biomedical-Data-for-Biomarker-Discovery/blob/master/GoogleCloud/Submodule04_Intro_to_Exploratory_Analysis.ipynb#overview>Next section</a>\n",
" </div> \n",
" </div>\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "a43ba4c4",
"metadata": {},
"source": [
"## References"
]
},
{
"cell_type": "markdown",
"id": "2e514c00-925a-4dcd-ae0a-cb757b0de30b",
"metadata": {},
"source": [
"### Core Reading\n",
"[Boyd S, Vandenberghe L: Introduction to Applied Linear Algebra (Vectors, Matrices, and Least Squares). 2018, Cambridge University Press][boyd]<br>\n",
"[Dunn PK, Smyth GK: Generalized Linear Models With Examples in R. 2018, Springer Science+Business Media, LLC][dunn]<br>\n",
"\n",
"\n",
"### Further Reading\n",
"[Zelterman D: Applied Multivariate Statistics with R. 2015, Springer International Publishing AG Switzerland][zelterman]<br>\n",
"[Dean A, Voss D, Draguljic D: Design and Analysis of Experiments. 2017, Springer International Publishing AG Switzerland][dean]<br>\n",
"[Heiberger RM, Holland B: Statistical Analysis and Data Display (An Imtermediate Course with Examples in R). 2015, Springer Science+Business Media, LLC][Heiberger]<br>\n",
"\n",
"[boyd]: https://www.cambridge.org/highereducation/books/introduction-to-applied-linear-algebra/4D69AF22E38303FE20FFEEFDCE0E7F96#overview \"Boyd S, Vandenberghe L.: Introduction to Applied Linear Algebra (Vectors, Matrices, and Least Squares). 2018, Cambridge University Press\"\n",
"[dunn]: https://link.springer.com/book/10.1007/978-1-4419-0118-7 \"Generalized Linear Models With Examples in R. 2018, Springer Science+Business Media, LLC\"\n",
"[zelterman]: https://link.springer.com/book/10.1007/978-3-319-14093-3 \"Zelterman D: Applied Multivariate Statistics with R. 2015, Springer International Publishing AG Switzerland\"\n",
"[dean]: https://link.springer.com/book/10.1007/978-3-319-52250-0 \"Dean A, Voss D, Draguljic D: Design and Analysis of Experiments. 2017, Springer International Publishing AG Switzerland\"\n",
"[heiberger]: https://link.springer.com/book/10.1007/978-1-4939-2122-5 \"Heiberger RM, Holland B: Statistical Analysis and Data Display (An Imtermediate Course with Examples in R). 2015, Springer Science+Business Media, LLC\""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "R (Local)",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "4.4.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}