
--- a
+++ b/demo/notebooks/demo_KMplot.ipynb
@@ -0,0 +1,156 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Generation of Kaplan-Meier plots"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This notebook describe the code for constructing Kaplan-Meier plots for model predictions. In our paper, we use this code to generating KM plots for both our deep learning network (4DSurvival) and also our conventional parameter model.\n",
+    "\n",
+    "In the current context, we use Kaplan-Meier plots to provide a visual depiction of the risk stratification ability of our prediction models. This is done using the output of our bootstrap-based internal validation procedure (described in our paper). Recall that each bootstrap sample is created by taking $n$ random draws (with replacement) from the full sample (where the $n$ is the number of unique subjects in the full sample). Due to sampling with replacement, each bootstrap sample will likely contain replicates of the same subjects, and some subjects from the full sample (from which the bootstrap sample was drawn) will be absent from the bootstrap sample. In fact, on average, and for large $n$, each boostrap sample will only contain ~63.2% ($=1 - e^{-1}$) of the subjects from the full sample. This means that for each bootstrap sample, there will almost always be a fraction of subjects excluded. In machine learning literature, this subsample is sometimes referred to as the *out-of-bag* subsample (and conversely, subjects included in the bootstrap sample are termed '*in-bag*'). For a model trained on a particular bootstrap sample $b$, we can compute its predicted values for subjects in the *out-of-bag* subsample of $b$. And after training a series of models over $b = {1,...,B}$ bootstrap samples, we can, for each subject in the full sample, identify the bootstrap samples for which that subject was *out-of-bag*, and average the subject's predictions across these bootstrap samples. Thus for each subject, this will yield a predicted risk computed by aggregating predictions from models trained with data excluding that subject. This yields unbiased predicted risks for each subject. We generate Kaplan-Meier plots for the full sample using these *out-of-bag* predictions.\n",
+    "\n",
+    "We now describe the code step by step.\n",
+    "First, required libraries are imported:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "deletable": false,
+    "editable": false},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "import pickle\n",
+    "import lifelines\n",
+    "from lifelines import KaplanMeierFitter\n",
+    "import matplotlib\n",
+    "import matplotlib.pyplot as plt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next, we import data we will use. Recall that this data is the output of our bootstrap-based model validation procedure."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "deletable": false,
+    "editable": false},
+   "outputs": [],
+   "source": [
+    "with open('../data/bootout_conv.pkl', 'rb') as f: inputdata_list=pickle.load(f)\n",
+    "y_orig = inputdata_list[0]\n",
+    "preds_bootfull = inputdata_list[1]\n",
+    "inds_inbag = inputdata_list[2]\n",
+    "del inputdata_list"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In the above code, `y_orig` is an $n \\times 2$ matrix of survival outcomes for the full sample, the first column of which represents censoring status while the second column represents survival/censoring times. \n",
+    "\n",
+    "Variable `preds_bootfull` represents predictions of the models fit on $B$ bootstrap samples, for each subject in the full sample. Specifically, variable `preds_bootfull` is a list of length $B$ (where $B$ is the total number of bootstrap samples used in the model validation procedure). Each element in this list is a *numpy* array of dimensions $n \\times 1$, where $n$ is the sample size (number of subjects in the full sample). For the array in list item $b$ of `preds_bootfull`, each element $i$ ($i = 1,..,n$) of this array represents the predicted risk for subject $i$ (in the full sample) by the model trained on the $b^{th}$ bootstrap sample. \n",
+    "\n",
+    "Finally, `inds_inbag` is a list of length $B$ where each list item $b$ ($b=1,...,B$) gives the indices of subjects from the full sample who were selected in the $b^{th}$ bootstrap sample (i.e. '*in-bag*' subjects). Note that because bootstrap samples are by definition selected with replacement, there will be several relicated indices (representing subjects selected multiple times in a bootstrap sample). Each list item $b$ in `inds_inbag` will have length $n$ (the number of selections in each bootstrap sample). The information in `inds_inbag` allows us to determine, for each bootstrap sample $b$, which subjects in the full sample were *out-of-bag*. This will be used to extract the aforementioned *out-of-bag* predictions from variable `preds_bootfull`, which we demonstrate below:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "deletable": false,
+    "editable": false},
+   "outputs": [],
+   "source": [
+    "preds_bootfull_mat = np.concatenate(preds_bootfull, axis=1)\n",
+    "inds_inbag_mat = np.array(inds_inbag).T\n",
+    "inbag_mask = 1*np.array([np.any(inds_inbag_mat==_, axis=0) for _ in range(inds_inbag_mat.shape[0])])\n",
+    "preds_bootave_oob = np.divide(np.sum(np.multiply((1-inbag_mask), preds_bootfull_mat), axis=1), np.sum(1-inbag_mask, axis=1))"
+   ]
+  },
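+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As an optional sanity check (not required for the plots below), we can confirm that the average *in-bag* fraction per bootstrap sample is close to the theoretical ~63.2% ($=1 - e^{-1}$) noted above, and that every subject is *out-of-bag* in at least one bootstrap sample (otherwise its *out-of-bag* average would be undefined):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional check: `inbag_mask` has shape (n, B), with entry (i, b) equal to 1\n",
+    "# if subject i was drawn into bootstrap sample b and 0 otherwise\n",
+    "print('Mean in-bag fraction per bootstrap sample: %.3f' % inbag_mask.mean())\n",
+    "print('Minimum out-of-bag count per subject: %d' % (1 - inbag_mask).sum(axis=1).min())"
+   ]
+  },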
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Above, variable `preds_bootave_oob` represents the *out-of-bag* predictions for all subjects in the full sample.\n",
+    "\n",
+    "As mentioned in the paper, these predicted risks are used to categorize subjects into low- and high-risk groups, according to the risk score sample median:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "deletable": false,
+    "editable": false},
+   "outputs": [],
+   "source": [
+    "risk_groups = 1*(preds_bootave_oob > np.median(preds_bootave_oob))\n",
+    "wdf =  pd.DataFrame(np.concatenate((y_orig,preds_bootave_oob[:,np.newaxis],risk_groups[:,np.newaxis]), axis=-1), columns=['status','time','preds','risk_groups'], index=[str(_) for _ in risk_groups])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We then plot the Kaplan-Meier curves below:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "deletable": false,
+    "editable": false},
+   "outputs": [],
+   "source": [
+    "kmf = KaplanMeierFitter()\n",
+    "ax = plt.subplot(111)\n",
+    "kmf.fit(durations=wdf.loc['0','time'], event_observed=wdf.loc['0','status'], label=\"Low Risk\")\n",
+    "ax = kmf.plot(ax=ax)\n",
+    "kmf.fit(durations=wdf.loc['1','time'], event_observed=wdf.loc['1','status'], label=\"High Risk\")\n",
+    "ax = kmf.plot(ax=ax)\n",
+    "plt.ylim(0,1)\n",
+    "plt.title(\"Kaplan-Meier Plots\")\n",
+    "plt.xlabel('Time (days)')\n",
+    "plt.ylabel('Survival Probability')"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}