{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Bootstrap Internal Validation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook describes the code implementing the bootstrap-based internal validation procedure used for the models compared in our paper. We demonstrate the procedure on the conventional parameter model; a similar procedure was applied to the deep learning model.\n",
    "\n",
    "To estimate how well our model would generalize to an external validation cohort, we assessed its predictive accuracy within the training sample using a bootstrap-based procedure recommended in the guidelines for *Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD)*. This procedure derives realistic, 'optimism-adjusted' estimates of the model's generalization accuracy using only the training sample.\n",
    "\n",
    "We first import the required libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false},
   "outputs": [],
   "source": [
    "import os, sys, pickle\n",
    "import optunity, lifelines\n",
    "from lifelines.utils import concordance_index\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we import the functions required to train the conventional parameter model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false},
   "outputs": [],
   "source": [
    "sys.path.insert(0, '../code')\n",
    "from CoxReg_Single_run import *\n",
    "from hypersearch import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import the data, where `x_full` is the $n \\times p$ input matrix of volumetric measures ($n$=sample size, $p$=dimensionality of input vector) and `y_full` is an $n \\times 2$ matrix of outcomes, where column 1 is the censoring status (1 if the event was observed, 0 if censored) and column 2 is the survival/censoring time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false},
   "outputs": [],
   "source": [
    "with open('../data/inputdata_conv.pkl', 'rb') as f: c3 = pickle.load(f)\n",
    "x_full = c3[0]\n",
    "y_full = c3[1]\n",
    "del c3"
   ]
  },
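  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the pickled input file is unavailable, the rest of the notebook can still be exercised with a synthetic stand-in of the same shape. The sizes, distributions, and seed below are illustrative assumptions, not the study data (if the previous cell failed because the file is missing, run this cell instead):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#hypothetical stand-in data (illustration only, NOT the study cohort)\n",
    "if not os.path.exists('../data/inputdata_conv.pkl'):\n",
    "    n_demo, p_demo = 300, 20  #assumed sample size and input dimensionality\n",
    "    np.random.seed(0)\n",
    "    x_full = np.random.randn(n_demo, p_demo)  #n x p volumetric-measure stand-in\n",
    "    status = np.random.binomial(1, 0.7, size=n_demo)  #column 1: censoring status\n",
    "    times = np.random.exponential(scale=365.0, size=n_demo)  #column 2: survival/censoring time\n",
    "    y_full = np.column_stack([status, times])"
   ]
  },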
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initialize empty lists to store predictions and performance measures:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false},
   "outputs": [],
   "source": [
    "preds_bootfull = []\n",
    "inds_inbag = []\n",
    "Cb_opts  = []"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we implement each step of the bootstrap internal validation procedure, as outlined in the manuscript:\n",
    "\n",
    "### Step 1\n",
    "Train a prediction model on the full sample:\n",
    "#### 1(a) : Find optimal hyperparameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false},
   "outputs": [],
   "source": [
    "opars, osummary = hypersearch_cox(x_data=x_full, y_data=y_full, method='particle swarm', nfolds=6, nevals=50, penalty_range=[-2,1])"
   ]
  },
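  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`hypersearch_cox` lives in `../code/hypersearch.py`. As a rough guide to what such a search involves, here is a minimal sketch written with `optunity`, maximizing the cross-validated concordance index over the `log10`(penalty) box `penalty_range` with the requested solver. The body below is an assumption for illustration, not the repository implementation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def hypersearch_cox_sketch(x_data, y_data, method, nfolds, nevals, penalty_range):\n",
    "    #optunity's decorator handles the fold splitting; it calls the wrapped\n",
    "    #function once per fold with x_train/y_train/x_test/y_test filled in\n",
    "    @optunity.cross_validated(x=x_data, y=y_data, num_folds=nfolds)\n",
    "    def modelrun(x_train, y_train, x_test, y_test, penalty):\n",
    "        cvmod = coxreg_single_run(xtr=x_train, ytr=y_train, penalty=10**penalty)\n",
    "        cvpreds = cvmod.predict_partial_hazard(x_test)\n",
    "        return concordance_index(y_test[:,1], -cvpreds, y_test[:,0])\n",
    "\n",
    "    #search penalty_range for the penalty exponent maximizing cross-validated concordance\n",
    "    optimal_pars, search_log, _ = optunity.maximize(modelrun, num_evals=nevals,\n",
    "                                                    solver_name=method, penalty=penalty_range)\n",
    "    return optimal_pars, search_log"
   ]
  },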
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1(b) : Using optimal hyperparameters, train a model on the full sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false},
   "outputs": [],
   "source": [
    "omod = coxreg_single_run(xtr=x_full, ytr=y_full, penalty=10**opars['penalty'])"
   ]
  },
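  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`coxreg_single_run` lives in `../code/CoxReg_Single_run.py` and fits a penalized Cox proportional hazards model. A plausible minimal version built on `lifelines.CoxPHFitter` is sketched below; the DataFrame layout and column names are assumptions for illustration, not the repository implementation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "def coxreg_single_run_sketch(xtr, ytr, penalty):\n",
    "    #lifelines expects one DataFrame holding the predictors plus\n",
    "    #duration and event-indicator columns\n",
    "    df = pd.DataFrame(xtr)\n",
    "    df['time'] = ytr[:,1]    #survival/censoring time\n",
    "    df['status'] = ytr[:,0]  #censoring status (1 = event observed)\n",
    "    cph = lifelines.CoxPHFitter(penalizer=penalty)  #penalized Cox model\n",
    "    cph.fit(df, duration_col='time', event_col='status')\n",
    "    return cph"
   ]
  },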
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1(c) : Compute Harrell's Concordance index ($C_{full}^{full}$)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false},
   "outputs": [],
   "source": [
    "predfull = omod.predict_partial_hazard(x_full)\n",
    "C_app = concordance_index(y_full[:,1], -predfull, y_full[:,0])\n",
    "\n",
    "print('\\n\\n==================================================')\n",
    "print('Apparent concordance index = {0:.4f}'.format(C_app))\n",
    "print('==================================================\\n\\n')"
   ]
  },
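  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note the sign flip on the predictions: `predict_partial_hazard` returns relative hazards, which are larger for subjects expected to survive *shorter*, whereas `concordance_index` expects scores that increase with survival time. A toy example (values chosen purely for illustration) shows why the negation matters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#three subjects, all events observed; hazards perfectly anti-concordant with times\n",
    "toy_times = np.array([1.0, 2.0, 3.0])\n",
    "toy_hazards = np.array([3.0, 2.0, 1.0])\n",
    "toy_events = np.array([1, 1, 1])\n",
    "print(concordance_index(toy_times, toy_hazards, toy_events))   #0.0: raw hazards are anti-concordant\n",
    "print(concordance_index(toy_times, -toy_hazards, toy_events))  #1.0: negated hazards are concordant"
   ]
  },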
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`C_app` ($C_{full}^{full}$) represents the apparent predictive accuracy, i.e. the inflated accuracy obtained when a model is tested on the same sample on which it was trained/optimized.\n",
    "\n",
    "In the next steps, we use bootstrap sampling to estimate the optimism, which we then use to adjust the apparent predictive accuracy.\n",
    "\n",
    "#### Bootstrap sampling\n",
    "We take B = 100 bootstrap samples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false},
   "outputs": [],
   "source": [
    "#define useful variables\n",
    "nsmp = len(x_full)\n",
    "rowids = list(range(nsmp))\n",
    "B = 100\n",
    "\n",
    "for b in range(B):\n",
    "    print('\\n-------------------------------------')\n",
    "    print('Current bootstrap sample:', b+1, 'of', B)\n",
    "    print('-------------------------------------')\n",
    "\n",
    "    #STEP 2: Generate a bootstrap sample by doing n random selections with replacement (where n is the sample size)\n",
    "    b_inds = np.random.choice(rowids, size=nsmp, replace=True)\n",
    "    xboot = x_full[b_inds]\n",
    "    yboot = y_full[b_inds]\n",
    "\n",
    "    #(2a) find optimal hyperparameters\n",
    "    bpars, bsummary = hypersearch_cox(x_data=xboot, y_data=yboot, method='particle swarm', nfolds=6, nevals=50, penalty_range=[-2,1])\n",
    "    \n",
    "    #(2b) using optimal hyperparameters, train a model on bootstrap sample\n",
    "    bmod = coxreg_single_run(xtr=xboot, ytr=yboot, penalty=10**bpars['penalty'])\n",
    "    \n",
    "    #(2c[i])  Using bootstrap-trained model, compute predictions on bootstrap sample. Evaluate accuracy of predictions (Harrell's Concordance index)\n",
    "    predboot = bmod.predict_partial_hazard(xboot)\n",
    "    Cb_boot = concordance_index(yboot[:,1], -predboot, yboot[:,0])\n",
    "    \n",
    "    #(2c[ii]) Using bootstrap-trained model, compute predictions on FULL sample. Evaluate accuracy of predictions (Harrell's Concordance index)\n",
    "    predbootfull = bmod.predict_partial_hazard(x_full)\n",
    "    Cb_full = concordance_index(y_full[:,1], -predbootfull, y_full[:,0])\n",
    "\n",
    "    #STEP 3: Compute optimism for bth bootstrap sample, as difference between results from 2c[i] and 2c[ii]\n",
    "    Cb_opt = Cb_boot - Cb_full\n",
    "    \n",
    "    #store data on current bootstrap sample (predictions, C-indices)\n",
    "    preds_bootfull.append(predbootfull)\n",
    "    inds_inbag.append(b_inds)\n",
    "    Cb_opts.append(Cb_opt)\n",
    "\n",
    "    del bpars, bmod"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we compute the bootstrap-estimated optimism by averaging the optimism estimates across the B bootstrap samples: $$\\frac{1}{B}\\sum_{b=1}^{B} \\bigg( C_{b}^{b} - C_{b}^{full} \\bigg)$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false},
   "outputs": [],
   "source": [
    "C_opt = np.mean(Cb_opts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we adjust the apparent C using the bootstrap-estimated optimism: $C_{adj} = C_{full}^{full} - C_{opt}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false},
   "outputs": [],
   "source": [
    "C_adj = C_app - C_opt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we compute a 95% confidence interval for the optimism-adjusted C, taking the 2.5th and 97.5th percentiles of the adjusted values across the B bootstrap samples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false},
   "outputs": [],
   "source": [
    "C_opt_95confint = np.percentile([C_app - o for o in Cb_opts], q=[2.5, 97.5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false},
   "outputs": [],
   "source": [
    "print('Optimism bootstrap estimate = {0:.4f}'.format(C_opt))\n",
    "print('Optimism-adjusted concordance index = {0:.4f}, and 95% CI = {1}'.format(C_adj, C_opt_95confint))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}