--- a +++ b/demo/notebooks/demo_validate.ipynb @@ -0,0 +1,307 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Bootstrap Internal Validation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this notebook, we will describe the code that implements the procedure we use to perform bootstrap-based internal validation of the models compared in our paper. We show the application of this validation procedure to the conventional parameter model. A similar procedure was applied to the deep learning model. \n", + "\n", + "In order to get a sense of how well our model would generalize to an external validation cohort, we assessed its predictive accuracy within the training sample using a bootstrap-based procedure recommended in the guidelines for *Transparent Reporting of a multivariable model for Individual Prognosis Or Diagnosis (Tripod)*. This procedure attempts to derive realistic, 'optimism-adjusted' estimates of the model's generalization accuracy using the training sample\n", + "\n", + "We first import required libraries:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false}, + "outputs": [], + "source": [ + "import os, sys, pickle\n", + "import optunity, lifelines\n", + "from lifelines.utils import concordance_index\n", + "import numpy as np" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we import the functions required to train the conventional parameter model:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false}, + "outputs": [], + "source": [ + "sys.path.insert(0, '../code')\n", + "from CoxReg_Single_run import *\n", + "from hypersearch import *" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Import data, where `x_full` is the $n \\times p$ input matrix of volumetric measures ($n$=sample size, $p$=dimensionality of input vector), `y_full` is an $n \\times 2$ matrix of outcomes where column 1 represents censoring status and column 2 represents survival/censoring time. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false}, + "outputs": [], + "source": [ + "with open('../data/inputdata_conv.pkl', 'rb') as f: c3 = pickle.load(f)\n", + "x_full = c3[0]\n", + "y_full = c3[1]\n", + "del c3" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Initialize empty lists to store predictions and performance measures:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false}, + "outputs": [], + "source": [ + "preds_bootfull = []\n", + "inds_inbag = []\n", + "Cb_opts = []" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we implement each step of the bootstrap internal validation procedure, as outlined in the manuscript:\n", + "\n", + "### Step 1\n", + "Train a prediction model on the full sample:\n", + "#### 1(a) : Find optimal hyperparameters" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false}, + "outputs": [], + "source": [ + "opars, osummary = hypersearch_cox(x_data=x_full, y_data=y_full, method='particle swarm', nfolds=6, nevals=50, penalty_range=[-2,1])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 1(b) : using optimal hyperparameters, train a model on full sample" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false}, + "outputs": [], + "source": [ + "omod = coxreg_single_run(xtr=x_full, ytr=y_full, penalty=10**opars['penalty'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 1(c) : Compute Harrell's Concordance index ($C_{full}^{full}$)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false}, + "outputs": [], + "source": [ + "predfull = omod.predict_partial_hazard(x_full)\n", + "C_app = concordance_index(y_full[:,1], -predfull, y_full[:,0])\n", + "\n", + "print('\\n\\n==================================================')\n", + "print('Apparent concordance index = {0:.4f}'.format(C_app))\n", + "print('==================================================\\n\\n')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`C_app` ($C_{full}^{full}$) represents the apparent predictive accuracy, i.e. the inflated accuracy obtained when a model is tested on the same sample on which it was trained/optimized \n", + "\n", + "In the next steps, we use bootstrap sampling to estimate the optimism, which we then use to adjust the apparent predictive accuracy.\n", + "\n", + "#### Bootstrap sampling\n", + "We will take B = 100 bootstrap samples" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false}, + "outputs": [], + "source": [ + "#define useful variables\n", + "nsmp = len(x_full)\n", + "rowids = [_ for _ in range(nsmp)]\n", + "B = 100\n", + "\n", + "for b in range(B):\n", + " print('\\n-------------------------------------')\n", + " print('Current bootstrap sample:', b, 'of', B-1)\n", + " print('-------------------------------------')\n", + "\n", + " #STEP 2: Generate a bootstrap sample by doing n random selections with replacement (where n is the sample size)\n", + " b_inds = np.random.choice(rowids, size=nsmp, replace=True)\n", + " xboot = x_full[b_inds]\n", + " yboot = y_full[b_inds]\n", + "\n", + " #(2a) find optimal hyperparameters\n", + " bpars, bsummary = hypersearch_cox(x_data=xboot, y_data=yboot, method='particle swarm', nfolds=6, nevals=50, penalty_range=[-2,1])\n", + " \n", + " #(2b) using optimal hyperparameters, train a model on bootstrap sample\n", + " bmod = coxreg_single_run(xtr=xboot, ytr=yboot, penalty=10**bpars['penalty'])\n", + " \n", + " #(2c[i]) Using bootstrap-trained model, compute predictions on bootstrap sample. Evaluate accuracy of predictions (Harrell's Concordance index)\n", + " predboot = bmod.predict_partial_hazard(xboot)\n", + " Cb_boot = concordance_index(yboot[:,1], -predboot, yboot[:,0])\n", + " \n", + " #(2c[ii]) Using bootstrap-trained model, compute predictions on FULL sample. Evaluate accuracy of predictions (Harrell's Concordance index)\n", + " predbootfull = bmod.predict_partial_hazard(x_full)\n", + " Cb_full = concordance_index(y_full[:,1], -predbootfull, y_full[:,0])\n", + "\n", + " #STEP 3: Compute optimism for bth bootstrap sample, as difference between results from 2c[i] and 2c[ii]\n", + " Cb_opt = Cb_boot - Cb_full\n", + " \n", + " #store data on current bootstrap sample (predictions, C-indices)\n", + " preds_bootfull.append(predbootfull)\n", + " inds_inbag.append(b_inds)\n", + " Cb_opts.append(Cb_opt)\n", + "\n", + " del bpars, bmod" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we compute bootstrap-estimated optimism, by averaging the optimism estimates across the B bootstrap samples: $$\\frac{1}{B}\\sum_{b=1}^{B} \\bigg( C_{b}^{b} - C_{b}^{full} \\bigg)$$" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false}, + "outputs": [], + "source": [ + "C_opt = np.mean(Cb_opts)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we adjust the apparent C using the bootstrap-estimated optimism:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false}, + "outputs": [], + "source": [ + "C_adj = C_app - C_opt" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we compute confidence intervals for optimism-adjusted C:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false}, + "outputs": [], + "source": [ + "C_opt_95confint = np.percentile([C_app - o for o in Cb_opts], q=[2.5, 97.5])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false}, + "outputs": [], + "source": [ + "print('Optimism bootstrap estimate = {0:.4f}'.format(C_opt))\n", + "print('Optimism-adjusted concordance index = {0:.4f}, and 95% CI = {1}'.format(C_adj, C_opt_95confint))" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.6" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}