# Statistical comparison of model performance

In this notebook, we describe how we compare the accuracy of the two predictive models used in this study: the deep learning network (4DSurvival) and the conventional parameter model (which uses volumetric indices of RV function).

To compare 2 models, we compare their optimism-corrected concordance indices (which are derived by the bootstrap internal validation procedure outlined in our paper). From equation (9) in the paper, the optimism-corrected concordance index for each model is given by: $$C_{corrected} = C_{full}^{full} - \frac{1}{B}\sum_{b=1}^{B} \bigg( C_{b}^{b} - C_{b}^{full} \bigg)$$
In the above equation, the symbol $C_{s_{1}}^{s_{2}}$ refers to the concordance index of a model trained on sample $s_1$ and tested on sample $s_2$. The first term, $C_{full}^{full}$, is the *apparent* predictive accuracy, i.e. the (inflated) concordance index obtained when a model trained on the full sample is then tested on the same sample. The second term is the average *optimism* over the $B$ bootstrap samples, where the optimism for bootstrap sample $b$ is the difference between *bootstrap performance* ($C_{b}^{b}$) and *test performance* ($C_{b}^{full}$).
Note that we can rewrite the equation above as: $$C_{corrected} = \frac{1}{B}\sum_{b=1}^{B} \bigg[ C_{full}^{full} - \bigg( C_{b}^{b} - C_{b}^{full} \bigg) \bigg]$$
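
The equivalence of the two forms can be checked numerically. The following is a minimal sketch using made-up values (the numbers and the variable names `C_full_full`, `C_b_b`, `C_b_full` are hypothetical and are not results from the study):

```python
import numpy as np

# Hypothetical values for illustration only (not results from the study).
C_full_full = 0.75                               # apparent concordance index
C_b_b = np.array([0.78, 0.77, 0.79, 0.76])       # bootstrap performance, one per sample b
C_b_full = np.array([0.73, 0.74, 0.72, 0.74])    # test performance, one per sample b

# Optimism of each bootstrap sample
optimism = C_b_b - C_b_full

# First form: apparent accuracy minus the average optimism
C_corrected_v1 = C_full_full - optimism.mean()

# Second form: average of the per-bootstrap corrected values
C_b_corrected = C_full_full - optimism
C_corrected_v2 = C_b_corrected.mean()

# Both forms give the same value (up to floating-point error)
assert np.isclose(C_corrected_v1, C_corrected_v2)
```

The two forms agree because $C_{full}^{full}$ does not depend on $b$, so it can be moved inside the average.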

In this formulation, we can think of the summand (the term within the summation) as the optimism-corrected concordance index based on only 1 particular bootstrap sample $b$ ($C_{b,corrected}$). Averaging this quantity across the $B$ bootstrap samples ($b = 1, \ldots, B$) gives $C_{corrected}$.
To compare $C_{corrected}$ between 2 competing models, we perform a statistical test comparing the distributions of $C_{b,corrected}$ ($b = 1, \ldots, B$) between the 2 models. We use the Wilcoxon signed-rank test (`scipy.stats.wilcoxon`) for this purpose, which pairs the two models' values by bootstrap index $b$. This is implemented in the code below:

------------------------------------------------------------------------------------------------------
Import required libraries:

In [None]:
import scipy
from scipy.stats import wilcoxon
import numpy as np
import pickle

For each model under comparison, read in bootstrap data and compute $C_{b,corrected}$ ($b = 1, \ldots, B$):

In [None]:
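def p_reader(pfile):
    # Load a pickled list: element 0 is the apparent C-index, element 1 the per-bootstrap optimism values
    with open(pfile, 'rb') as f:
        mlist = pickle.load(f)
    return mlist[0], mlist[1]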

C_app_model1, opts_model1 = p_reader('../data/modelCstats_DL.pkl')
C_app_model2, opts_model2 = p_reader('../data/modelCstats_conv.pkl')
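
The `.pkl` files are not described in this notebook. Judging from how `p_reader` indexes the loaded object and how the results are used afterwards, each file appears to hold a two-element list: the apparent concordance index followed by the per-bootstrap optimism values. The sketch below writes and reads back a toy file with that assumed layout. The helpers `write_cstats` and `read_cstats`, the file name, and the numbers are all hypothetical.

```python
import os
import pickle
import tempfile

def write_cstats(path, c_apparent, optimisms):
    """Hypothetical helper: store [apparent C-index, list of per-bootstrap optimisms]."""
    with open(path, 'wb') as f:
        pickle.dump([c_apparent, list(optimisms)], f)

def read_cstats(path):
    """Read the assumed two-element layout back, mirroring p_reader."""
    with open(path, 'rb') as f:
        mlist = pickle.load(f)
    return mlist[0], mlist[1]

# Write a toy file to a temporary directory and read it back
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo_cstats.pkl')
write_cstats(demo_path, 0.75, [0.05, 0.03, 0.07, 0.02])
c_app_demo, opts_demo = read_cstats(demo_path)
```

Note that `pickle.load` can execute arbitrary code, so it should only be used on files from a trusted source.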

In [None]:
# C_{b,corrected} for each bootstrap sample b: apparent C-index minus that sample's optimism
Cb_adjs_model1 = [C_app_model1 - o for o in opts_model1]
Cb_adjs_model2 = [C_app_model2 - o for o in opts_model2]

Perform the Wilcoxon signed-rank test and compute the p-value using the `wilcoxon` function from `scipy.stats`:

In [None]:
# wilcoxon returns a result object; the p-value itself is stored in its .pvalue attribute
pval = wilcoxon(Cb_adjs_model1, Cb_adjs_model2)
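
As a side note on the choice of test: `scipy.stats.wilcoxon` is the paired signed-rank test, whereas the rank-sum test for two independent samples is `scipy.stats.ranksums`. The paired version is appropriate only if entry $b$ of both lists comes from the same bootstrap sample. The sketch below runs both tests on synthetic values (the `demo_` arrays are random numbers generated for illustration, not study data):

```python
import numpy as np
from scipy.stats import wilcoxon, ranksums

# Synthetic per-bootstrap corrected C-indices for two hypothetical models
rng = np.random.default_rng(0)
B = 200
demo_model1 = 0.75 + 0.02 * rng.standard_normal(B)
demo_model2 = 0.70 + 0.02 * rng.standard_normal(B)

# Signed-rank test: works on the paired differences demo_model1[b] - demo_model2[b]
paired = wilcoxon(demo_model1, demo_model2)

# Rank-sum test: treats the two arrays as independent samples
unpaired = ranksums(demo_model1, demo_model2)

print('signed-rank p-value = {0}\nrank-sum p-value = {1}'.format(paired.pvalue, unpaired.pvalue))
```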

Print output:

In [None]:
print('Model 1 optimism-adjusted concordance index = {0:.4f}\nModel 2 optimism-adjusted concordance index = {1:.4f}\np-value = {2}'.format(np.mean(Cb_adjs_model1), np.mean(Cb_adjs_model2), pval.pvalue))