Diff of /README.md [000000] .. [1654c6]

Switch to unified view

a b/README.md
1
# Insights into the I-SPY clinical trial
2
#### by Julio Cardenas-Rodriguez (@jdatascientist)
3
4
![design](./images/0-study_design.png)
5
6
## 1. Description and Objectives
7
The goal of this project is to improve the prediction of clinical outcomes to neoadjuvant chemotherapy in patients with breast cancer. Currently, most patients with breast cancer undergo neoadjuvant chemotherapy, which is aimed at reducing the size of a tumor (burden) before surgery to remove the tumor or the entire breast.   
8
Some of the patients response completely to the therapy and the patient does not present any residual tumor at the time of surgery (Pathologic complete response or `PCR`). On the other hand, most patients have residual disease at the time of surgery and further treatment or surgery is required.
9
10
## 2. Data source
11
All data for the **222 patients** treated for breast cancer in the IPSY-1 clinical trial was obtained from the [cancer imaging archive](https://wiki.cancerimagingarchive.net/display/Public/ISPY1) and the Breast Imaging Research Program at UCSF. To facilitate the dissemination and reproducibility of this analysis, the raw data and all code were posted at [Data.World](https://data.world/julio/ispy-1-trial) and [Github](https://github.com/JCardenasRdz/Insights-into-the-I-SPY-clinical-trial) and are available under an MIT license.
12
13
## 3. Source code in Python and data analysis
14
The code is organized in a Python package (`ispy1`), with modules for each of the four steps of the data analysis
15
16
> - ispy1
17
>   - clean_data.py
18
>   - inferential_statistics.py
19
>   - predictive_statistics.py
20
>   - survival_analysis.py
21
22
## 4. Description of the data
23
> The data contained in the cancer imaging archive is organized column-wise for all subjects as follows (rows = patients).
24
25
**Clinical Outcomes**
26
1. Survival Status at the end of the study (`Survival`):
27
    - 7 = Alive
28
    - 8 = Dead
29
    - 9 = Lost to follow up
30
2. Length of Survival (`Survival_length`):
31
    - Days from study entry to death or last follow-up
32
3. Recurrence-free survival (`RFS`):
33
    - days from from NCAC start until progression or death
34
4. Recurrence-free survival indicator (`RFS_code`)
35
    - progression or death (1),
36
    - removed from survival curve (0)
37
5. Pathologic Complete Response (`PCR`) post-neoadjuvant ?:
38
    - 1 = Yes
39
    - 0 = No
40
    - Lost (Blank)
41
6. Residual Cancer Burden class (`RCB`):
42
    - 0 = RCB index (Class 0)
43
    - 1 = RCB index less than or equal to 1.36 (Class I)
44
    - 2 = RCB index greater than 1.36 or equal to 3.28  (Class II)
45
    - 3 = III, RCB index greater than 3.28 (Class III)
46
    - Blank = unavailable or no surgery
47
48
**Predictors of clinical outcomes**
49
1. `Age` (Years)
50
2. `Race`, encoded as:
51
    - 1 = Caucasian
52
    - 3 = African American
53
    - 4 = Asian
54
    - 5 = Native Hawaiian
55
    - 6 = American Indian
56
    - 50 = Multiple race
57
3. Estrogen Receptor Status (`ER+`) encoded as:
58
    - 1 = Positive
59
    - 0 = Negative
60
    - Blank = Indeterminate
61
4. Progesterone Receptor Status (`PR+`) encoded as:
62
    - 1 = Positive
63
    - 0 = Negative
64
    - Blank = Indeterminate
65
5. Hormone Receptor Status (`ER+`)
66
    - 1 = Positive
67
    - 0 = Negative
68
    - Blank = Indeterminate
69
6. Bilateral Breast Cancer (`Bilateral`):
70
    - 1 = Cancer Detected on both breasts
71
    - 0 = Cancer Detected in a single breast
72
7. Breast with major or single Tumor (`Laterality`):
73
    - 1 = Left breast
74
    - 2 = Right breast
75
8. Largest tumor dimension at Baseline estimated by MRI (`MRI_LD_Baseline`, continous variable)
76
9. Largest tumor dimension 1-3 days after NAC estimated by MRI (`MRI_LD_1_3dAC`, continous variable)
77
10. Largest tumor dimension between cycles of NAC estimated by MRI (`MRI_LD_Int_Reg`, continous variable)
78
11. Largest tumor dimension before surgery estimated by MRI (`MRI_LD_PreSurg`, continous variable)
79
80
## 5. Data cleaning and organizing
81
The data for this study was provided as an excel file (.xls) with multiple fields and is not suitable to construct the contingency tables required for inferential statistics or to peform predictive statistics using `sklearn` and `statsmodels`. The module `clean_data` of the `ipsy1` was used to clean the data and generate a pandas dataframe. The code for  `clean_data` module can be found [here](https://gist.github.com/JCardenasRdz/75dd152afe6250a5c7de2315b2a2a960).  
82
83
```Python
84
# load module by Julio and pandas
85
from ispy1 import clean_data
86
import pandas as pd
87
88
file = './data/I-SPY_1_All_Patient_Clinical_and_Outcome_Data.xlsx'
89
df = clean_data.clean_my_data(file)
90
df.head(2)
91
92
# save clean data in new  csv file
93
df.to_csv('./data/I-SPY_1_clean_data.csv')
94
df.head(2)
95
```
96
![df](./images/1.png)
97
98
## 6. Inferential Statistics
99
The objective of inferential statistics is to estimate information about populations and test if two (or more) populations are statistically the same. The analysis for this project is organized according to the type of predictors ( categorical or continous) and their effect on categorical outcomes.
100
- Load data
101
102
```Python
103
# standard modules
104
import seaborn as sns
105
import pandas as pd
106
import matplotlib.pyplot as plt
107
108
# module wrote by julio
109
from ispy1 import inferential_statistics
110
df = pd.read_csv('./data/I-SPY_1_clean_data.csv')
111
```
112
1. Inferential_statistics: Categorical vs Categorical (Chi-2 test)   
113
114
The first thing needed to perform this kind of analysis is to construct contingency tables to establish the frequency of observations for each category being studied for example:
115
116
```Python
117
>>> inferential_statistics.contingency_table('PCR', 'ER+',df)
118
ER+   Yes    No
119
PCR            
120
Yes  17.0  28.0
121
No   81.0  42.0
122
```
123
Now, we can perform the chi-2 test to the effect of multiple categorical predictors on `PCR`:
124
125
- Effect of categorical predictors on Pathological complete response (`PCR`)
126
```Python
127
>>> predictors = ['White', 'ER+', 'PR+', 'HR+','Right_Breast']
128
>>> outcome = 'PCR'
129
>>> inferential_statistics.categorical_data(outcome, predictors, df)
130
               p-value  Relative_Risk   RR_lb   RR_ub
131
White         0.833629         0.8878  0.5076  1.5528
132
ER+           0.001988         0.4337  0.2582  0.7285
133
PR+           0.000198         0.3219  0.1707  0.6069
134
HR+           0.000307         0.3831  0.2286  0.6422
135
Right_Breast  0.851883         1.0965  0.6649  1.8080
136
```
137
138
These results indicate that because `ER+`,`PR+`, and `HR+` show a p-value < 0.05 we reject the null hypothesis of indepedence and conclude that `PCR` is not independent from `ER+`,`PR+`, and `HR+`. Furthermore, the relative risk indicates that `ER+`,`PR+`, and `HR+` are associate with reduce probability of `PCR`, in other words, being positive for these markers reduce the chances of responding to the NAC.
139
140
- Effect of categorical predictors on Pathological complete response (`Alive`)
141
```Python
142
>>> outcome = 'Alive'
143
>>> inferential_statistics.categorical_data(outcome, predictors, df)
144
               p-value  Relative_Risk   RR_lb   RR_ub
145
White         0.439359         1.0935  0.9032  1.3239
146
ER+           0.001135         1.3095  1.1025  1.5554
147
PR+           0.162557         1.1266  0.9739  1.3031
148
HR+           0.038917         1.1950  1.0094  1.4148
149
Right_Breast  0.729139         0.9602  0.8287  1.1125
150
```
151
152
These results indicate that because `ER+`,and `HR+` have a mild effect on the chances of survival (p-value < 0.5), but they relative risk indicates that the effect is very close to 1.0, meaning that being `ER+` or `HER+` has little effect on survival according to the chi-2 test, a complete survival analysis is performed in section 3.0.   
153
154
155
2. Inferential_statistics: Continous vs Categorical (ANOVA)   
156
157
An analysis using continous predictors for a categorical outcome requires using analysis of variance (ANOVA). I implemented this technique in the `inferential_statistics` module of `isp1`.
158
159
- Effect of Age on PCR
160
161
```Python
162
>>> predictor= ['age']
163
>>> outcome = 'PCR'
164
>>> anova_table, OLS = inferential_statistics.linear_models(df, outcome, predictor);
165
---------------------------------------------
166
             sum_sq     df         F    PR(>F)
167
age        0.256505    1.0  1.302539  0.255394
168
Residual  32.689923  166.0       NaN       NaN
169
---------------------------------------------
170
```
171
Age clearly does not have an effect (is associated) with PCR. The effect so small that we can even conclude this just by looking at a grouped histogram:  
172
173
```Python
174
>>> sns.boxplot(x= outcome, y=predictor[0], data=df, palette="Set3");
175
>>> plt.show()
176
```
177
![anova_age_pcr](./images/2_anova_age_pcr.png)
178
179
180
- Effect of `Age` on survival (`Alive`)
181
182
The ANOVA for this interaction indicates that `Age` has an effect on survival (`Alive`). It technically would bot be significant at the 95% confidence level (p-value = 0.06), but it would at the 94% confidence level.
183
184
```Python
185
>>> predictor= ['age']
186
>>> outcome = 'Alive'
187
>>> anova_table, OLS = inferential_statistics.linear_models(df, outcome, predictor);
188
---------------------------------------------
189
             sum_sq     df         F    PR(>F)
190
age        0.062227    1.0  0.399719  0.528104
191
Residual  25.842534  166.0       NaN       NaN
192
---------------------------------------------
193
```
194
195
A simple boxplot shows that older patients are less likely to be `Alive` by the end of this study.
196
197
```Python
198
>>> sns.boxplot(x= outcome, y=predictor[0], data=df, palette="Set3");
199
<matplotlib.axes._subplots.AxesSubplot object at 0x111aff080>
200
>>> plt.show()
201
```
202
203
![anova_age_alive](./images/3_anova_age_alive.png)
204
205
- Explore interactions between age, survival, and PCR
206
207
A very interesting fact about NAC and `PCR`, is that not all patients who achieve `PCR` survive until the end of the study. As you can see below, 4 out of 41 patients who achieved `PCR` did not survive until the end of the study, while 95 / 123 who patients who did NOT achieve  `PCR` still lived until the end of the study.
208
209
```Python
210
>>> inferential_statistics.contingency_table('PCR', 'Alive',df)
211
Alive   Yes    No
212
PCR              
213
Yes    41.0   4.0
214
No     95.0  28.0
215
```
216
Thus, there must be other factors (covariates) that can account for this difference. We can explore the effect of `Age` first by creating a histogram that splits the groups in four according to `PCR` = Yes / No and `Alive` = Yes / NO.
217
```Python
218
# create a boxplot to visualize this interaction
219
>>> ax = sns.boxplot(x= 'PCR', y='age', hue ='Alive',data=df, palette="Set3");
220
>>> ax.set_title('Interactions between age, survival, and PCR');
221
>>> plt.show()
222
```
223
![int](./images/4_interactions_pcr_alive_age_v2.png)
224
225
It is evident from the boxplots that `Age` has an effect on on survival and it is affected by the `PCR` status. For example, younger patients with `PCR` = Yes seem more likely to be alive by the end of the study. We can perform ANOVA only for patients for whom `PCR` = Yes. The table below shows that the p-value is < 0.01, which means that we are confident at the 99% level that age has an effect on survival for those patients with `PCR` =Yes.
226
227
```Python
228
229
# create dataframe only for patients with PCR = Yes
230
>>> df_by_PCR = df.loc[df.PCR=='Yes',:]
231
232
# Anova age vs Alive
233
>>> predictor= ['age']
234
>>> outcome = 'Alive'
235
>>> anova_table, OLS = inferential_statistics.linear_models(df_by_PCR, outcome, predictor);
236
---------------------------------------------
237
            sum_sq    df         F    PR(>F)
238
age       0.539468   1.0  7.470952  0.009065
239
Residual  3.104976  43.0       NaN       NaN
240
---------------------------------------------
241
```
242
243
The same analysis can be repeated for patients with `PCR` = No. Which results in a p-value of ~ 0.060, which is not statistically significant at the 5% confidence level but is fairly close. In other words, `age`, `PCR`, and `Alive` interact very strongly. The effect of these interactions will be quantified in the predictive statistics section using logistic regression.
244
245
```Python
246
247
# create dataframe only for patients with PCR = Yes
248
>>> df_by_PCR = df.loc[df.PCR=='No',:]
249
250
# Anova age vs Alive
251
>>> predictor= ['age']
252
>>> outcome = 'Alive'
253
>>> anova_table, OLS = inferential_statistics.linear_models(df_by_PCR, outcome, predictor);
254
---------------------------------------------
255
             sum_sq     df         F    PR(>F)
256
age        0.637369    1.0  3.674443  0.057611
257
Residual  20.988648  121.0       NaN       NaN
258
---------------------------------------------
259
```
260
261
-  Effect of MRI measurements on `PCR` : ANOVA
262
As part of this study, the largest tumor dimension (`LD`) was measured for all patients at for different time points:
263
1. `MRI_LD_Baseline`: Before the first NAC regime is started.
264
2. `MRI_LD_1_3dAC`: 1-3 days after starting the first NAC regime.
265
3. `MRI_LD_Int_Reg`: Between the end of the first regime and the start of the second regime.
266
4. `MRI_LD_PreSurg`: Before surgery.
267
268
The `inferential_statistics` module contains a function to perform ANOVA between each one of these MRI measurements and a particular outcome. The code and results for `PCR` are:
269
270
```Python
271
outcome = 'PCR'
272
R = inferential_statistics.anova_MRI(outcome, df);
273
```
274
![5_anova_mri.png](./images/5_anova_mri_pcr.png)
275
276
which indicate that all low MRI measurements with the exception of `MRI_LD_Baseline` are statistically associated with `PCR`. However, an statistically significant result is not always clinically relevant, for that we need to look at the [effect size](https://en.wikipedia.org/wiki/Effect_size) (ES). The ES is defined as the ratio of the ratio of the mean for each group divide by the standard deviation of the entire data. As it can be seen below, the effect size for MRI measurements are small:
277
278
```Python
279
>>> mri_features = ['MRI_LD_Baseline', 'MRI_LD_1_3dAC', 'MRI_LD_Int_Reg', 'MRI_LD_PreSurg']
280
>>> outcome = 'Alive'
281
# Effect Size
282
>>> inferential_statistics.effect_size( df, mri_features, outcome)
283
Effect Size
284
Predictor of Alive             
285
MRI_LD_Baseline        0.375046
286
MRI_LD_1_3dAC          0.357002
287
MRI_LD_Int_Reg         0.678682
288
MRI_LD_PreSurg         0.469548
289
```
290
291
-  Effect of MRI measurements on `Alive`: ANOVA
292
293
```Python
294
outcome = 'PCR'
295
R = inferential_statistics.anova_MRI(outcome, df);
296
```
297
298
![6_anova_mri_alive](./images/6_anova_mri_alive.png)
299
300
These results indicate that all all MRI measurements are statistically associated with survival (`Alive`), but it is also good practice to calculate the effect size to estimate how big the differences are between patients who survived and those who did not.
301
302
```Python
303
>>> mri_features = ['MRI_LD_Baseline', 'MRI_LD_1_3dAC', 'MRI_LD_Int_Reg', 'MRI_LD_PreSurg']
304
>>> outcome = 'Alive'
305
306
# Effect Size
307
>>> inferential_statistics.effect_size( df, mri_features, outcome)
308
309
Effect Size
310
Predictor of Alive             
311
MRI_LD_Baseline        0.375046
312
MRI_LD_1_3dAC          0.357002
313
MRI_LD_Int_Reg         0.678682
314
MRI_LD_PreSurg         0.469548
315
```
316
Finally, it is import to consider that only about 25% of all patients achieved `PCR` but even 56% did not achieve `PCR` they lived for the entire duration of the study (code below). Furthermore, these results do not indicate how long a patient will live on average (survival), not can be used to predict which patients will survive for the duration of the study (predictive). These two limitations will be addressed in the survival analysis and predictive statistics sections.
317
318
```Python
319
>>> f = lambda x:   100 * (  x /df.shape[0] )
320
>>> df['dummy'] = 1;
321
>>> df.groupby(['PCR','Alive']).count()['dummy'].apply(f)
322
323
PCR  Alive
324
No   No       16.666667
325
     Yes      56.547619
326
Yes  No        2.380952
327
     Yes      24.404762
328
```
329
## 6. Predictive Statistics
330
The objective of predictive_statistics is not make inferences about a patient population but making predictions for individual patients. Which at the end is what we want if medicine is going to personalized. In terms of the machine learning nomenclature, predictive is statistics is a supervised learning method, because we know features and outcomes for a particular collection of observations. Mainly, the observed outcomes are: 1) categorical, where we need to predictive if a particular patient belong to one or more groups (Alive  vs Not Alive, Cure vs not Cured, etc.), 2) continous, where we want to predictive the value of a continous variable for a patient. The analysis presented here is divided in the same manner. A toolbox to perform this analysis can be found at the Github [repository](https://github.com/JCardenasRdz/Insights-into-the-I-SPY-clinical-trial/blob/master/ispy1/predictive_statistics.py).
331
332
### 6. 1 Prediction of categorical outcomes
333
**Pathological Complete Response (`PCR`)**   
334
Unbalanced samples are one of the main challenges in supervised classification learning. An unbalance sample when the number of observations in class of `N` possible outcomes makes the majority of the observations. Unfortunately, this imbalance is very common in health care, thus we need to confirm how unbalanced our sample for `PCR` is:
335
336
```Python
337
# modules
338
import pandas as pd
339
import matplotlib.pyplot as plt
340
# data
341
df = pd.read_csv('./data/I-SPY_1_clean_data.csv')
342
# plot
343
# check how unbalanced the data are
344
outcome = 'PCR'
345
df[outcome].value_counts(normalize = True).plot.barh();
346
plt.title('Normalized Sample size for PCR')
347
plt.xlabel('group size (%)');
348
plt.show();
349
```
350
![PCR balance](./images/Sample_Size_PCR.png)
351
352
We can see that 70% of the patients did not achieve `PCR`, is almost 2X more unbalanced than we want (50% split is ideal). A way to deal with this imbalance is to oversample the class with the fewest observations. This oversampling can be performed using the [imbalanced-learn python library](http://contrib.scikit-learn.org/imbalanced-learn/install.html), but first we need to allocate predictors and outcomes to match the format needed by _scikit-learn_:
353
```Python
354
# package by Julio
355
from ispy1 import predictive_statistics
356
357
# allocate continous predictors
358
cont_predictors = ['age','MRI_LD_Baseline', 'MRI_LD_1_3dAC', 'MRI_LD_Int_Reg', 'MRI_LD_PreSurg']
359
X_cont = df[cont_predictors].values
360
361
# allocate clinical predictors
362
cat_predictors = ['White', 'ER+', 'PR+', 'HR+'];
363
X_cat = pd.pandas.get_dummies(df[cat_predictors], drop_first=True).values
364
365
# allocate a single predictors matrix X
366
X = np.concatenate( (X_cont, X_cat), axis=1)
367
368
# allocate  outcome
369
outcome = 'PCR'
370
y = predictive_statistics.labels_to_numbers(df, outcome);    
371
```
372
Now, we can perform `Logistic Regression` on the combined categorical and continous predictors. I used the `GridSearchCV` module to find the logistic regression parameters that maximize `Cohen's Kappa` (which is a good metric for unbalanced samples).
373
374
**why kappa is a good metric for classifiers**   
375
A possible scenario is to have a classifier that predicts that 44 / 45 `PCR` belong to the class with the largest number of cases (`PCR` = No). As you can see in the following code, `AUC` and `accuracy` are very misleading metrics to measure the performance of a classifier, because
376
377
```Python
378
from sklearn import metrics
379
def mymetrics(ypredicted, yexpected):
380
    print(metrics.classification_report(ypredicted, yexpected))
381
    k = metrics.cohen_kappa_score(ypredicted, yexpected); k = np.round(k,3);
382
    auc = metrics.roc_auc_score(ypredicted, yexpected);   auc  = np.round(auc,3);
383
    accuracy = metrics.accuracy_score(ypredicted, yexpected); accuracy = np.round(accuracy,3);
384
385
    print("Kappa = " + str(k))
386
    print("AUC = " + str(auc))
387
    print("Accuracy = " + str(accuracy))
388
389
# make at least one observation positive
390
>>> y_crazy = np.zeros_like(y)
391
>>> y_crazy[np.argwhere(y>0)[0]] = 1
392
>>> mymetrics(y_crazy, y)
393
394
precision    recall  f1-score   support
395
396
         0       1.00      0.74      0.85       167
397
         1       0.02      1.00      0.04         1
398
399
avg / total       0.99      0.74      0.84       168
400
401
Kappa = 0.032
402
AUC = 0.868
403
Accuracy = 0.738
404
```
405
406
The following code snippet shows the `Logistic_Regression` function inside `ispy1.predictive_statistics`.
407
408
```Python
409
def Logistic_Regression(Xdata, Ydata, oversample = False, K_neighbors = 4):
410
    '''
411
    Perform Logistic Regression optimizing C, penalty, and fit_intercept to maximize
412
    Cohen kappa (min  = -1, max = 1.0)
413
    '''
414
415
    # split data
416
    X_train, X_test, y_train, y_test = split_data(Xdata,Ydata, oversample, K_neighbors)
417
418
    # train and tune parameters using GridSearchCV
419
    pars= dict(   C = np.arange(.01,100,.1),
420
                  penalty = ['l2', 'l1'],
421
                  fit_intercept = [True,False])
422
423
    grid =  GridSearchCV(  linear_model.LogisticRegression(), param_grid = pars,
424
                           scoring = metrics.make_scorer(metrics.cohen_kappa_score),
425
                           cv= 5, verbose = 0, n_jobs = -1)
426
427
    # fit
428
    grid.fit(X_train,y_train)
429
430
    # metrics
431
    auc, kappa, fpr, tpr = binary_classifier_metrics(grid, X_train, y_train, X_test, y_test)
432
433
    # output
434
    return auc, kappa, fpr, tpr
435
```
436
Running `Logistic Regression` without oversampling yields the following results for `PCR`:
437
438
```Python
439
>>> auc1, kappa1, fpr1, tpr1 = predictive_statistics.Logistic_Regression(X, y, oversample = False)
440
precision    recall  f1-score   support
441
442
  0       0.77      0.87      0.82        39
443
  1       0.29      0.17      0.21        12
444
445
avg / total       0.66      0.71      0.68        51
446
447
The estimated Cohen kappa is 0.0449438202247
448
The estimated AUC is 0.519
449
============================================================
450
```
451
Because `AUC` is close to 0.50 and `kappa` is close to 0.0, we can say that these results indicate that the observed performance of  logistic regression is very close to what we would expect by just flipping a coin (50% chance).
452
On the other hand, running logistic regression with oversampling yields a better result:
453
454
```Python
455
>>> auc2, kappa2, fpr2, tpr2 = predictive_statistics.Logistic_Regression(X, y, oversample = True, K_neighbors = 4)
456
Data was oversampled using the ADASYN method
457
             precision    recall  f1-score   support
458
459
          0       0.82      0.72      0.77        39
460
          1       0.35      0.50      0.41        12
461
462
avg / total       0.71      0.67      0.68        51
463
464
The estimated Cohen kappa is 0.190476190476
465
The estimated AUC is 0.609
466
============================================================
467
```
468
The details for the implementation of oversampling by ADASYN can be found in the `split_data` function of `ispy1.predictive_statistics`. Next, we can compare the ROC curves for the oversampled and and standard logistic regression of `PCR`:
469
470
```Python
471
title ='Effect of oversampling on Logistic Regression for PCR'
472
predictive_statistics.plot_compare_roc(fpr1, tpr1,fpr2, tpr2, auc1, auc2, title = title)
473
```
474
![pcr_logistic_compare](./images/PCR_Logistic.png)
475
476
A random forest classifier was also implemented with and without oversampling. Please see `predictive_statistics.RandomForest_Classifier` for further details.
477
```Python
478
>>> auc1, kappa1, fpr1, tpr1, forest = predictive_statistics.RandomForest_Classifier(X, y)
479
precision    recall  f1-score   support
480
481
         0       0.80      0.92      0.86        39
482
         1       0.50      0.25      0.33        12
483
484
avg / total       0.73      0.76      0.73        51
485
486
The estimated Cohen kappa is 0.209302325581
487
The estimated AUC is 0.587
488
============================================================
489
490
>>> auc2, kappa2, fpr2, tpr2, Forest = predictive_statistics.RandomForest_Classifier(X, y, oversample=True, K_neighbors = 2)
491
492
Data was oversampled using the ADASYN method
493
             precision    recall  f1-score   support
494
495
          0       0.81      0.90      0.85        39
496
          1       0.50      0.33      0.40        12
497
498
avg / total       0.74      0.76      0.75        51
499
500
The estimated Cohen kappa is 0.260869565217
501
The estimated AUC is 0.615
502
============================================================
503
```
504
These results indicate that oversampling has a very small benefit for random forest classifiers.
505
506
507
**Survival (`Alive`)**   
508
A similar approach can be used for survival, and we can see that `Alive` is even more unbalanced than `PCR`. With about 80% of the subjects surviving until the end of the study.
509
510
```Python
511
# allocate  outcome
512
>>> outcome = 'Alive'
513
>>> y = predictive_statistics.labels_to_numbers(df, outcome);
514
515
# check how unbalanced the data are
516
>>> df[outcome].value_counts(normalize = True).plot.barh();
517
>>> plt.title('Normalized Sample size for Alive')
518
>>> plt.xlabel('group size (%)')
519
```
520
![alive_sample](./images/Sample_Size_Alive.png)
521
522
As before, we can run use logistic regression and random forest to predict survival [`Alive`].
523
524
```Python
525
# standard Logistic Regression
526
>>> auc1, kappa1, fpr1, tpr1 = predictive_statistics.Logistic_Regression(X, y)
527
528
precision    recall  f1-score   support
529
530
         0       0.50      0.27      0.35        11
531
         1       0.82      0.93      0.87        40
532
533
avg / total       0.75      0.78      0.76        51
534
535
The estimated Cohen kappa is 0.236734693878
536
The estimated AUC is 0.599
537
============================================================
538
539
# unbalanced learning
540
>>> auc2, kappa2, fpr2, tpr2 = predictive_statistics.Logistic_Regression(X, y, oversample=True, K_neighbors = 4)
541
542
Data was oversampled using the ADASYN method
543
             precision    recall  f1-score   support
544
545
          0       0.36      0.45      0.40        11
546
          1       0.84      0.78      0.81        40
547
548
avg / total       0.73      0.71      0.72        51
549
550
The estimated Cohen kappa is 0.208893485005
551
The estimated AUC is 0.615
552
============================================================
553
554
>>> title ='Effect of oversampling on Logistic Regression for Alive'
555
>>> predictive_statistics.plot_compare_roc(fpr1, tpr1,fpr2, tpr2, auc1, auc2, title = title)
556
```
557
![alive_log](./images/Alive_Logistic.png)
558
559
while the code and results for a random forest classifier are:
560
561
```Python
562
>>> auc1, kappa1, fpr1, tpr1, _= predictive_statistics.RandomForest_Classifier(X, y)
563
precision    recall  f1-score   support
564
565
          0       0.38      0.27      0.32        11
566
          1       0.81      0.88      0.84        40
567
568
avg / total       0.72      0.75      0.73        51
569
570
The estimated Cohen kappa is 0.16393442623
571
The estimated AUC is 0.574
572
============================================================
573
574
>>> auc2, kappa2, fpr2, tpr2, _= predictive_statistics.RandomForest_Classifier(X, y, oversample=True, K_neighbors = 6)
575
Data was oversampled using the ADASYN method
576
             precision    recall  f1-score   support
577
578
          0       0.67      0.36      0.47        11
579
          1       0.84      0.95      0.89        40
580
581
avg / total       0.81      0.82      0.80        51
582
583
The estimated Cohen kappa is 0.375510204082
584
The estimated AUC is 0.657
585
============================================================
586
587
>>> title ='Effect of oversampling on RFC for PCR'
588
>>> predictive_statistics.plot_compare_roc(fpr1, tpr1,fpr2, tpr2, auc1, auc2, title = title)
589
```
590
![alive_log](./images/Alive_RFC.png)
591
592
oversampling does not have an effect on the average results and causes an important increment in the precision, recall, and for the negative group, and improves AUC and kappa.   
593
594
**Survival (`Alive`) including `PCR` as predictor**
595
The most important clinical outcome in cancer is how long a patient lives after being treated, or if he/she is alive at the end of a study, with that in mind I included `PCR` as a predictor of survival. This makes sense given the study design.
596
597
- Logistic Regression
598
```Python
599
# allocate new predictor variable
600
>>> pcr = predictive_statistics.labels_to_numbers(df, 'PCR').reshape(168,1)
601
>>> newX = np.concatenate((X,pcr), axis  = 1)
602
# standard LG
603
>>> auc1, kappa1, fpr1, tpr1 = predictive_statistics.Logistic_Regression(newX, y)
604
605
precision    recall  f1-score   support
606
607
         0       0.43      0.27      0.33        11
608
         1       0.82      0.90      0.86        40
609
610
avg / total       0.73      0.76      0.74        51
611
612
The estimated Cohen kappa is 0.198952879581
613
The estimated AUC is 0.586
614
============================================================
615
616
>>> # unbalanced learning
617
>>> auc2, kappa2, fpr2, tpr2 = predictive_statistics.Logistic_Regression(newX, y, oversample=True, K_neighbors = 10)
618
619
Data was oversampled using the ADASYN method
620
             precision    recall  f1-score   support
621
622
          0       0.38      0.45      0.42        11
623
          1       0.84      0.80      0.82        40
624
625
avg / total       0.74      0.73      0.73        51
626
627
The estimated Cohen kappa is 0.238805970149
628
The estimated AUC is 0.627
629
============================================================
630
```
631
- RandomForest_Classifier
632
```Python
633
>>>  auc1, kappa1, fpr1, tpr1 , _ = predictive_statistics.RandomForest_Classifier(newX, y)
634
635
precision    recall  f1-score   support
636
637
          0       0.33      0.09      0.14        11
638
          1       0.79      0.95      0.86        40
639
640
avg / total       0.69      0.76      0.71        51
641
642
The estimated Cohen kappa is 0.0555555555556
643
The estimated AUC is 0.52
644
============================================================
645
646
>>> auc2, kappa2, fpr2, tpr2, _ = predictive_statistics.RandomForest_Classifier(newX, y, oversample=True, K_neighbors = 10)
647
648
Data was oversampled using the ADASYN method
649
             precision    recall  f1-score   support
650
651
          0       0.38      0.45      0.42        11
652
          1       0.84      0.80      0.82        40
653
654
avg / total       0.74      0.73      0.73        51
655
656
The estimated Cohen kappa is 0.238805970149
657
The estimated AUC is 0.627
658
============================================================
659
```
660
661
As it can be seen above, including `PCR` only does not provide a significant improvement for Logistic nor RandomForest Classifier. Thus, if we to include information about surgery, it might be better to include the residual cancer burden `RCB`.
662
663
**Survival (`Alive`) including `RCB` as a predictor**
664
First, we need to create a new matrix of predictors with `RCB` append at the end:
665
666
```Python
667
>> rcb = pd.get_dummies(df['RCB']).values
668
>>> newX = np.concatenate((X,rcb), axis  = 1)
669
```
670
next, we can perform logistic regression and random forest classification without oversampling.
671
672
```Python
673
>>> auc1, kappa1, fpr1, tpr1 = predictive_statistics.Logistic_Regression(newX, y)
674
675
precision    recall  f1-score   support
676
677
          0       0.50      0.36      0.42        11
678
          1       0.84      0.90      0.87        40
679
680
avg / total       0.76      0.78      0.77        51
681
682
The estimated Cohen kappa is 0.292559899117
683
The estimated AUC is 0.632
684
============================================================
685
686
>>> auc2, kappa2, fpr2, tpr2, _= predictive_statistics.RandomForest_Classifier(newX, y)
687
precision    recall  f1-score   support
688
689
          0       0.50      0.45      0.48        11
690
          1       0.85      0.88      0.86        40
691
692
avg / total       0.78      0.78      0.78        51
693
694
The estimated Cohen kappa is 0.340775558167
695
The estimated AUC is 0.665
696
============================================================
697
```
698
The random forest classifier performs slightly better than logistic regression but not much, their ROC curves compare as follows:
699
```Python
700
# compare
701
>>> title ='Predicting Survival including RCB as a predictor'
702
>>> predictive_statistics.plot_compare_roc(fpr1, tpr1,fpr2, tpr2, auc1, auc2, title = title)
703
>>> plt.legend(['Logistic Regression','Random Forest Classifier']);
704
```
705
![compare_rcb_log_vs_rfc](./images/LG_vs_RFC_Alive_RCB.png)
706
707
What we can do next is to test the effect of oversampling using ADASYN:
708
709
```Python
710
>>> auc3, kappa3, fpr3, tpr3 = predictive_statistics.Logistic_Regression(newX, y, oversample=True, K_neighbors = 4)
711
712
Data was oversampled using the ADASYN method
713
             precision    recall  f1-score   support
714
715
          0       0.33      0.45      0.38        11
716
          1       0.83      0.75      0.79        40
717
718
avg / total       0.73      0.69      0.70        51
719
720
The estimated Cohen kappa is 0.180722891566
721
The estimated AUC is 0.602
722
============================================================
723
724
>>> auc4, kappa4, fpr4, tpr4, _= predictive_statistics.RandomForest_Classifier(newX, y, oversample=True, K_neighbors = 4)
725
Data was oversampled using the ADASYN method
726
             precision    recall  f1-score   support
727
728
          0       0.30      0.27      0.29        11
729
          1       0.80      0.82      0.81        40
730
731
avg / total       0.70      0.71      0.70        51
732
733
The estimated Cohen kappa is 0.101057579318
734
The estimated AUC is 0.549
735
============================================================
736
```
737
738
As we observed for other cases, including oversampling using ADASYN does not improve things much or in this case, makes it a little worse, suggesting that this approach does not generalize properly.    
739
740
### 6. 2 Prediction of continous outcomes
741
The analysis of survival in the previous section was aimed at predicting if a patient will live to the end of the study, but that does not tell us how long they lived (which is what matters), nor how long the patient was relieved from cancer. These two questions are tackled in the following sections. But first, we need to organize the data to perform the analysis.
742
743
```Python
744
>>> cont_predictors = ['age','MRI_LD_Baseline', 'MRI_LD_1_3dAC', 'MRI_LD_Int_Reg', 'MRI_LD_PreSurg']
745
>>> contX = df[cont_predictors].values
746
>>> cat_pred = ['PCR','White', 'ER+', 'PR+', 'HR+'];
747
>>> catX = pd.pandas.get_dummies(df[cat_pred], drop_first=True).values
748
>>> X = np.concatenate( (catX, contX), axis=1)
749
```
750
751
**Recurrence-Free Survival (`RFS`, Continous in months)**   
752
Recurrence-Free Survival (`RFS`) is the length of time after primary treatment for a cancer ends that the patient survives without any signs or symptoms of that cancer. The following regression methods will be used to construct a model to predict `RFS` after splitting the data in train and test:
753
754
1. Ordinary Linear Least square fitting
755
2. ElasticNet
756
3. SVM regression
757
4. Random Forest regression
758
759
as we did in the previous section, all predictors were fine tuned using `RandomizedSearchCV` for regressor 2 to 4, but oversampling is not needed for regression. The code an outputs are shown next
760
761
```Python
762
>>> y = df.RFS.values / 30; # conver to months
763
>>> X_train, X_test, y_train, y_test = predictive_statistics.split_data(X, y, False)
764
```
765
766
- Ordinary Linear Least square fitting
767
768
```Python
769
>>> predictive_statistics.lsq(X_train, y_train, X_test, y_test, outcome =' Recurrence-Free Survival (months)')
770
OLS Regression Results                            
771
==============================================================================
772
Dep. Variable:                      y   R-squared:                       0.870
773
Model:                            OLS   Adj. R-squared:                  0.858
774
Method:                 Least Squares   F-statistic:                     71.68
775
Date:                Tue, 20 Jun 2017   Prob (F-statistic):           8.95e-43
776
Time:                        08:39:03   Log-Likelihood:                -502.27
777
No. Observations:                 117   AIC:                             1025.
778
Df Residuals:                     107   BIC:                             1052.
779
Df Model:                          10                                         
780
Covariance Type:            nonrobust                                         
781
==============================================================================
782
783
================================================================================
784
The median absolute error for testing data set of  Recurrence-Free Survival (months) is: 14.354
785
================================================================================
786
```
787
788
- ElasticNet
789
790
```Python
791
>>> predictive_statistics.ElasticNet(X_train, y_train, X_test, y_test, outcome =' Recurrence-Free Survival (months)')
792
================================================================================
793
The median absolute error for testing data set of  RFS (months) is: 12.737
794
================================================================================
795
```
796
797
- SVM Regression
798
799
```Python
800
>>> predictive_statistics.svr(X_train, y_train, X_test, y_test, outcome =' RFS(months)')
801
802
================================================================================
803
The median absolute error for testing data set of  RFS(months) is: 12.288
804
================================================================================
805
```
806
807
808
809
- Random Forest Regressor
810
811
```Python
812
>> predictive_statistics.RandomForestRegressor(X_train, y_train, X_test, y_test, outcome =' RFS(months)')
813
================================================================================
814
The median absolute error for testing data set of  RFS(months) is: 14.88
815
================================================================================
816
```
817
818
819
**`Survival Length` (continous in months)**   
820
The same approach used for `RFS` can be used to construct a model to predict how long the patient will leave after the complete
821
cycle of therapy has been completed. The results are summarized in the table below.
822
823
| Method            | Mean Absolute Error Survival |
824
|-------------------|------------------------------|
825
| Linear Regression | 15.82                        |
826
| SVR               | 102.82                       |
827
| ElasticNet        | 9.136                        |
828
| RandomForest      | 9.254                        |
829
830
As before, linear regression performs really well despite its simplicity, and its error is ~56% than more complex methods such as `ElasticNet` and `RandomForest`. `ElasticNet` is a linear model that combines the L1 and L2 penalties of the lasso and ridge methods.