---
title: "Penalized logistic regression on time-to-event data using casebase sampling"
author: "Jesse Islam"
date: "2/27/2020"
mathspec: true
theme: "metropolis"
output:
  beamer_presentation:
    incremental: FALSE
    fig_caption: false
header-includes:
  \usepackage{xcolor}
---

```{r setup, include=FALSE}
source("packages.R")
set.seed(1)
knitr::opts_chunk$set(warning = FALSE, message = FALSE, echo = FALSE)
knitr::opts_chunk$set(dev = 'pdf')
```

# Popular methods in time-to-event analysis

* In disease etiology, we tend to rely on the proportional hazards assumption.
    * Cox regression

* When we want the absolute risk:
    * Breslow estimator
    * Parametric models

# Motivations for a new method

* Hanley and Miettinen found that survival analysis rarely produces prognostic functions, even though the software for Cox regression is widely available. [2]
* They believe the stepwise nature of the usual analysis is the reason, as it reduces interpretability. [2]
* They propose a streamlined approach for reaching a **smooth absolute risk** curve. [2]

# Dr. Cox's perspective

![](coxquote.png)

# Itinerary

1. SUPPORT study
2. Casebase sampling
3. Penalized logistic regression on survival data
4. Maximum likelihood with regularization
5. Absolute risk comparison
6. Conclusion
7. Future work
8. References

# SUPPORT dataset [4]

* **Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments**
* Design: prospective cohort study.
* Setting: 5 academic care centers in the United States.
* Participants: 9105 hospitalized adults.
* Follow-up time: 5.56 years.
* 68% incidence rate.

# SUPPORT manual imputation [4]

* Notorious for missing data

\begin{table}[]
\centering
\resizebox{\textwidth}{!}{%
\begin{tabular}{ll}
\hline
Baseline Variable      & Normal Fill-in Value \\ \hline
Bilirubin              & 1.01                 \\
BUN                    & 6.51                 \\
Creatinine             & 1.01                 \\
PaO2/FiO2 ratio (pafi) & 333.3                \\
Serum albumin          & 3.5                  \\
Urine output           & 2502                 \\
White blood count      & 9 (thousands)        \\ \hline
\end{tabular}%
}
\caption{Suggested imputation values. {[}3{]}}
\label{tab:my-table}
\end{table}

# SUPPORT automated imputation

* `mice` imputation package (R)

1. PMM (predictive mean matching) – numeric variables
2. logreg (logistic regression) – binary variables
3. polyreg (Bayesian polytomous regression) – factor variables

# Removed variables [4]

* **Response variables**
  * Hospital charges.
  * Patient ratio of costs to charges.
  * Patient micro-costs.
* **Ordinal covariates**
  * Functional disability.
  * Income.
* **Sparse covariate**
  * Surrogate activities of daily living.
* Previous model results. (6)

# Variable overview [4]

* **Response variables**
  * Follow-up time, death.

* **Covariates**
  * Age, sex, race, education. (6)
  * Disease group/class, comorbidities. (3)
  * Coma score, Therapeutic Intervention Scoring System. (2)
  * Physiological variables. (11)
  * Activities of daily living. (2)

# Original SUPPORT analysis [4]

* Developed the SUPPORT prognostic model on Phase I patients.
* Tested it on Phase II patients.
* Both on a 180-day time scale.

# Original SUPPORT analysis [4]

The SUPPORT physiology score (SPS) was developed.
```{r, out.width = "80%",include=TRUE}
knitr::include_graphics("SPS.png")
```

# SUPPORT: my question

* How does their SPS perform over 5.56 years?
* How does the APACHE III physiology score (APS) perform over 5.56 years?
* How does a model with all the covariates, excluding SPS and APS, perform?

# Analysis Process

1. Impute.
2. Compare SPS and APS over ~5.56 years using absolute risk.
3. Compare to the Kaplan-Meier curve (unadjusted model).
4. Compare to the full model (excluding SPS and APS).

* All models are trained on 80% of the observations.
* The remaining observations are used to generate comparative absolute risk curves.
  * The absolute risk curve for each individual is averaged.
  * The averaged curve is expected to approach the Kaplan-Meier curve.

```{r data processing,include=FALSE}
# Data retrieved from http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/support2csv.zip
ew <- read_csv("support2.csv")
data <- read.csv('support2.csv', col.names = colnames(ew), fill = TRUE, header = TRUE)
rm(ew)
```

```{r cleaningImputation,include=FALSE}
# manual imputation
data$alb[is.na(data$alb)] <- 3.5
data$pafi[is.na(data$pafi)] <- 333.3
data$bili[is.na(data$bili)] <- 1.01
data$crea[is.na(data$crea)] <- 1.01
data$bun[is.na(data$bun)] <- 6.51
data$wblc[is.na(data$wblc)] <- 9
data$urine[is.na(data$urine)] <- 2502

# automated imputation
previousModelPositions <- c(18,19,20,21,22,23,24,25,26,27,28,29)

supportMain <- data[,-previousModelPositions]
supportPrevious <- data[,previousModelPositions]
# keep columns 1,2 after imputation from previous usable models.

#missingPattern<-md.pattern(supportMain)
# impute while removing ordinal variables and other response variables.
supportMainImpute <- mice(supportMain[,-c(11,34,13,14,15)], m = 1, printFlag = FALSE)
imputedData <- cbind(complete(supportMainImpute,1), supportPrevious[,c(1,2)])
# adls is missing 33% of its values; it and hospital death are removed accordingly.
#md.pattern(completeData)
completeData <- na.omit(imputedData[,-c(4,29)])
workingCompleteData <- model.matrix(death ~ . - d.time, data = completeData)[,-c(1)] # remove intercept
newData <- workingCompleteData
x <- workingCompleteData
# x and y will be used to fit the casebase model under glmnet.
#x=as.matrix(sparse.model.matrix(death~ .-d.time+0,data=workingData))
y <- data.matrix(completeData[,c(2,5)])
```

```{r CoxExplorationModels,include=FALSE, eval = TRUE}
fullDataCox <- as.data.frame(cbind(as.numeric(y[,2]), as.numeric(y[,1]), x))
train_index <- sample(1:nrow(fullDataCox), 0.8 * nrow(fullDataCox))
test_index <- setdiff(1:nrow(fullDataCox), train_index)

train <- fullDataCox[train_index,]
test <- fullDataCox[test_index,]
coxSPS <- survival::coxph(Surv(time = V1, event = V2) ~ sps, data = train)
abcoxSPS <- survival::survfit(coxSPS, newdata = test)

coxAPS <- survival::coxph(Surv(time = V1, event = V2) ~ aps, data = train)
abcoxAPS <- survival::survfit(coxAPS, newdata = test)

coxFull <- survival::coxph(Surv(time = V1, event = V2) ~ . - sps - aps, data = train)
abcoxFull <- survival::survfit(coxFull, newdata = test) # use the full model, not coxAPS

KM <- survival::coxph(Surv(time = V1, event = V2) ~ 1, data = test)
abKM <- survival::survfit(KM)
abKMcompare <- cbind(abKM$time[abKM$time <= 1631], 1 - abKM$surv[abKM$time <= 1631])
colnames(abKMcompare) <- c("Time","KM")

baseCoxComparisonData <- cbind(abcoxSPS$time[abcoxSPS$time <= 1631],
                               1 - rowMeans(abcoxSPS$surv[abcoxSPS$time <= 1631,]),
                               1 - rowMeans(abcoxAPS$surv[abcoxSPS$time <= 1631,]),
                               1 - rowMeans(abcoxFull$surv[abcoxSPS$time <= 1631,]))
colnames(baseCoxComparisonData) <- c("Time","SPS","APS","Full")
# the unadjusted Kaplan-Meier curve is kept separately in abKMcompare,
# since its event times differ from the training-set times used above.
myColors <- brewer.pal(5,"Dark2")
FigureWidth <- '90%'
```

# SPS vs APS
```{r SPSvsAPS,include=TRUE, out.width = FigureWidth}
ggplot(as.data.frame(baseCoxComparisonData), aes(Time)) +
  geom_line(aes(y = SPS, colour = "SPS"), lwd = 2) +
  geom_line(aes(y = APS, colour = "APS"), lwd = 2) +
  scale_colour_manual(name = "Models", values = myColors[c(3,4)]) +
  labs(y = "Probability of Death", x = "Survival time (Days)")
```

# SPS vs. Kaplan-Meier
```{r SPSvsKM,include=TRUE, out.width = FigureWidth}
ggplot(as.data.frame(baseCoxComparisonData), aes(Time)) +
  geom_line(aes(y = SPS, colour = "SPS"), lwd = 2) +
  geom_line(aes(y = APS, colour = "APS"), lwd = 2) +
  geom_step(data = as.data.frame(abKMcompare), mapping = aes(x = Time, y = KM, color = "Unadjusted"), lwd = 2) +
  scale_colour_manual(name = "Models", values = myColors[c(3,4,2)]) +
  labs(y = "Probability of Death", x = "Survival time (Days)")
```

# All covariates vs. physiology scores vs. unadjusted
```{r FullvsKM,include=TRUE, out.width = FigureWidth}
ggplot(as.data.frame(baseCoxComparisonData), aes(Time)) +
  geom_line(aes(y = SPS, colour = "SPS"), lwd = 2) +
  geom_line(aes(y = APS, colour = "APS"), lwd = 2) +
  geom_step(data = as.data.frame(abKMcompare), mapping = aes(x = Time, y = KM, color = "Unadjusted"), lwd = 2) +
  geom_line(aes(y = Full, col = "All Covariates without SPS APS"), lwd = 2) +
  scale_colour_manual(name = "Models", values = myColors[c(1,3,4,2)]) +
  labs(y = "Probability of Death", x = "Survival time (Days)")
```

# Chosen absolute risk comparisons
```{r ,include=TRUE, out.width = FigureWidth}
ggplot(as.data.frame(baseCoxComparisonData), aes(Time)) +
  geom_step(data = as.data.frame(abKMcompare), mapping = aes(x = Time, y = KM, color = "Unadjusted"), lwd = 2) +
  geom_line(aes(y = Full, colour = "Full"), lwd = 2) +
  scale_colour_manual(name = "Models", values = myColors[c(1,2)]) +
  labs(y = "Probability of Death", x = "Survival time (Days)")
```

# Chosen absolute risk comparisons: conclusion

* Linear associations without physiology scores perform similarly to SPS and APS alone.
* We choose the model with linear associations and no physiology scores as the model of choice (Full model).

# Casebase sampling overview

1. Clever sampling.
2. Implicitly deals with censoring.
3. Allows a parametric fit using *logistic regression*.

* Casebase is parametric, and allows different parametric fits by incorporating the time component.
* The package contains an implementation for generating *population-time* plots.

# Casebase: Sampling

```{r DataModel, echo=FALSE, warning=FALSE,message=FALSE,include=TRUE, out.width = FigureWidth}
nobs <- nrow(data)
ftime <- data$d.time
ord <- order(ftime, decreasing = FALSE)
yCoords <- cbind(cumsum(data[ord, "death"] == 1),
                 cumsum(data[ord, "death"] == 0))
yCoords <- cbind(yCoords, nobs - rowSums(yCoords))
aspectRatio <- 0.75
height <- 8.5 * aspectRatio; width <- 11 * aspectRatio
cases <- data$death == 1
comps <- data$death == 2 # no competing events under this 0/1 coding; kept for generality
# randomly move the cases vertically
moved_cases <- yCoords[cases[ord], 3] * runif(sum(cases))
moved_comps <- yCoords[comps[ord], 3] * runif(sum(comps))
moved_bases <- yCoords[cases[ord], 3] * runif(sum(cases))
plot(0, type = 'n', xlim = c(0, max(ftime)), ylim = c(0, nobs),
     xlab = 'Follow-up time (days)', ylab = 'Population',
     cex.lab = 1.5, cex.axis = 1.5, cex.main = 1.5, cex.sub = 1.5)
ptsz <- 0.5
```

# Casebase: Sampling
```{r , echo=FALSE, warning=FALSE,message=FALSE,include=TRUE, out.width = FigureWidth}
plot(0, type = 'n', xlim = c(0, max(ftime)), ylim = c(0, nobs),
     xlab = 'Follow-up time (days)', ylab = 'Population',
     cex.lab = 1.5, cex.axis = 1.5, cex.main = 1.5, cex.sub = 1.5)
polygon(c(0, 0, ftime[ord], max(ftime), 0),
        c(0, nobs, yCoords[,3], 0, 0),
        col = "grey60")
legend("topright", legend = c("Base"),
       col = c("grey60"), pch = 19, cex = 1.5)
```

# Casebase: Sampling
```{r , echo=FALSE, warning=FALSE,message=FALSE,include=TRUE, out.width = FigureWidth}
plot(0, type = 'n', xlim = c(0, max(ftime)), ylim = c(0, nobs),
     xlab = 'Follow-up time (days)', ylab = 'Population',
     cex.lab = 1.5, cex.axis = 1.5, cex.main = 1.5, cex.sub = 1.5)
polygon(c(0, 0, ftime[ord], max(ftime), 0),
        c(0, nobs, yCoords[,3], 0, 0),
        col = "grey60")
points((ftime[ord])[cases[ord]], moved_bases, pch = 19,
       col = "blue", cex = ptsz)
legend("topright", legend = c("Base","base-series"),
       col = c("grey60","blue"), pch = 19, cex = 1.5)
```

# Casebase: Sampling
```{r , echo=FALSE, warning=FALSE,message=FALSE,include=TRUE, out.width = FigureWidth}
plot(0, type = 'n', xlim = c(0, max(ftime)), ylim = c(0, nobs),
     xlab = 'Follow-up time (days)', ylab = 'Population',
     cex.lab = 1.5, cex.axis = 1.5, cex.main = 1.5, cex.sub = 1.5)
polygon(c(0, 0, ftime[ord], max(ftime), 0),
        c(0, nobs, yCoords[,3], 0, 0),
        col = "grey60")
points((ftime[ord])[cases[ord]], moved_bases, pch = 19,
       col = "blue", cex = ptsz)
points((ftime[ord])[cases[ord]], yCoords[cases[ord],3], pch = 19,
       col = "firebrick3", cex = ptsz)
legend("topright", legend = c("Base","base-series","case-series (Death)"),
       col = c("grey60","blue","firebrick3"), pch = 19, cex = 1.5)
```

# Casebase: Sampling

```{r , echo=FALSE, warning=FALSE,message=FALSE,include=TRUE, out.width = FigureWidth}
plot(0, type = 'n', xlim = c(0, max(ftime)), ylim = c(0, nobs),
     xlab = 'Follow-up time (days)', ylab = 'Population',
     cex.lab = 1.5, cex.axis = 1.5, cex.main = 1.5, cex.sub = 1.5)
polygon(c(0, 0, ftime[ord], max(ftime), 0),
        c(0, nobs, yCoords[,3], 0, 0),
        col = "grey60")
points((ftime[ord])[cases[ord]], moved_bases, pch = 19,
       col = "blue", cex = ptsz)
points((ftime[ord])[cases[ord]], moved_cases, pch = 19,
       col = "firebrick3", cex = ptsz)
legend("topright", legend = c("Base","base-series","case-series (Death)"),
       col = c("grey60", "blue","firebrick3"), pch = 19, cex = 1.5)
```

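# Casebase: Sampling in code

The sampling shown above can be sketched in a few lines of base R (a toy illustration of the idea, not the casebase package's internals; the `*_toy` objects are simulated stand-ins, not the SUPPORT data):

```{r caseBaseSketch, echo=TRUE}
# toy follow-up times and event indicators (simulated stand-ins)
ftime_toy  <- rexp(500, rate = 0.001)
status_toy <- rbinom(500, 1, 0.3)
ratio_toy  <- 10                    # base series ~10x the number of cases
n_base <- ratio_toy * sum(status_toy)
# sample base-series moments uniformly on total person-time:
# pick a person with probability proportional to their follow-up time,
# then pick a moment uniformly within that person's follow-up
id     <- sample(seq_along(ftime_toy), n_base, replace = TRUE, prob = ftime_toy)
b_time <- runif(n_base) * ftime_toy[id]
```

The case series is simply the observed event times, `ftime_toy[status_toy == 1]`.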
# Casebase: Sampling

$$e^{L}=\frac{Pr(Y=1|x,t)}{Pr(Y=0|x,t)}=\frac{h(x,t)*B(x,t)}{b[B(x,t)/B]}=\frac{h(x,t)*B}{b}$$

* $L=\beta X$
* $b$ = size of the base series.
* $B$ = total base (person-time).
* $B(x,t)$ = risk set at survival time $t$.

# Casebase: Sampling

$e^{L}=\frac{Pr(Y=1|x,t)}{Pr(Y=0|x,t)}=\frac{h(x,t)*B(x,t)}{b[B(x,t)/B]}=\frac{h(x,t)*B}{b}$

```{r , echo=FALSE, warning=FALSE,message=FALSE,include=TRUE, out.width = FigureWidth}
plot(0, type = 'n', xlim = c(0, max(ftime)), ylim = c(0, nobs),
     xlab = 'Follow-up time (days)', ylab = 'Population',
     cex.lab = 1.5, cex.axis = 1.5, cex.main = 1.5, cex.sub = 1.5)
polygon(c(0, 0, ftime[ord], max(ftime), 0),
        c(0, nobs, yCoords[,3], 0, 0),
        col = "grey60")
points((ftime[ord])[cases[ord]], moved_bases, pch = 19,
       col = "blue", cex = ptsz)
points((ftime[ord])[cases[ord]], moved_cases, pch = 19,
       col = "firebrick3", cex = ptsz)
# vertical band marking a narrow time window
abline(v = 990, lwd = 2)
abline(v = 1010, lwd = 2)
legend("topright", legend = c("Base","base-series","case-series (Death)"),
       col = c("grey60", "blue","firebrick3"), pch = 19, cex = 1.5)
```

# log-odds = log hazard

$$e^{L}=\frac{\hat{h}(x,t)*B}{b}$$
$$\hat{h}(x,t)=\frac{b*e^{L}}{B}$$
$$log(\hat{h}(x,t))=L+log(\frac{b}{B})$$

# Maximum log-likelihood [1]

$$log(l(\beta_{0},\beta))=\frac{1}{N}\sum^{N}_{i=1}\{y_{i}(\beta_{0}+x_{i}^{T}\beta)-log(1+e^{\beta_{0}+x_{i}^{T}\beta})\}$$

# Maximum log-likelihood, with offset

$$log(l(\textcolor{blue}{log(\frac{b}{B})},\beta))=\frac{1}{N}\sum^{N}_{i=1}\{y_{i}(\textcolor{blue}{log(\frac{b}{B})}+x_{i}^{T}\beta)-log(1+e^{\textcolor{blue}{log(\frac{b}{B})}+x_{i}^{T}\beta})\}$$

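# Maximum log-likelihood, with offset: sketch

A minimal base-R illustration of fitting with a fixed offset (simulated data and a hypothetical coefficient of 0.8; not the SUPPORT fit): `glm()` accepts the constant $log(\frac{b}{B})$ through its `offset` argument.

```{r offsetSketch, echo=TRUE}
n_toy <- 2000
x_toy <- rnorm(n_toy)
off   <- rep(log(0.1), n_toy)        # stands in for log(b/B)
y_toy <- rbinom(n_toy, 1, plogis(off + 0.8 * x_toy))
fit_toy <- glm(y_toy ~ x_toy, family = binomial, offset = off)
coef(fit_toy)["x_toy"]               # estimate should land near the true 0.8
```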
418
# Maximum log-likelihood, with offset and lasso
419
420
421
$\frac{1}{N}\sum^{N}_{i=1}\{y_{i}(\textcolor{blue}{log(\frac{b}{B})}+x_{i}^{T}\beta)-log(1+e^{\textcolor{blue}{log(\frac{b}{B})}+x_{i}^{T}\beta})\}  \textcolor{red}{-\lambda||\beta||}$
422
 
423
# Casebase: Parametric families
* We can now fit models of the form:
$$log(h(t;\alpha,\beta))=g(t;\alpha)+\beta X$$
* By changing the function $g(t;\alpha)$, we can easily model different parametric families:

# Casebase: Parametric models
*Exponential*: $g(t;\alpha)$ is equal to a constant
```{r,echo=TRUE,eval=FALSE}
casebase::fitSmoothHazard(status ~ X1 + X2)
```
*Gompertz*: $g(t;\alpha)=\alpha t$
```{r,echo=TRUE,eval=FALSE}
casebase::fitSmoothHazard(status ~ time + X1 + X2)
```
*Weibull*: $g(t;\alpha) = \alpha log(t)$
```{r,echo=TRUE,eval=FALSE}
casebase::fitSmoothHazard(status ~ log(time) + X1 + X2)
```

# Absolute Risk
* We now have several parametric hazard models.
* To get the absolute risk, we evaluate the following equation in relation to the hazard:
$$CI(x,t)=1-e^{-\int^{t}_{0}h(x,u)du}$$
* $CI(x,t)$ = cumulative incidence (absolute risk)
* $h(x,u)$ = hazard function

* Let's use the Weibull hazard.

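# Absolute Risk: sketch

Under the Weibull form $log(h(t))=\beta_{0}+\alpha log(t)+\beta x$, the integral above can be checked numerically in base R (hypothetical coefficients, not fitted to SUPPORT):

```{r absRiskSketch, echo=TRUE}
b0_toy <- -10; a_toy <- 0.5; bx_toy <- 0.3   # hypothetical coefficients
hazard_toy <- function(u) exp(b0_toy + a_toy * log(u) + bx_toy)
# CI(x, t) = 1 - exp(-integral_0^t h(x, u) du)
absolute_risk_toy <- function(t)
  1 - exp(-integrate(hazard_toy, lower = 0, upper = t)$value)
absolute_risk_toy(1000)
```

Here the cumulative hazard also has the closed form $e^{\beta_{0}+\beta x}t^{\alpha+1}/(\alpha+1)$, which makes the numerical answer easy to verify.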
# Models to be compared

* Recall: case study to demonstrate regularization using our method.

* **Unadjusted**: (death, time) ~ 1
* **Cox**: (death, time) ~ $\beta X$
* **Coxnet**: (death, time) ~ $\beta X$ <-- lasso
* **Penalized logistic**: death ~ log(time) + $\beta X$ <-- lasso

```{r supportSetUp, echo = FALSE, eval = TRUE}
ratio <- 10  # case:base sampling ratio used throughout

# x and y will be used to fit the casebase model under glmnet.
y <- data.matrix(train[,c(2,1)])
colnames(y) <- c("death","d.time")
x <- data.matrix(train[,-c(2,1)])
newData <- data.matrix(test[,-c(2,1)])
```

```{r coxHazAbsolute, echo = FALSE }
# create u and xCox, which will be used when fitting Cox with glmnet.
u <- survival::Surv(time = as.numeric(y[,2]), event = as.factor(y[,1]))

xCox <- as.matrix(cbind(data.frame("(Intercept)" = rep(1, each = length(x[,1]))), x))
colnames(xCox)[1] <- "(Intercept)"

# hazard for Cox using glmnet
coxNet <- glmnet::cv.glmnet(x = x, y = u, family = "cox", alpha = 1, standardize = TRUE)
# convergence demonstrated in plot
#plot(coxNet)
# take the coefficient estimates for later use
nonzero_covariate_cox <- predict(coxNet, type = "nonzero", s = "lambda.1se")
nonzero_coef_cox <- coef(coxNet, s = "lambda.1se")
# create a new dataset that only contains the covariates chosen through glmnet.
cleanCoxData <- as.data.frame(cbind(d.time = as.numeric(y[,2]),
                                    death = as.numeric(y[,1]),
                                    x[, nonzero_covariate_cox$X1]))
# fit a cox model using regular estimation; we will not keep its estimates.
# it is used here as an object placeholder.
coxNet <- survival::coxph(Surv(time = d.time, event = death) ~ ., data = cleanCoxData)
# the coefficients of this object are replaced with the glmnet estimates.
# doing so invalidates everything except the coefficients, but the
# coefficients are all we need to estimate the absolute risk.
# (std. errors would be incorrect here, if we were to draw error bars.)
coxNet$coefficients <- nonzero_coef_cox@x

# fit the absolute risk curve for cox+glmnet.
abcoxNet <- survival::survfit(coxNet, type = "breslow",
                              newdata = as.data.frame(newData[, nonzero_covariate_cox$X1]))

abCN <- as.data.frame(cbind(abcoxNet$time, 1 - rowMeans(abcoxNet$surv)))
colnames(abCN) <- c("Time", "Absolute risk")
abCN$Model <- "Coxnet"
abCN <- abCN[abCN$Time <= max(abKMcompare[,1]),]
```

```{r cbHazard, echo = FALSE, eval = TRUE}
cbFull <- casebase::fitSmoothHazard.fit(x, y, family = "glmnet",
                                        time = "d.time", event = "death",
                                        formula_time = ~ log(d.time),
                                        alpha = 1, ratio = ratio, standardize = TRUE)
```

```{r CBabsolute, echo = FALSE, eval = TRUE}
# estimate the absolute risk curve using the newdata parameter.
abCbFull <- casebase::absoluteRisk(cbFull, time = seq(1, max(abKMcompare[,1]), 25),
                                   newdata = newData, s = "lambda.1se",
                                   method = c("numerical"))
abCB <- data.frame(abCbFull[,1], rowMeans(abCbFull[,-c(1)]))

colnames(abCB) <- c("Time", "Absolute risk")
abCB$Model <- "Penalized logistic"

ab <- rbind(abCB, abCN)
```

# Survival comparison

```{r include=TRUE}
ggplot(as.data.frame(ab), aes(Time)) +
  geom_line(data = as.data.frame(baseCoxComparisonData), mapping = aes(y = Full, colour = "Cox"), lwd = 2) +
  geom_line(aes(y = `Absolute risk`, colour = Model), lwd = 2) +
  geom_step(data = as.data.frame(abKMcompare), mapping = aes(x = Time, y = KM, color = "Unadjusted"), lwd = 2) +
  scale_colour_manual(name = "Models", values = myColors[c(4,1,3,2)]) +
  labs(y = "Probability of Death", x = "Survival time (Days)")
```

# Conclusions / take homes

* Classical survival analysis requires methods to incorporate censoring in our data.
* Case-base sampling is a technique that implicitly incorporates censoring.
* Logistic regression on the SUPPORT dataset gave slightly different results near the end of follow-up time.

# Future work

* Comparative measure.
* Survival GWAS.

# References 1

1. Czepiel, S. A. (2002). Maximum likelihood estimation of logistic regression models: theory and implementation. Available at czep.net/stat/mlelr.pdf.

2. Hanley, James A., and Olli S. Miettinen. 2009. "Fitting Smooth-in-Time Prognostic Risk Functions via Logistic Regression." *The International Journal of Biostatistics* 5 (1).

3. Harrell, F. (2020). SupportDesc < Main < Vanderbilt Biostatistics Wiki. [online] Biostat.mc.vanderbilt.edu. Available at: http://biostat.mc.vanderbilt.edu/wiki/Main/SupportDesc [Accessed 25 Feb. 2020].

# References 2

4. Knaus, W. A., Harrell, F. E., Lynn, J., Goldman, L., Phillips, R. S., Connors, A. F., ... & Layde, P. (1995). The SUPPORT prognostic model: Objective estimates of survival for seriously ill hospitalized adults. *Annals of Internal Medicine*, 122(3), 191-203.

5. Saarela, Olli, and Elja Arjas. 2015. "Non-Parametric Bayesian Hazard Regression for Chronic Disease Risk Assessment." *Scandinavian Journal of Statistics* 42 (2). Wiley Online Library: 609-26.

6. Saarela, Olli. 2015. "A Case-Base Sampling Method for Estimating Recurrent Event Intensities." *Lifetime Data Analysis*. Springer, 1-17.

# References 3

7. Scrucca, L., Santucci, A., Aversa, F. Competing risk analysis using R: an easy guide for clinicians. *Bone Marrow Transplant*. 2007 Aug;40(4):381-7. doi: 10.1038/sj.bmt.1705727.

8. Turgeon, M. (2017, June 10). Retrieved May 05, 2019, from https://www.maxturgeon.ca/slides/MTurgeon-2017-Student-Conference.pdf

# Tutorial and slides

<center>

**Tutorial:**

http://sahirbhatnagar.com/casebase/

**Slides:**

https://github.com/Jesse-Islam/ATGC_survival_presentation_Feb.27.2020

**Questions?**

</center>