---
title: "Penalized logistic regression on time-to-event data using casebase sampling"
author: "Jesse Islam"
date: "2/27/2020"
mathspec: true
theme: "metropolis"
output:
  beamer_presentation:
    incremental: false
    fig_caption: false
header-includes:
  \usepackage{xcolor}
---

```{r setup, include=FALSE}
source("packages.R")
set.seed(1)
knitr::opts_chunk$set(warning = FALSE, message = FALSE, echo = FALSE)
knitr::opts_chunk$set(dev = "pdf")
```

# Popular methods in time-to-event analysis

* In disease etiology, we tend to rely on the proportional hazards assumption.
    * Cox regression
* When we want the absolute risk:
    * Breslow estimator
    * Parametric models

# Motivations for a new method

* Hanley and Miettinen found that survival analyses rarely produce prognostic functions, even though the software is widely available in Cox regression packages. [2]
* They believe the stepwise nature of the resulting risk estimates is the reason, as it reduces interpretability. [2]
* Casebase offers a streamlined approach for reaching a **smooth absolute risk** curve. [2]

# Dr. Cox's perspective

![](coxQuote.png)

# Itinerary

1. SUPPORT study
2. Casebase sampling
3. Penalized logistic regression on survival data
4. Maximum likelihood with regularization
5. Absolute risk comparison
6. Conclusion
7. Future work
8. References

# SUPPORT dataset [4]

* **Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatments**
* Design: prospective cohort study.
* Setting: 5 academic care centers in the United States.
* Participants: 9,105 hospitalized adults.
* Follow-up time: 5.56 years.
* 68% incidence rate.

# SUPPORT manual imputation [4]

* Notorious for missing data.

\begin{table}[]
\centering
\resizebox{\textwidth}{!}{%
\begin{tabular}{ll}
\hline
Baseline Variable & Normal Fill-in Value \\ \hline
Bilirubin & 1.01 \\
BUN & 6.51 \\
Creatinine & 1.01 \\
PaO2/FiO2 ratio (pafi) & 333.3 \\
Serum albumin & 3.5 \\
Urine output & 2502 \\
White blood count & 9 (thousands) \\ \hline
\end{tabular}%
}
\caption{Suggested imputation values. {[}3{]}}
\label{tab:my-table}
\end{table}

# SUPPORT automated imputation

* `mice` imputation package (R)

1. PMM (predictive mean matching) – numeric variables
2. logreg (logistic regression) – binary variables
3. polyreg (Bayesian polytomous regression) – factor variables

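A minimal sketch of how this maps onto a `mice` call (the data frame `df` and its column order are hypothetical; `mice` also picks these methods by default based on each column's type):

```{r, echo=TRUE, eval=FALSE}
library(mice)
# One method per column, in column order:
# numeric -> pmm, binary factor -> logreg, multi-level factor -> polyreg.
imp <- mice(df, m = 1, method = c("pmm", "logreg", "polyreg"), printFlag = FALSE)
completed <- complete(imp, 1)
```
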
# Removed variables [4]

* **Response variables**
    * Hospital charges.
    * Patient ratio of costs to charges.
    * Patient micro-costs.
* **Ordinal covariates**
    * Functional disability.
    * Income.
* **Sparse covariate**
    * Surrogate activities of daily living.
* Previous model results. (6)

# Variable overview [4]

* **Response variables**
    * Follow-up time, death.

* **Covariates**
    * Age, sex, race, education. (6)
    * Disease group/class, comorbidities. (3)
    * Coma score, Therapeutic Intervention Scoring System. (2)
    * Physiological variables. (11)
    * Activities of daily living. (2)

# Original SUPPORT analysis [4]

* Determined the SUPPORT prognostic model on phase I patients.
* Tested on phase II patients.
* Both on the scale of 180 days.

# Original SUPPORT analysis [4]

The SUPPORT physiology score (SPS) was developed.

```{r, out.width="80%", include=TRUE}
knitr::include_graphics("SPS.png")
```

# SUPPORT: my question

* How does their SPS perform over 5.56 years?
* How does the APACHE III physiology score (APS) perform over 5.56 years?
* How does a model with all the covariates, excluding SPS and APS, perform?

# Analysis Process

1. Impute.
2. Compare SPS and APS over ~5.56 years using absolute risk.
3. Compare to the Kaplan-Meier curve (unadjusted model).
4. Compare to the full model (excluding SPS and APS).

* All models are trained on 80% of the observations.
* The remaining observations are used to generate comparative absolute risk curves.
* The absolute risk curves for the held-out individuals are averaged.
* The averaged curve is expected to approach the Kaplan-Meier curve.

```{r dataProcessing, include=FALSE}
# Data retrieved from
# http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/support2csv.zip
ew <- read_csv("support2.csv")
data <- read.csv("support2.csv", col.names = colnames(ew), fill = TRUE, header = TRUE)
rm(ew)
```

```{r cleaningImputation, include=FALSE}
# Manual imputation, using the suggested normal fill-in values.
data$alb[is.na(data$alb)] <- 3.5
data$pafi[is.na(data$pafi)] <- 333.3
data$bili[is.na(data$bili)] <- 1.01
data$crea[is.na(data$crea)] <- 1.01
data$bun[is.na(data$bun)] <- 6.51
data$wblc[is.na(data$wblc)] <- 9
data$urine[is.na(data$urine)] <- 2502

# Automated imputation
previousModelPositions <- 18:29

supportMain <- data[, -previousModelPositions]
supportPrevious <- data[, previousModelPositions]
# Keep columns 1 and 2 after imputation from previous usable models.

# missingPattern <- md.pattern(supportMain)
# Impute while removing ordinal variables and other response variables.
supportMainImpute <- mice(supportMain[, -c(11, 13, 14, 15, 34)], m = 1, printFlag = FALSE)
imputedData <- cbind(complete(supportMainImpute, 1), supportPrevious[, c(1, 2)])
# adls is missing 33% of its values; it is removed, along with hospital death.
# md.pattern(completeData)
completeData <- na.omit(imputedData[, -c(4, 29)])
# Build the design matrix, dropping the intercept column.
workingCompleteData <- model.matrix(death ~ . - d.time, data = completeData)[, -1]
newData <- workingCompleteData
x <- workingCompleteData
# x and y will be used to fit the casebase model under glmnet.
# x = as.matrix(sparse.model.matrix(death ~ . - d.time + 0, data = workingData))
y <- data.matrix(completeData[, c(2, 5)])
```

```{r CoxExplorationModels, include=FALSE, eval=TRUE}
fullDataCox <- as.data.frame(cbind(as.numeric(y[, 2]), as.numeric(y[, 1]), x))
train_index <- sample(1:nrow(fullDataCox), 0.8 * nrow(fullDataCox))
test_index <- setdiff(1:nrow(fullDataCox), train_index)

train <- fullDataCox[train_index, ]
test <- fullDataCox[test_index, ]

coxSPS <- survival::coxph(Surv(time = V1, event = V2) ~ sps, data = train)
abcoxSPS <- survival::survfit(coxSPS, newdata = test)

coxAPS <- survival::coxph(Surv(time = V1, event = V2) ~ aps, data = train)
abcoxAPS <- survival::survfit(coxAPS, newdata = test)

coxFull <- survival::coxph(Surv(time = V1, event = V2) ~ . - sps - aps, data = train)
abcoxFull <- survival::survfit(coxFull, newdata = test)

KM <- survival::coxph(Surv(time = V1, event = V2) ~ 1, data = test)
abKM <- survival::survfit(KM)
abKMcompare <- cbind(abKM$time[abKM$time <= 1631], 1 - abKM$surv[abKM$time <= 1631])
colnames(abKMcompare) <- c("Time", "KM")

baseCoxComparisonData <- cbind(
  abcoxSPS$time[abcoxSPS$time <= 1631],
  1 - rowMeans(abcoxSPS$surv[abcoxSPS$time <= 1631, ]),
  1 - rowMeans(abcoxAPS$surv[abcoxSPS$time <= 1631, ]),
  1 - abKM$surv,
  1 - rowMeans(abcoxFull$surv[abcoxSPS$time <= 1631, ])
)
colnames(baseCoxComparisonData) <- c("Time", "SPS", "APS", "Unadjusted", "Full")
myColors <- brewer.pal(5, "Dark2")
FigureWidth <- "90%"
```

# SPS vs APS

```{r SPSvsAPS, include=TRUE, out.width=FigureWidth}
ggplot(as.data.frame(baseCoxComparisonData), aes(Time)) +
  geom_line(aes(y = SPS, colour = "SPS"), lwd = 2) +
  geom_line(aes(y = APS, colour = "APS"), lwd = 2) +
  scale_colour_manual(name = "Models", values = myColors[c(3, 4)]) +
  labs(y = "Probability of Death", x = "Survival time (Days)")
```

# SPS vs. Kaplan-Meier

```{r SPSvsKM, include=TRUE, out.width=FigureWidth}
ggplot(as.data.frame(baseCoxComparisonData), aes(Time)) +
  geom_line(aes(y = SPS, colour = "SPS"), lwd = 2) +
  geom_line(aes(y = APS, colour = "APS"), lwd = 2) +
  geom_step(data = as.data.frame(abKMcompare),
            mapping = aes(x = Time, y = KM, colour = "Unadjusted"), lwd = 2) +
  scale_colour_manual(name = "Models", values = myColors[c(3, 4, 2)]) +
  labs(y = "Probability of Death", x = "Survival time (Days)")
```

# All covariates vs. physiology scores vs. unadjusted

```{r FullvsKM, include=TRUE, out.width=FigureWidth}
ggplot(as.data.frame(baseCoxComparisonData), aes(Time)) +
  geom_line(aes(y = SPS, colour = "SPS"), lwd = 2) +
  geom_line(aes(y = APS, colour = "APS"), lwd = 2) +
  geom_step(data = as.data.frame(abKMcompare),
            mapping = aes(x = Time, y = KM, colour = "Unadjusted"), lwd = 2) +
  geom_line(aes(y = Full, colour = "All Covariates without SPS APS"), lwd = 2) +
  scale_colour_manual(name = "Models", values = myColors[c(1, 3, 4, 2)]) +
  labs(y = "Probability of Death", x = "Survival time (Days)")
```

# Chosen absolute risk comparisons

```{r, include=TRUE, out.width=FigureWidth}
ggplot(as.data.frame(baseCoxComparisonData), aes(Time)) +
  geom_step(data = as.data.frame(abKMcompare),
            mapping = aes(x = Time, y = KM, colour = "Unadjusted"), lwd = 2) +
  geom_line(aes(y = Full, colour = "Full"), lwd = 2) +
  scale_colour_manual(name = "Models", values = myColors[c(1, 2)]) +
  labs(y = "Probability of Death", x = "Survival time (Days)")
```

# Chosen absolute risk comparisons: conclusion

* Linear associations without the physiology scores perform similarly to SPS and APS alone.
* We choose the linear associations without the physiology scores as the model of choice (the full model).

# Casebase sampling overview

1. Clever sampling.
2. Implicitly deals with censoring.
3. Allows a parametric fit using *logistic regression*.

* Casebase is parametric, and allows different parametric fits by incorporating the time component.
* The package contains an implementation for generating *population-time* plots.

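The sampling step itself is exposed by the package; a minimal sketch (argument names follow `casebase::sampleCaseBase`; `ratio` sets the base-series size relative to the case series):

```{r, echo=TRUE, eval=FALSE}
# Draw the case series plus a base series sampled from the population-time.
cbData <- casebase::sampleCaseBase(completeData, time = "d.time",
                                   event = "death", ratio = 10)
```
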
# Casebase: Sampling

```{r DataModel, echo=FALSE, warning=FALSE, message=FALSE, include=TRUE, out.width=FigureWidth}
nobs <- nrow(data)
ftime <- data$d.time
ord <- order(ftime, decreasing = FALSE)
yCoords <- cbind(cumsum(data[ord, "death"] == 1),
                 cumsum(data[ord, "death"] == 0))
yCoords <- cbind(yCoords, nobs - rowSums(yCoords))
aspectRatio <- 0.75
height <- 8.5 * aspectRatio; width <- 11 * aspectRatio
cases <- data$death == 1
comps <- data$death == 2
# Randomly move the cases vertically
moved_cases <- yCoords[cases[ord], 3] * runif(sum(cases))
moved_comps <- yCoords[comps[ord], 3] * runif(sum(comps))
moved_bases <- yCoords[cases[ord], 3] * runif(sum(cases))
plot(0, type = "n", xlim = c(0, max(ftime)), ylim = c(0, nobs),
     xlab = "Follow-up time (days)", ylab = "Population",
     cex.lab = 1.5, cex.axis = 1.5, cex.main = 1.5, cex.sub = 1.5)
ptsz <- 0.5
```

# Casebase: Sampling

```{r, echo=FALSE, warning=FALSE, message=FALSE, include=TRUE, out.width=FigureWidth}
plot(0, type = "n", xlim = c(0, max(ftime)), ylim = c(0, nobs),
     xlab = "Follow-up time (days)", ylab = "Population",
     cex.lab = 1.5, cex.axis = 1.5, cex.main = 1.5, cex.sub = 1.5)
polygon(c(0, 0, ftime[ord], max(ftime), 0),
        c(0, nobs, yCoords[, 3], 0, 0),
        col = "grey60")
legend("topright", legend = c("Base"),
       col = c("grey60"), pch = 19, cex = 1.5)
```

# Casebase: Sampling

```{r, echo=FALSE, warning=FALSE, message=FALSE, include=TRUE, out.width=FigureWidth}
plot(0, type = "n", xlim = c(0, max(ftime)), ylim = c(0, nobs),
     xlab = "Follow-up time (days)", ylab = "Population",
     cex.lab = 1.5, cex.axis = 1.5, cex.main = 1.5, cex.sub = 1.5)
polygon(c(0, 0, ftime[ord], max(ftime), 0),
        c(0, nobs, yCoords[, 3], 0, 0),
        col = "grey60")
points((ftime[ord])[cases[ord]], moved_bases, pch = 19,
       col = "blue", cex = ptsz)
legend("topright", legend = c("Base", "base-series"),
       col = c("grey60", "blue"), pch = 19, cex = 1.5)
```

# Casebase: Sampling

```{r, echo=FALSE, warning=FALSE, message=FALSE, include=TRUE, out.width=FigureWidth}
plot(0, type = "n", xlim = c(0, max(ftime)), ylim = c(0, nobs),
     xlab = "Follow-up time (days)", ylab = "Population",
     cex.lab = 1.5, cex.axis = 1.5, cex.main = 1.5, cex.sub = 1.5)
polygon(c(0, 0, ftime[ord], max(ftime), 0),
        c(0, nobs, yCoords[, 3], 0, 0),
        col = "grey60")
points((ftime[ord])[cases[ord]], moved_bases, pch = 19,
       col = "blue", cex = ptsz)
points((ftime[ord])[cases[ord]], yCoords[cases[ord], 3], pch = 19,
       col = "firebrick3", cex = ptsz)
legend("topright", legend = c("Base", "base-series", "case-series (Death)"),
       col = c("grey60", "blue", "firebrick3"), pch = 19, cex = 1.5)
```

# Casebase: Sampling

```{r, echo=FALSE, warning=FALSE, message=FALSE, include=TRUE, out.width=FigureWidth}
plot(0, type = "n", xlim = c(0, max(ftime)), ylim = c(0, nobs),
     xlab = "Follow-up time (days)", ylab = "Population",
     cex.lab = 1.5, cex.axis = 1.5, cex.main = 1.5, cex.sub = 1.5)
polygon(c(0, 0, ftime[ord], max(ftime), 0),
        c(0, nobs, yCoords[, 3], 0, 0),
        col = "grey60")
points((ftime[ord])[cases[ord]], moved_bases, pch = 19,
       col = "blue", cex = ptsz)
points((ftime[ord])[cases[ord]], moved_cases, pch = 19,
       col = "firebrick3", cex = ptsz)
legend("topright", legend = c("Base", "base-series", "case-series (Death)"),
       col = c("grey60", "blue", "firebrick3"), pch = 19, cex = 1.5)
```

# Casebase: Sampling

$$e^{L}=\frac{\Pr(Y=1|x,t)}{\Pr(Y=0|x,t)}=\frac{h(x,t)\cdot B(x,t)}{b\,[B(x,t)/B]}=\frac{h(x,t)\cdot B}{b}$$

* $L=\beta X$, the linear predictor.
* $b$ = size of the base-series.
* $B$ = Base.
* $B(x,t)$ = risk set at survival time $t$.

# Casebase: Sampling

$e^{L}=\frac{\Pr(Y=1|x,t)}{\Pr(Y=0|x,t)}=\frac{h(x,t)\cdot B(x,t)}{b\,[B(x,t)/B]}=\frac{h(x,t)\cdot B}{b}$

```{r, echo=FALSE, warning=FALSE, message=FALSE, include=TRUE, out.width=FigureWidth}
plot(0, type = "n", xlim = c(0, max(ftime)), ylim = c(0, nobs),
     xlab = "Follow-up time (days)", ylab = "Population",
     cex.lab = 1.5, cex.axis = 1.5, cex.main = 1.5, cex.sub = 1.5)
polygon(c(0, 0, ftime[ord], max(ftime), 0),
        c(0, nobs, yCoords[, 3], 0, 0),
        col = "grey60")
points((ftime[ord])[cases[ord]], moved_bases, pch = 19,
       col = "blue", cex = ptsz)
points((ftime[ord])[cases[ord]], moved_cases, pch = 19,
       col = "firebrick3", cex = ptsz)
abline(v = 990, lwd = 2)
abline(v = 1010, lwd = 2)
legend("topright", legend = c("Base", "base-series", "case-series (Death)"),
       col = c("grey60", "blue", "firebrick3"), pch = 19, cex = 1.5)
```

# log-odds = log hazard

$$e^{L}=\frac{\hat{h}(x,t)\cdot B}{b}$$
$$\hat{h}(x,t)=\frac{b\cdot e^{L}}{B}$$
$$\log(\hat{h}(x,t))=L+\log\left(\frac{b}{B}\right)$$

# Maximum log-likelihood [1]

$$\log(l(\beta_{0},\beta))=\frac{1}{N}\sum^{N}_{i=1}\left\{y_{i}(\beta_{0}+x_{i}^{T}\beta)-\log\left(1+e^{\beta_{0}+x_{i}^{T}\beta}\right)\right\}$$

# Maximum log-likelihood, with offset

$$\log\left(l\left(\textcolor{blue}{\log(\tfrac{b}{B})},\beta\right)\right)=\frac{1}{N}\sum^{N}_{i=1}\left\{y_{i}\left(\textcolor{blue}{\log(\tfrac{b}{B})}+x_{i}^{T}\beta\right)-\log\left(1+e^{\textcolor{blue}{\log(\tfrac{b}{B})}+x_{i}^{T}\beta}\right)\right\}$$

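In practice, this offset likelihood is exactly what ordinary logistic regression maximizes when the sampling term is supplied as an offset. A minimal sketch (`cbData`, the covariates `x1`/`x2`, and the columns `b`/`B` are hypothetical stand-ins for the sampled case-base data and its sampling quantities):

```{r, echo=TRUE, eval=FALSE}
# The term log(b/B) enters the linear predictor with a fixed coefficient of 1.
fit <- glm(status ~ log(time) + x1 + x2 + offset(log(b / B)),
           family = binomial, data = cbData)
```
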
# Maximum log-likelihood, with offset and lasso

$$\frac{1}{N}\sum^{N}_{i=1}\left\{y_{i}\left(\textcolor{blue}{\log(\tfrac{b}{B})}+x_{i}^{T}\beta\right)-\log\left(1+e^{\textcolor{blue}{\log(\tfrac{b}{B})}+x_{i}^{T}\beta}\right)\right\}\textcolor{red}{-\lambda\lVert\beta\rVert_{1}}$$

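This penalized objective corresponds to lasso logistic regression in glmnet with the same offset supplied. A minimal sketch (`xMat`, `yBin`, and `off` are assumed to be the case-base design matrix, the case/base indicator, and the `log(b/B)` offset vector):

```{r, echo=TRUE, eval=FALSE}
# Lasso-penalized (alpha = 1) logistic regression with the sampling offset;
# cross-validation selects lambda.
fit <- glmnet::cv.glmnet(x = xMat, y = yBin, family = "binomial",
                         offset = off, alpha = 1)
coef(fit, s = "lambda.1se")
```
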
# Casebase: Parametric families

* We can now fit models of the form:
$$\log(h(t;\alpha,\beta))=g(t;\alpha)+\beta X$$
* By changing the function $g(t;\alpha)$, we can easily model different parametric families:

# Casebase: Parametric models

*Exponential*: $g(t;\alpha)$ is equal to a constant
```{r, echo=TRUE, eval=FALSE}
casebase::fitSmoothHazard(status ~ X1 + X2)
```
*Gompertz*: $g(t;\alpha)=\alpha t$
```{r, echo=TRUE, eval=FALSE}
casebase::fitSmoothHazard(status ~ time + X1 + X2)
```
*Weibull*: $g(t;\alpha)=\alpha \log(t)$
```{r, echo=TRUE, eval=FALSE}
casebase::fitSmoothHazard(status ~ log(time) + X1 + X2)
```

# Absolute Risk

* We now have several different parametric hazard models.
* To get the absolute risk, we evaluate the following integral of the hazard:
$$CI(x,t)=1-e^{-\int^{t}_{0}h(x,u)du}$$
* $CI(x,t)$ = cumulative incidence (absolute risk)
* $h(x,u)$ = hazard function

* Let's use the Weibull hazard.

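For a single covariate profile, this integral can be evaluated numerically. A minimal sketch under a Weibull-type fit, where `b0`, `a`, and `xb` are hypothetical fitted values for the intercept, the `log(time)` coefficient, and the linear predictor:

```{r, echo=TRUE, eval=FALSE}
# log h(u) = b0 + a * log(u) + xb, i.e. h(u) = exp(b0 + xb) * u^a
hazard <- function(u) exp(b0 + a * log(u) + xb)
CI <- function(t) 1 - exp(-integrate(hazard, lower = 0, upper = t)$value)
```
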
# Models to be compared

* Recall: this is a case study to demonstrate regularization using our method.

* **Unadjusted**: (death, time) ~ 1
* **Cox**: (death, time) ~ $\beta X$
* **Coxnet**: (death, time) ~ $\beta X$, with lasso
* **Penalized logistic**: death ~ log(time) + $\beta X$, with lasso

```{r supportSetUp, echo=FALSE, eval=TRUE}
ratio <- 10 # universal case:base sampling ratio

# x and y will be used to fit the casebase model under glmnet.
y <- data.matrix(train[, c(2, 1)])
colnames(y) <- c("death", "d.time")
x <- data.matrix(train[, -c(2, 1)])
newData <- data.matrix(test[, -c(2, 1)])
```

```{r coxHazAbsolute, echo=FALSE}
# Create u and xCox, which will be used when fitting Cox with glmnet.
u <- survival::Surv(time = as.numeric(y[, 2]), event = as.numeric(y[, 1]))

xCox <- as.matrix(cbind(data.frame("(Intercept)" = rep(1, each = length(x[, 1]))), x))
colnames(xCox)[2] <- "(Intercept)"

# Hazard for Cox using glmnet; convergence can be checked with plot(coxNet).
coxNet <- glmnet::cv.glmnet(x = x, y = u, family = "cox", alpha = 1, standardize = TRUE)
# Take the coefficient estimates for later use.
nonzero_covariate_cox <- predict(coxNet, type = "nonzero", s = "lambda.1se")
nonzero_coef_cox <- coef(coxNet, s = "lambda.1se")
# Create a new dataset that only contains the covariates chosen through glmnet.
cleanCoxData <- as.data.frame(cbind(d.time = as.numeric(y[, 2]),
                                    death = as.numeric(y[, 1]),
                                    x[, nonzero_covariate_cox$X1]))
# Fit a Cox model using regular estimation; this object is only a placeholder.
coxNet <- survival::coxph(Surv(time = d.time, event = death) ~ ., data = cleanCoxData)
# Replace its coefficients with the glmnet estimates. Everything else in the
# object (e.g. standard errors) becomes invalid, but only the coefficients
# are needed to estimate the absolute risk.
coxNet$coefficients <- nonzero_coef_cox@x

# Absolute risk curve for Cox + glmnet on the test set.
abcoxNet <- survival::survfit(coxNet, type = "breslow",
                              newdata = as.data.frame(newData[, nonzero_covariate_cox$X1]))

abCN <- as.data.frame(cbind(abcoxNet$time, 1 - rowMeans(abcoxNet$surv)))
colnames(abCN) <- c("Time", "Absolute risk")
abCN$Model <- "Coxnet"
abCN <- abCN[abCN$Time <= max(abKMcompare[, 1]), ]
```

```{r cbHazard, echo=FALSE, eval=TRUE}
cbFull <- casebase::fitSmoothHazard.fit(x, y,
                                        family = "glmnet",
                                        time = "d.time", event = "death",
                                        formula_time = ~ log(d.time),
                                        alpha = 1, ratio = ratio,
                                        standardize = TRUE)
```

```{r CBabsolute, echo=FALSE, eval=TRUE}
# Estimate the absolute risk curve on the test set via the newdata argument.
abCbFull <- casebase::absoluteRisk(cbFull,
                                   time = seq(1, max(abKMcompare[, 1]), 25),
                                   newdata = newData,
                                   s = "lambda.1se", method = c("numerical"))
abCB <- data.frame(abCbFull[, 1], rowMeans(abCbFull[, -1]))
colnames(abCB) <- c("Time", "Absolute risk")
abCB$Model <- "Penalized logistic"

ab <- rbind(abCB, abCN)
```

# Survival comparison

```{r, include=TRUE}
ggplot(as.data.frame(ab), aes(Time)) +
  geom_line(data = as.data.frame(baseCoxComparisonData),
            mapping = aes(y = Full, colour = "Cox"), lwd = 2) +
  geom_line(aes(y = `Absolute risk`, colour = Model), lwd = 2) +
  geom_step(data = as.data.frame(abKMcompare),
            mapping = aes(x = Time, y = KM, colour = "Unadjusted"), lwd = 2) +
  scale_colour_manual(name = "Models", values = myColors[c(4, 1, 3, 2)]) +
  labs(y = "Probability of Death", x = "Survival time (Days)")
```

# Conclusions / take-homes

* Classical survival analysis requires methods that incorporate the censoring in our data.
* Case-base sampling is a technique that incorporates censoring implicitly.
* Logistic regression on the SUPPORT dataset gave slightly different results near the end of follow-up time.

# Future work

* A comparative measure.
* Survival GWAS.

# References 1

1. Czepiel, S. A. (2002). Maximum likelihood estimation of logistic regression models: Theory and implementation. Available at czep.net/stat/mlelr.pdf.

2. Hanley, J. A., & Miettinen, O. S. (2009). Fitting smooth-in-time prognostic risk functions via logistic regression. *The International Journal of Biostatistics*, 5(1).

3. Harrell, F. (2020). SupportDesc. Vanderbilt Biostatistics Wiki. Available at: http://biostat.mc.vanderbilt.edu/wiki/Main/SupportDesc [Accessed 25 Feb. 2020].

# References 2

4. Knaus, W. A., Harrell, F. E., Lynn, J., Goldman, L., Phillips, R. S., Connors, A. F., ... & Layde, P. (1995). The SUPPORT prognostic model: Objective estimates of survival for seriously ill hospitalized adults. *Annals of Internal Medicine*, 122(3), 191-203.

5. Saarela, O., & Arjas, E. (2015). Non-parametric Bayesian hazard regression for chronic disease risk assessment. *Scandinavian Journal of Statistics*, 42(2), 609-626.

6. Saarela, O. (2015). A case-base sampling method for estimating recurrent event intensities. *Lifetime Data Analysis*, 1-17.

# References 3

7. Scrucca, L., Santucci, A., & Aversa, F. (2007). Competing risk analysis using R: An easy guide for clinicians. *Bone Marrow Transplantation*, 40(4), 381-387. doi: 10.1038/sj.bmt.1705727.

8. Turgeon, M. (2017, June 10). Retrieved May 5, 2019, from https://www.maxturgeon.ca/slides/MTurgeon-2017-Student-Conference.pdf

# Tutorial and slides

<center>

**Tutorial:**

http://sahirbhatnagar.com/casebase/

**Slides:**

https://github.com/Jesse-Islam/ATGC_survival_presentation_Feb.27.2020

**Questions?**

</center>