Switch to unified view

a b/vignettes/customisation.Rmd
1
---
2
title: "Using custom priors, likelihood, or movements in outbreaker2"
3
author: "Thibaut Jombart"
4
date: "`r Sys.Date()`"
5
output:
6
   rmarkdown::html_vignette:
7
     toc: true
8
     toc_depth: 2
9
vignette: >
10
  %\VignetteIndexEntry{Using custom priors, likelihood, or movements in outbreaker2}
11
  %\VignetteEngine{knitr::rmarkdown}
12
  %\VignetteEncoding{UTF-8}
13
---
14
15
16
17
```{r, echo = FALSE}
18
knitr::opts_chunk$set(
19
  collapse = TRUE,
20
  comment = "#>", 
21
  fig.width=8, 
22
  fig.height=5, 
23
  fig.path="figs-customisation/"
24
)
25
```
26
27
28
In this vignette, we show how custom functions for priors, likelihood, or
29
movement of parameters and augmented data can be used in *outbreaker2*. In all
30
these functions, the process will be similar:
31
32
1. write your own function with the right arguments
33
2. pass this function as an argument to a `custom...` function
34
3. pass the result to *outbreaker2*
35
36
Note that 2-3 can be a single step if passing the function to the arguments of
37
*outbreaker2* directly. Also note that **all priors and likelihoods are expected
38
on a log scale**. Finally, also note that while the various `custom...`
39
functions will try to some extent to check that the provided functions are
40
valid, such tests are very difficult to implement. In short: you are using these
41
custom features at your own risks - make sure these functions work before
42
passing them to *outbreaker2*.
43
44
45
46
47
<br>
48
49
# Customising priors
50
51
Priors of *outbreaker2* must be a function of an `outbreaker_param` list (see
52
`?outbreaker_param`). Here, we decide to use a step function rather than the
53
default Beta function as a prior for *pi*, the reporting probability, and a flat
54
prior between 0 and 1 for the mutation rate (which is technically a probability
55
in the basic genetic model used in *outbreaker2*).
56
57
58
We start by defining two functions: an auxiliary function `f` which returns
59
values on the natural scale, and which we can use for plotting the prior
60
distribution, and then a function `f_pi` which will be used for the
61
customisation.
62
63
```{r, f_pi}
64
f <- function(pi) {
65
    ifelse(pi < 0.8, 0, 5)
66
}
67
68
f_pi <- function(param) { 
69
    log(f(param$pi))
70
}
71
72
plot(f, type = "s", col = "blue", 
73
     xlab = expression(pi), ylab = expression(p(pi)), 
74
     main = expression(paste("New prior for ", pi)))
75
76
```
77
78
79
While `f` is a useful function to visualise the prior, `f_pi` is the function
80
which will be passed to `outbreaker`. To do so, we pass it to `custom_priors`:
81
82
```{r, custom_prior}
83
library(outbreaker2)
84
85
f_mu <- function(param) {
86
  if (param$mu < 0 || param$mu > 1) {
87
    return(-Inf)
88
  } else {
89
    return(0.0)
90
  }
91
  
92
}
93
94
priors <- custom_priors(pi = f_pi, mu = f_mu)
95
priors
96
97
```
98
99
Note that `custom_priors` does more than just adding the custom function to a
100
list. For instance, the following customisations are all wrong, and rightfully
101
rejected:
102
103
```{r, wrong_prior, error = TRUE}
104
105
## wrong: not a function
106
## should be pi = function(x){0.0}
107
custom_priors(pi = 0.0)
108
109
## wrong: two arguments
110
custom_priors(pi = function(x, y){0.0})
111
112
```
113
114
We can now use the new priors to run `outbreaker` on the `fake_outbreak` data
115
(see [*introduction vignette*](introduction.html)):
116
117
```{r, run_custom_priors}
118
119
dna <- fake_outbreak$dna
120
dates <- fake_outbreak$sample
121
w <- fake_outbreak$w
122
data <- outbreaker_data(dna = dna, dates = dates, w_dens = w)
123
124
## we set the seed to ensure results won't change
125
set.seed(1)
126
127
128
res <- outbreaker(data = data, priors = priors)
129
130
```
131
132
We can check the results first by looking at the traces, and then by plotting
133
the posterior distributions of `pi` and `mu`, respectively:
134
135
```{r, traces_custom_priors}
136
137
plot(res)
138
plot(res, "pi", burnin = 500)
139
plot(res, "mu", burnin = 500)
140
plot(res, "pi", type = "density", burnin = 500)
141
plot(res, "mu", type = "hist", burnin = 500)
142
143
```
144
145
Note that we are using density and histograms here for illustrative purposes,
146
but there is no reason to prefer one or the other for a specific
147
parameter.
148
149
150
Interestingly, the trace of `pi` suggests that the MCMC oscillates between two
151
different states, on either bound of the interval on which the prior is positive
152
(it is `-Inf` outside (0.8; 1)). This may be a consequence of the step function,
153
which causes sharp 'cliffs' in the posterior landscape. What shall one do to
154
derive good samples from the posterior distribution in this kind of situation?
155
There are several options, which in fact apply to typical cases of multi-modal
156
posterior distributions:
157
158
- Avoid 'cliffs', i.e. sharp drops in the posterior landscape, typically created
159
  by using step-functions in likelihoods and in priors.
160
161
- Use larger samples, i.e. run more MCMC iterations.
162
163
- Use a different sampler, better than Metropolis-Hasting at deriving samples
164
  from multi-modal distributions.
165
166
167
Because we know what the real transmission tree is for this dataset, we can
168
assess how the new priors impacted the inference of the transmission tree.
169
170
```{r, tree_custom_priors}
171
172
summary(res, burnin = 500)
173
tree <- summary(res, burnin = 500)$tree
174
175
comparison <- data.frame(case = 1:30,
176
                         inferred = paste(tree$from),
177
             true = paste(fake_outbreak$ances),
178
             stringsAsFactors = FALSE)
179
             
180
comparison$correct <- comparison$inferred == comparison$true
181
comparison
182
mean(comparison$correct)
183
184
```
185
186
187
188
189
190
<br>
191
192
# Customizing likelihood
193
194
Likelihood functions customisation works identically to prior functions. The
195
only difference is that custom functions will take two arguments (`data` and
196
`param`) instead of one in the prior functions. The function used to specify
197
custom likelihood is `custom_likelihoods`. Each custom function will correspond
198
to a specific likelihood component:
199
200
```{r, likelihood_components}
201
202
custom_likelihoods()
203
204
```
205
206
see `?custom_likelihoods` for details of these components, and see the section
207
'Extending the model' for new, other components. As for `custom_priors`, a few
208
checks are performed by `custom_likelihoods`:
209
210
```{r, wrong_likelihood, error = TRUE}
211
212
## wrong: not a function
213
custom_likelihoods(genetic = "fubar")
214
215
## wrong: only one argument
216
custom_likelihoods(genetic = function(x){ 0.0 })
217
218
```
219
220
A trivial customisation is to disable some or all of the likelihood components
221
of the model by returning a finite constant. Here, we apply this to two cases:
222
first, we will disable all likelihood components as a sanity check, making sure
223
that the transmission tree landscape is explored freely by the MCMC. Second, we
224
will recreate the [Wallinga & Teunis (1994)](http://dx.doi.org/10.1093/aje/kwh255)
225
model, by disabling specific components.
226
227
228
229
## A null model
230
231
```{r, null_model}
232
233
f_null <- function(data, param) {
234
   return(0.0)
235
}
236
237
null_model <- custom_likelihoods(genetic = f_null,
238
                                 timing_sampling = f_null,
239
                                 timing_infections = f_null,
240
                                 reporting = f_null,
241
                                 contact = f_null)
242
243
null_model
244
245
```
246
247
248
We also specify settings via the `config` argument to avoid detecting imported
249
cases, reduce the number of iterations and sampling each of them:
250
251
```{r, run_null_model, cache = TRUE}
252
253
null_config <- list(find_import = FALSE,
254
n_iter = 500,
255
sample_every = 1)
256
257
set.seed(1)
258
259
res_null <- outbreaker(data = data,
260
config = null_config,
261
likelihoods = null_model)
262
263
```
264
265
266
```{r, res_null_model}
267
268
plot(res_null)
269
plot(res_null, "pi")
270
plot(res_null, "mu")
271
272
```
273
274
By typical MCMC standards, these traces look appalling, as they haven't reach
275
stationarity (i.e. same mean and variance over time), and are grossly
276
autocorrelated in parts. Fair enough, as these are only the first 500 iterations
277
of the MCMC, so that autocorrelation is expected. In fact, what we observe here
278
literally is the random walk across the posterior landscape, which in this case
279
is only impacted by the priors.
280
281
282
We can check that transmission trees are indeed freely explored:
283
284
```{r, null_trees}
285
286
plot(res_null, type = "alpha")
287
288
```
289
290
Do **not** try to render the corresponding network using `plot(..., type =
291
"network")` as the force-direction algorithm will go insane. However, this
292
network can be visualised using *igraph*, extracting the edges and nodes from
293
the plot (without displaying it):
294
295
```{r, null_net}
296
297
## extract nodes and edges from the visNetwork object
298
temp <- plot(res_null, type = "network", min_support = 0)
299
class(temp)
300
head(temp$x$edges)
301
head(temp$x$nodes)
302
303
## make an igraph object
304
library(igraph)
305
306
net_null <- graph.data.frame(temp$x$edges,
307
                             vertices = temp$x$nodes[1:4])
308
309
plot(net_null, layout = layout.circle,
310
     main = "Null model, posterior trees")
311
312
```
313
314
We can derive similar diagnostics for the number of generations between cases
315
(`kappa`), only constrained by default settings to be between 1 and 5, and for
316
the infection dates (*t_inf*):
317
318
```{r, res_null_diag}
319
320
plot(res_null, type = "kappa")
321
plot(res_null, type = "t_inf")
322
323
```
324
325
Finally, we can verify that the distributions of `mu` and `pi` match their
326
priors, respectively an exponential distribution with rate 1000 and a beta with
327
parameters 10 and 1. Here, we get a qualitative assessment by comparing the
328
observed distribution (histograms) to the densities of similar sized random
329
samples from the priors:
330
331
```{r, res_null_priors}
332
333
par(xpd=TRUE)
334
hist(res_null$mu, prob = TRUE, col = "grey",
335
     border = "white",
336
     main = "Distribution of mu")
337
338
invisible(replicate(30,
339
     points(density(rexp(500, 1000)), type = "l", col = "blue")))
340
341
342
hist(res_null$pi, prob = TRUE, col = "grey",
343
     border = "white", main = "Distribution of pi")
344
345
invisible(replicate(30,
346
     points(density(rbeta(500, 10, 1)), type = "l", col = "blue")))
347
348
```
349
350
351
352
353
354
355
<br>
356
357
## A model using symptom onset only
358
359
We can use data and likelihood customisation to change the default *outbreaker2*
360
model into a [Wallinga & Teunis (1994)](http://dx.doi.org/10.1093/aje/kwh255)
361
model. To do so, we need to:
362
363
- Remove the DNA sequences from the data; alternatively we could also specify a
364
  'null' function (i.e. returning a finite constant, as above) for the genetic
365
  likelihood.
366
367
- Disable all likelihood components other than `timing_infections` using
368
  `custom_likelihoods`. This means that the dates provided will be treated as
369
  dates of symptom onset, and the timing distribution `w` will be taken as the
370
  serial interval.
371
372
- Disable the detection of imported cases, and forcing all `kappa` values to be
373
  1.
374
375
376
377
While these are fairly major changes, they are straightforward to implement. We
378
first create the dataset and custom likelihood functions:
379
380
```{r, wt_custom}
381
382
onset_data <- outbreaker_data(dates = fake_outbreak$onset,
383
                          w_dens = fake_outbreak$w)
384
385
wt_model <- custom_likelihoods(timing_sampling = f_null,
386
                               reporting = f_null)
387
388
```
389
390
To fix parameters or augmented data (here, fix all `kappa` values to 1), we set
391
the initial states to the desired values and disable the corresponding moves:
392
393
```{r, wt_config}
394
395
wt_config <- create_config(init_kappa = 1, move_kappa = FALSE,
396
                           init_pi = 1, move_pi = FALSE,
397
                           move_mu = FALSE)
398
399
```
400
401
We can now run the analyses for this new model:
402
```{r, run_wt, cache = TRUE}
403
404
set.seed(1)
405
res_wt <- outbreaker(data = onset_data,
406
                     config = wt_config,
407
                     likelihoods = wt_model)
408
               
409
```
410
411
412
```{r, res_wt}
413
414
plot(res_wt)
415
plot(res_wt, burnin = 500)
416
plot(res_wt, burnin = 500, type = "alpha")
417
summary(res_wt)
418
419
```
420
421
As before for the 'null' model, the transmission tree is very poorly resolved in
422
this case. We use the same approach to visualise it: extract nodes and edges
423
from the `visNetork` object, use this information to create an `igraph` object,
424
and visualise the result using a circular layout:
425
426
```{r, wt_net}
427
428
## extract nodes and edges from the visNetwork object
429
temp <- plot(res_wt, type = "network", min_support = 0.05)
430
class(temp)
431
head(temp$x$edges)
432
head(temp$x$nodes)
433
434
## make an igraph object
435
436
net_wt <- graph.data.frame(temp$x$edges,
437
                             vertices = temp$x$nodes[1:4])
438
                 
439
plot(net_wt, layout = layout.circle,
440
     main = "WT model, posterior trees")
441
442
```
443
444
445
446
447
448
449
<br>
450
451
# Customising movements
452
453
Customising movements works in similar ways to priors and likelihoods. In
454
practice, this type of customisation is more complex as it most likely will
455
require evaluation of likelihoods and priors, and therefore require the user to
456
know which functions to all, and how. These are documented in the [API
457
vignette](Rcpp_API.html). In the following, we provide two examples:
458
459
- a (fake) Gibbs sampler for the movement of the mutation rate `mu`
460
461
- a new 'naive' movement of ancestries in which infectors are picked at random
462
  from all cases
463
464
But before getting into these, we need to explicit how movements are happening
465
in *outbreaker2*.
466
467
468
469
## Movements in *outbreaker2*
470
471
At the core of the `outbreaker` function, movements are implemented as a list of
472
functions, which are all evaluated in turn during every iteration of the
473
MCMC. All movement functions must obey two rules:
474
475
- The first argument must be an `outbreaker_param` object (typically called
476
  `param` in the original code); see `?create_param` for details.
477
478
- All movement functions must return a valid, `outbreaker_param` object.
479
480
481
However, a new difficulty compared to prior or likelihood customisation is that
482
different movements may require different components of the model, and a
483
different set of arguments after `param`. In fact, this can be seen by examining
484
the arguments of all the default movement functions:
485
486
```{r, move_defaults}
487
488
lapply(custom_moves(),  args)
489
490
```
491
492
To handle this difficulty, *outbreaker2* transforms every movement function
493
before running the MCMC into a new function with a single parameter `param`,
494
attaching all other required argument to the function's environment. The
495
function achieving this transformation is called `bind_moves`. This function
496
'knows' what these components are for known moves listed above. For new,
497
unknown moves, it attaches the following components, provided they are used as
498
arguments of the new function:
499
500
- `data`: the processed data; see `?outbreaker_data`
501
502
- `config`: the configuration list; see `create_config`
503
504
- `likelihoods`: a list of custom likelihood functions; see
505
  `?custom_likelihoods`
506
507
- `priors`: a list of custom prior functions; see `?custom_priors`
508
509
510
See examples in `?bind_moves` for details of how this process works.
511
512
513
514
515
516
## A (fake) Gibbs sampler for `mu`
517
518
A Gibbs sampler supposes that the conditional distribution of a parameter is
519
known and can directly be sampled from. Here, we use this principle to provide a
520
toy example of custom movement for `mu`, assuming that this conditional
521
distribution is always an Exponential distribution with a rate of 1000. This is
522
easy to implement; to make sure that the function is actually used, we set a
523
global variable changed when the function is called.
524
525
526
```{r, custom_move_mu}
527
528
move_mu <- function(param, config) {
529
530
  NEW_MOVE_HAS_BEEN_USED <<- TRUE
531
  param$mu <- rexp(1, 1000)  
532
  return(param)
533
  
534
}
535
536
moves <- custom_moves(mu = move_mu)
537
538
quick_config <- list(n_iter = 500, sample_every = 1, find_import = FALSE)
539
540
```
541
542
Note that the new movement function `move_mu` has two arguments, and that we do
543
not specify `config`. Internally, what happens is:
544
545
```{r, bind_moves}
546
547
## bind quick_config to function
548
move_mu_intern <- bind_to_function(move_mu, config = quick_config)
549
550
## new function has just one argument
551
move_mu_intern
552
553
## 'config' is in the function's environment
554
names(environment(move_mu_intern))
555
556
## 'config' is actually 'quick_config'
557
identical(environment(move_mu_intern)$config, quick_config)
558
559
```
560
561
562
We perform a quick run using this new movement:
563
564
```{r, run_custom_move_mu}
565
566
NEW_MOVE_HAS_BEEN_USED <- FALSE
567
568
set.seed(1)
569
res_move_mu <- outbreaker(data, quick_config, moves = moves)
570
NEW_MOVE_HAS_BEEN_USED
571
572
plot(res_move_mu)
573
plot(res_move_mu, "pi")
574
plot(res_move_mu, "mu")
575
576
```
577
578
This short, full trace, clearly hasn't mixed well (as is to be expected). But
579
while we see the effect of accept/reject sampling (Metropolis algorithm) for
580
`pi` with a lot of autocorrelation, the trace of `mu` shows complete
581
independence between successive values. While the Gibbs sampler used here is not
582
correct, this result is: a Gibbs sampler will be more efficient than the
583
classical Metropolis(-Hasting) algorithm for a given number a iterations.
584
585
586
587
588
<br>
589
590
## A new movement of ancestries
591
592
Moves of ancestries are done in two ways in outbreaker: by picking ancestors at
593
random from any prior case, and by swapping cases from a transmission
594
link. Here, we implement a new move, which will propose infectors which have
595
been infected on the same day of the current infector. As before, we will use
596
global variables to keep track of the resulting movements (see `N_ACCEPT` and
597
`N_REJECT`).
598
599
600
```{r, new_move_ances}
601
602
api <- get_cpp_api()
603
604
new_move_ances <- function(param, data, custom_likelihoods = NULL) {
605
606
for (i in 1:data$N) {
607
    current_ances <- param$alpha[i]
608
    if (!is.na(current_ances)) {
609
      ## find cases infected on the same days
610
      current_t_inf <- param$t_inf[current_ances]
611
      pool <- which(param$t_inf == current_t_inf)
612
      pool <- setdiff(pool, i)
613
      
614
      if (length(pool) > 0) {
615
        ## propose new ancestor
616
        current_ll <- api$cpp_ll_all(data, param, i = i, custom_likelihoods)
617
        param$alpha[i] <- sample(pool, 1)
618
        new_ll <- api$cpp_ll_all(data, param, i = i, custom_likelihoods)
619
620
        ## likelihood ratio - no correction, move is symmetric
621
        ratio <- exp(new_ll - current_ll)
622
623
        ## accept / reject
624
        if (runif(1) <= ratio) { # accept
625
          N_ACCEPT <<- N_ACCEPT + 1
626
        } else { # reject
627
          N_REJECT <<- N_REJECT + 1
628
          param$alpha[i] <- current_ances
629
        }
630
      }
631
    }
632
  }
633
  return(param)
634
}
635
636
moves <- custom_moves(new_move = new_move_ances)
637
638
```
639
640
We can now use this new move in our transmission tree reconstruction. We will
641
use a shorter chain than the defaults as this new move is likely to be computer
642
intensive.
643
644
645
646
```{r, run_new_move, cache = TRUE}
647
648
N_ACCEPT <- 0
649
N_REJECT <- 0
650
651
set.seed(1)
652
res_new_move <- outbreaker(data, list(move_kappa = FALSE), moves = moves)
653
654
N_ACCEPT
655
N_REJECT
656
657
```
658
659
660
```{r, res_new_move}
661
662
plot(res_new_move)
663
plot(res_new_move, type = "alpha")
664
summary(res_new_move)
665
666
```
667
668
Results show a switch to a new mode at about 5000 iterations. Let us compare the
669
consensus tree to the actual one (store in `fake_outbreak$ances`):
670
671
```{r, check_new_move}
672
673
summary(res_new_move, burnin = 5000)
674
tree2 <- summary(res_new_move, burnin = 5000)$tree
675
676
comparison <- data.frame(case = 1:30,
677
                         inferred = paste(tree2$from),
678
             true = paste(fake_outbreak$ances),
679
             stringsAsFactors = FALSE)
680
             
681
comparison$correct <- comparison$inferred == comparison$true
682
comparison
683
mean(comparison$correct)
684
685
```