Switch to unified view

a b/vignettes/introduction.Rmd
1
---
2
title: "Introduction to outbreaker2"
3
author: "Thibaut Jombart"
4
date: "`r Sys.Date()`"
5
output:
6
   rmarkdown::html_vignette:
7
     toc: true
8
     toc_depth: 2
9
vignette: >
10
  %\VignetteIndexEntry{Introduction to outbreaker2}
11
  %\VignetteEngine{knitr::rmarkdown}
12
  %\VignetteEncoding{UTF-8}
13
---
14
15
16
17
```{r, echo = FALSE}
18
knitr::opts_chunk$set(
19
  collapse = TRUE,
20
  comment = "#>", 
21
  fig.width=8, 
22
  fig.height=5, 
23
  fig.path="figs-introduction/"
24
)
25
```
26
27
28
This tutorial provides a worked example of outbreak reconstruction using
29
*outbreaker2*. For installation guidelines, a general overview of the package's
30
functionalities as well as other resources, see the 'overview' vignette:
31
```{r, eval=FALSE}
32
vignette("Overview", package = "outbreaker2")
33
```
34
35
We will be analysing a small simulated outbreak distributed with the package,
36
`fake_outbreak`. This dataset contains simulated dates of onsets, partial
37
contact tracing data and pathogen genome sequences for 30 cases:
38
39
40
```{r, data}
41
library(ape)
42
library(outbreaker2)
43
44
col <- "#6666cc"
45
fake_outbreak
46
```
47
48
Here, we will use the dates of case isolation `$sample`, DNA sequences `$dna`, contact tracing data `$ctd` and the empirical distribution of the generation time `$w`, which can be visualised as:
49
```{r, w}
50
51
plot(fake_outbreak$w, type = "h", xlim = c(0, 5), 
52
     lwd = 30, col = col, lend = 2, 
53
     xlab = "Days after infection", 
54
     ylab = "p(new case)", 
55
     main = "Generation time distribution")
56
57
```
58
59
60
<br>
61
62
# Running the analysis with defaults
63
64
By default, *outbreaker2* uses the settings defined by `create_config()`; see
65
the documentation of this function for details. Note that the main function of
66
*outbreaker2* is called `outbreaker` (without number). The function's arguments are:
67
68
```{r}
69
args(outbreaker)
70
```
71
72
The only mandatory input really is the data. For most cases, customising the
73
method will be done through `config` and the function `create_config()`, which
74
creates default and alters settings such as prior parameters, length and rate of
75
sampling from the MCMC, and definition of which parameters should be estimated
76
('moved'). The last arguments of `outbreaker` are used to specify custom prior,
77
likelihood, and movement functions, and are detailed in the '*Customisation*'
78
vignette.
79
80
81
Let us run the analysis with default settings:
82
83
```{r, first_run, cache = TRUE}
84
85
dna <- fake_outbreak$dna
86
dates <- fake_outbreak$sample
87
ctd <- fake_outbreak$ctd
88
w <- fake_outbreak$w
89
data <- outbreaker_data(dna = dna, dates = dates, ctd = ctd, w_dens = w)
90
91
## we set the seed to ensure results won't change
92
set.seed(1)
93
94
res <- outbreaker(data = data)
95
96
```
97
98
This analysis will take around 40 seconds on a modern computer. Note that
99
*outbreaker2* is slower than *outbreaker* for the same number of iterations, but
100
the two implementations are actually different. In particular, *outbreaker2*
101
performs many more moves than the original package for each iteration of the
102
MCMC, resulting in more efficient mixing. In short: *outbreaker2* is slower, but
103
it requires far less iterations.
104
105
106
Results are stored in a `data.frame` with the special class `outbreaker_chains`:
107
```{r}
108
109
class(res)
110
dim(res)
111
res
112
113
```
114
115
Each row of `res` contains a sample from the MCMC. For each, informations about
116
the step (iteration of the MCMC), log-values of posterior, likelihood and
117
priors, and all parameters and augmented data are returned. Ancestries
118
(i.e. indices of the most recent ancestral case for a given case), are indicated
119
by `alpha_[index of the case]`, dates of infections by `t_inf_[index of the
120
case]`, and number of generations between cases and their infector / ancestor by
121
`kappa_[index of the case]`:
122
123
```{r}
124
125
names(res)
126
127
```
128
129
130
131
<br>
132
133
# Analysing the results
134
135
## Graphics 
136
137
Results can be visualised using `plot`, which has several options and can be
138
used to derive various kinds of graphics (see `?plot.outbreaker_chains`).  The
139
basic plot shows the trace of the log-posterior values, which is useful to
140
assess mixing:
141
142
```{r, basic_trace}
143
144
plot(res)
145
146
```
147
148
The second argument of `plot` can be used to visualise traces of any
149
other column in `res`:
150
151
```{r, traces}
152
153
plot(res, "prior")
154
plot(res, "mu")
155
plot(res, "t_inf_15")
156
157
```
158
159
`burnin` can be used to discard the first iterations prior to mixing:
160
161
```{r, basic_trace_burn}
162
163
## compare this to plot(res)
164
plot(res, burnin = 2000)
165
166
```
167
168
`type` indicates the type of graphic to plot; roughly:
169
170
- `trace` for traces of the MCMC (default)
171
172
- `hist`, `density` to assess distributions of quantitative values
173
174
- `alpha`, `network` to visualise ancestries / transmission tree; note that
175
  `network` opens up an interactive plot and requires a web browser with
176
  Javascript enabled; the argument `min_support` is useful to select only the
177
  most supported ancestries and avoid displaying too many links
178
179
- `kappa` to visualise the distributions generations between cases and their
180
  ancestor / infector
181
182
Here are a few examples:
183
184
```{r, many_plots}
185
186
plot(res, "mu", "hist", burnin = 2000)
187
188
plot(res, "mu", "density", burnin = 2000)
189
190
plot(res, type = "alpha", burnin = 2000)
191
192
plot(res, type = "t_inf", burnin = 2000)
193
194
plot(res, type = "kappa", burnin = 2000)
195
196
plot(res, type = "network", burnin = 2000, min_support = 0.01)
197
198
```
199
200
201
202
## Using `summary`
203
204
The summary of results derives various distributional statistics for posterior,
205
likelihood and prior densities, as well as for the quantitative parameters. It
206
also builds a consensus tree, by finding for each case the most frequent
207
infector / ancestor in the posterior samples. The corresponding frequencies are
208
reported as 'support'. The most frequent value of kappa is also reported as 'generations':
209
210
```{r, summary}
211
212
summary(res)
213
214
```
215
216
217
218
<br>
219
220
# Customising settings and priors
221
222
As said before, most customisation can be achieved via `create_config`.
223
In the following, we make the following changes to the defaults:
224
225
- increase the number of iterations to 30,000
226
227
- set the sampling rate to 20
228
229
- use a star-like initial tree
230
231
- disable to movement of `kappa`, so that we assume that all cases have
232
  observed
233
  
234
- set a lower rate for the exponential prior of `mu` (10 instead of 1000)
235
236
237
```{r, config2, cache = TRUE}
238
239
config2 <- create_config(n_iter = 3e4,
240
                         sample_every = 20,
241
                 init_tree ="star",
242
             move_kappa = FALSE,
243
             prior_mu = 10)
244
245
set.seed(1)
246
247
res2 <- outbreaker(data, config2)
248
plot(res2)
249
plot(res2, burnin = 2000)
250
251
```
252
253
We can see that the burnin is around 2,500 iterations (i.e. after the initial
254
step corresponding to a local optimum).  We get the consensus tree from the new
255
results, and compare the inferred tree to the actual ancestries stored in the
256
dataset (`fake_outbreak$ances`):
257
```{r, res2}
258
259
summary(res2, burnin = 3000)
260
tree2 <- summary(res2, burnin = 3000)$tree
261
262
comparison <- data.frame(case = 1:30,
263
                         inferred = paste(tree2$from),
264
             true = paste(fake_outbreak$ances),
265
             stringsAsFactors = FALSE)
266
             
267
comparison$correct <- comparison$inferred == comparison$true
268
comparison
269
mean(comparison$correct)
270
271
```
272
273
Let's visualise the posterior trees:
274
275
```{r, net2}
276
277
plot(res2, type = "network", burnin = 3000, min_support = 0.01)
278
279
```