|
a |
|
b/vignettes/introduction.Rmd |
|
|
1 |
--- |
|
|
2 |
title: "Introduction to outbreaker2" |
|
|
3 |
author: "Thibaut Jombart" |
|
|
4 |
date: "`r Sys.Date()`" |
|
|
5 |
output: |
|
|
6 |
rmarkdown::html_vignette: |
|
|
7 |
toc: true |
|
|
8 |
toc_depth: 2 |
|
|
9 |
vignette: > |
|
|
10 |
%\VignetteIndexEntry{Introduction to outbreaker2} |
|
|
11 |
%\VignetteEngine{knitr::rmarkdown} |
|
|
12 |
%\VignetteEncoding{UTF-8} |
|
|
13 |
--- |
|
|
14 |
|
|
|
15 |
|
|
|
16 |
|
|
|
17 |
```{r, echo = FALSE} |
|
|
18 |
knitr::opts_chunk$set( |
|
|
19 |
collapse = TRUE, |
|
|
20 |
comment = "#>", |
|
|
21 |
fig.width=8, |
|
|
22 |
fig.height=5, |
|
|
23 |
fig.path="figs-introduction/" |
|
|
24 |
) |
|
|
25 |
``` |
|
|
26 |
|
|
|
27 |
|
|
|
28 |
This tutorial provides a worked example of outbreak reconstruction using |
|
|
29 |
*outbreaker2*. For installation guidelines, a general overview of the package's |
|
|
30 |
functionalities as well as other resources, see the 'overview' vignette: |
|
|
31 |
```{r, eval=FALSE} |
|
|
32 |
vignette("Overview", package = "outbreaker2") |
|
|
33 |
``` |
|
|
34 |
|
|
|
35 |
We will be analysing a small simulated outbreak distributed with the package, |
|
|
36 |
`fake_outbreak`. This dataset contains simulated dates of onsets, partial |
|
|
37 |
contact tracing data and pathogen genome sequences for 30 cases: |
|
|
38 |
|
|
|
39 |
|
|
|
40 |
```{r, data} |
|
|
41 |
library(ape) |
|
|
42 |
library(outbreaker2) |
|
|
43 |
|
|
|
44 |
col <- "#6666cc" |
|
|
45 |
fake_outbreak |
|
|
46 |
``` |
|
|
47 |
|
|
|
48 |
Here, we will use the dates of case isolation `$sample`, DNA sequences `$dna`, contact tracing data `$ctd` and the empirical distribution of the generation time `$w`, which can be visualised as: |
|
|
49 |
```{r, w} |
|
|
50 |
|
|
|
51 |
plot(fake_outbreak$w, type = "h", xlim = c(0, 5), |
|
|
52 |
lwd = 30, col = col, lend = 2, |
|
|
53 |
xlab = "Days after infection", |
|
|
54 |
ylab = "p(new case)", |
|
|
55 |
main = "Generation time distribution") |
|
|
56 |
|
|
|
57 |
``` |
|
|
58 |
|
|
|
59 |
|
|
|
60 |
<br> |
|
|
61 |
|
|
|
62 |
# Running the analysis with defaults |
|
|
63 |
|
|
|
64 |
By default, *outbreaker2* uses the settings defined by `create_config()`; see |
|
|
65 |
the documentation of this function for details. Note that the main function of |
|
|
66 |
*outbreaker2* is called `outbreaker` (without number). The function's arguments are: |
|
|
67 |
|
|
|
68 |
```{r} |
|
|
69 |
args(outbreaker) |
|
|
70 |
``` |
|
|
71 |
|
|
|
72 |
The only mandatory input really is the data. For most cases, customising the |
|
|
73 |
method will be done through `config` and the function `create_config()`, which |
|
|
74 |
creates default and alters settings such as prior parameters, length and rate of |
|
|
75 |
sampling from the MCMC, and definition of which parameters should be estimated |
|
|
76 |
('moved'). The last arguments of `outbreaker` are used to specify custom prior, |
|
|
77 |
likelihood, and movement functions, and are detailed in the '*Customisation*' |
|
|
78 |
vignette. |
|
|
79 |
|
|
|
80 |
|
|
|
81 |
Let us run the analysis with default settings: |
|
|
82 |
|
|
|
83 |
```{r, first_run, cache = TRUE} |
|
|
84 |
|
|
|
85 |
dna <- fake_outbreak$dna |
|
|
86 |
dates <- fake_outbreak$sample |
|
|
87 |
ctd <- fake_outbreak$ctd |
|
|
88 |
w <- fake_outbreak$w |
|
|
89 |
data <- outbreaker_data(dna = dna, dates = dates, ctd = ctd, w_dens = w) |
|
|
90 |
|
|
|
91 |
## we set the seed to ensure results won't change |
|
|
92 |
set.seed(1) |
|
|
93 |
|
|
|
94 |
res <- outbreaker(data = data) |
|
|
95 |
|
|
|
96 |
``` |
|
|
97 |
|
|
|
98 |
This analysis will take around 40 seconds on a modern computer. Note that |
|
|
99 |
*outbreaker2* is slower than *outbreaker* for the same number of iterations, but |
|
|
100 |
the two implementations are actually different. In particular, *outbreaker2* |
|
|
101 |
performs many more moves than the original package for each iteration of the |
|
|
102 |
MCMC, resulting in more efficient mixing. In short: *outbreaker2* is slower, but |
|
|
103 |
it requires far less iterations. |
|
|
104 |
|
|
|
105 |
|
|
|
106 |
Results are stored in a `data.frame` with the special class `outbreaker_chains`: |
|
|
107 |
```{r} |
|
|
108 |
|
|
|
109 |
class(res) |
|
|
110 |
dim(res) |
|
|
111 |
res |
|
|
112 |
|
|
|
113 |
``` |
|
|
114 |
|
|
|
115 |
Each row of `res` contains a sample from the MCMC. For each, informations about |
|
|
116 |
the step (iteration of the MCMC), log-values of posterior, likelihood and |
|
|
117 |
priors, and all parameters and augmented data are returned. Ancestries |
|
|
118 |
(i.e. indices of the most recent ancestral case for a given case), are indicated |
|
|
119 |
by `alpha_[index of the case]`, dates of infections by `t_inf_[index of the |
|
|
120 |
case]`, and number of generations between cases and their infector / ancestor by |
|
|
121 |
`kappa_[index of the case]`: |
|
|
122 |
|
|
|
123 |
```{r} |
|
|
124 |
|
|
|
125 |
names(res) |
|
|
126 |
|
|
|
127 |
``` |
|
|
128 |
|
|
|
129 |
|
|
|
130 |
|
|
|
131 |
<br> |
|
|
132 |
|
|
|
133 |
# Analysing the results |
|
|
134 |
|
|
|
135 |
## Graphics |
|
|
136 |
|
|
|
137 |
Results can be visualised using `plot`, which has several options and can be |
|
|
138 |
used to derive various kinds of graphics (see `?plot.outbreaker_chains`). The |
|
|
139 |
basic plot shows the trace of the log-posterior values, which is useful to |
|
|
140 |
assess mixing: |
|
|
141 |
|
|
|
142 |
```{r, basic_trace} |
|
|
143 |
|
|
|
144 |
plot(res) |
|
|
145 |
|
|
|
146 |
``` |
|
|
147 |
|
|
|
148 |
The second argument of `plot` can be used to visualise traces of any |
|
|
149 |
other column in `res`: |
|
|
150 |
|
|
|
151 |
```{r, traces} |
|
|
152 |
|
|
|
153 |
plot(res, "prior") |
|
|
154 |
plot(res, "mu") |
|
|
155 |
plot(res, "t_inf_15") |
|
|
156 |
|
|
|
157 |
``` |
|
|
158 |
|
|
|
159 |
`burnin` can be used to discard the first iterations prior to mixing: |
|
|
160 |
|
|
|
161 |
```{r, basic_trace_burn} |
|
|
162 |
|
|
|
163 |
## compare this to plot(res) |
|
|
164 |
plot(res, burnin = 2000) |
|
|
165 |
|
|
|
166 |
``` |
|
|
167 |
|
|
|
168 |
`type` indicates the type of graphic to plot; roughly: |
|
|
169 |
|
|
|
170 |
- `trace` for traces of the MCMC (default) |
|
|
171 |
|
|
|
172 |
- `hist`, `density` to assess distributions of quantitative values |
|
|
173 |
|
|
|
174 |
- `alpha`, `network` to visualise ancestries / transmission tree; note that |
|
|
175 |
`network` opens up an interactive plot and requires a web browser with |
|
|
176 |
Javascript enabled; the argument `min_support` is useful to select only the |
|
|
177 |
most supported ancestries and avoid displaying too many links |
|
|
178 |
|
|
|
179 |
- `kappa` to visualise the distributions generations between cases and their |
|
|
180 |
ancestor / infector |
|
|
181 |
|
|
|
182 |
Here are a few examples: |
|
|
183 |
|
|
|
184 |
```{r, many_plots} |
|
|
185 |
|
|
|
186 |
plot(res, "mu", "hist", burnin = 2000) |
|
|
187 |
|
|
|
188 |
plot(res, "mu", "density", burnin = 2000) |
|
|
189 |
|
|
|
190 |
plot(res, type = "alpha", burnin = 2000) |
|
|
191 |
|
|
|
192 |
plot(res, type = "t_inf", burnin = 2000) |
|
|
193 |
|
|
|
194 |
plot(res, type = "kappa", burnin = 2000) |
|
|
195 |
|
|
|
196 |
plot(res, type = "network", burnin = 2000, min_support = 0.01) |
|
|
197 |
|
|
|
198 |
``` |
|
|
199 |
|
|
|
200 |
|
|
|
201 |
|
|
|
202 |
## Using `summary` |
|
|
203 |
|
|
|
204 |
The summary of results derives various distributional statistics for posterior, |
|
|
205 |
likelihood and prior densities, as well as for the quantitative parameters. It |
|
|
206 |
also builds a consensus tree, by finding for each case the most frequent |
|
|
207 |
infector / ancestor in the posterior samples. The corresponding frequencies are |
|
|
208 |
reported as 'support'. The most frequent value of kappa is also reported as 'generations': |
|
|
209 |
|
|
|
210 |
```{r, summary} |
|
|
211 |
|
|
|
212 |
summary(res) |
|
|
213 |
|
|
|
214 |
``` |
|
|
215 |
|
|
|
216 |
|
|
|
217 |
|
|
|
218 |
<br> |
|
|
219 |
|
|
|
220 |
# Customising settings and priors |
|
|
221 |
|
|
|
222 |
As said before, most customisation can be achieved via `create_config`. |
|
|
223 |
In the following, we make the following changes to the defaults: |
|
|
224 |
|
|
|
225 |
- increase the number of iterations to 30,000 |
|
|
226 |
|
|
|
227 |
- set the sampling rate to 20 |
|
|
228 |
|
|
|
229 |
- use a star-like initial tree |
|
|
230 |
|
|
|
231 |
- disable to movement of `kappa`, so that we assume that all cases have |
|
|
232 |
observed |
|
|
233 |
|
|
|
234 |
- set a lower rate for the exponential prior of `mu` (10 instead of 1000) |
|
|
235 |
|
|
|
236 |
|
|
|
237 |
```{r, config2, cache = TRUE} |
|
|
238 |
|
|
|
239 |
config2 <- create_config(n_iter = 3e4, |
|
|
240 |
sample_every = 20, |
|
|
241 |
init_tree ="star", |
|
|
242 |
move_kappa = FALSE, |
|
|
243 |
prior_mu = 10) |
|
|
244 |
|
|
|
245 |
set.seed(1) |
|
|
246 |
|
|
|
247 |
res2 <- outbreaker(data, config2) |
|
|
248 |
plot(res2) |
|
|
249 |
plot(res2, burnin = 2000) |
|
|
250 |
|
|
|
251 |
``` |
|
|
252 |
|
|
|
253 |
We can see that the burnin is around 2,500 iterations (i.e. after the initial |
|
|
254 |
step corresponding to a local optimum). We get the consensus tree from the new |
|
|
255 |
results, and compare the inferred tree to the actual ancestries stored in the |
|
|
256 |
dataset (`fake_outbreak$ances`): |
|
|
257 |
```{r, res2} |
|
|
258 |
|
|
|
259 |
summary(res2, burnin = 3000) |
|
|
260 |
tree2 <- summary(res2, burnin = 3000)$tree |
|
|
261 |
|
|
|
262 |
comparison <- data.frame(case = 1:30, |
|
|
263 |
inferred = paste(tree2$from), |
|
|
264 |
true = paste(fake_outbreak$ances), |
|
|
265 |
stringsAsFactors = FALSE) |
|
|
266 |
|
|
|
267 |
comparison$correct <- comparison$inferred == comparison$true |
|
|
268 |
comparison |
|
|
269 |
mean(comparison$correct) |
|
|
270 |
|
|
|
271 |
``` |
|
|
272 |
|
|
|
273 |
Let's visualise the posterior trees: |
|
|
274 |
|
|
|
275 |
```{r, net2} |
|
|
276 |
|
|
|
277 |
plot(res2, type = "network", burnin = 3000, min_support = 0.01) |
|
|
278 |
|
|
|
279 |
``` |