--- a +++ b/vignettes/introduction.Rmd @@ -0,0 +1,279 @@ +--- +title: "Introduction to outbreaker2" +author: "Thibaut Jombart" +date: "`r Sys.Date()`" +output: + rmarkdown::html_vignette: + toc: true + toc_depth: 2 +vignette: > + %\VignetteIndexEntry{Introduction to outbreaker2} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + + + +```{r, echo = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>", + fig.width=8, + fig.height=5, + fig.path="figs-introduction/" +) +``` + + +This tutorial provides a worked example of outbreak reconstruction using +*outbreaker2*. For installation guidelines, a general overview of the package's +functionalities as well as other resources, see the 'overview' vignette: +```{r, eval=FALSE} +vignette("Overview", package = "outbreaker2") +``` + +We will be analysing a small simulated outbreak distributed with the package, +`fake_outbreak`. This dataset contains simulated dates of onsets, partial +contact tracing data and pathogen genome sequences for 30 cases: + + +```{r, data} +library(ape) +library(outbreaker2) + +col <- "#6666cc" +fake_outbreak +``` + +Here, we will use the dates of case isolation `$sample`, DNA sequences `$dna`, contact tracing data `$ctd` and the empirical distribution of the generation time `$w`, which can be visualised as: +```{r, w} + +plot(fake_outbreak$w, type = "h", xlim = c(0, 5), + lwd = 30, col = col, lend = 2, + xlab = "Days after infection", + ylab = "p(new case)", + main = "Generation time distribution") + +``` + + +<br> + +# Running the analysis with defaults + +By default, *outbreaker2* uses the settings defined by `create_config()`; see +the documentation of this function for details. Note that the main function of +*outbreaker2* is called `outbreaker` (without number). The function's arguments are: + +```{r} +args(outbreaker) +``` + +The only mandatory input really is the data. For most cases, customising the +method will be done through `config` and the function `create_config()`, which +creates default and alters settings such as prior parameters, length and rate of +sampling from the MCMC, and definition of which parameters should be estimated +('moved'). The last arguments of `outbreaker` are used to specify custom prior, +likelihood, and movement functions, and are detailed in the '*Customisation*' +vignette. + + +Let us run the analysis with default settings: + +```{r, first_run, cache = TRUE} + +dna <- fake_outbreak$dna +dates <- fake_outbreak$sample +ctd <- fake_outbreak$ctd +w <- fake_outbreak$w +data <- outbreaker_data(dna = dna, dates = dates, ctd = ctd, w_dens = w) + +## we set the seed to ensure results won't change +set.seed(1) + +res <- outbreaker(data = data) + +``` + +This analysis will take around 40 seconds on a modern computer. Note that +*outbreaker2* is slower than *outbreaker* for the same number of iterations, but +the two implementations are actually different. In particular, *outbreaker2* +performs many more moves than the original package for each iteration of the +MCMC, resulting in more efficient mixing. In short: *outbreaker2* is slower, but +it requires far less iterations. + + +Results are stored in a `data.frame` with the special class `outbreaker_chains`: +```{r} + +class(res) +dim(res) +res + +``` + +Each row of `res` contains a sample from the MCMC. For each, informations about +the step (iteration of the MCMC), log-values of posterior, likelihood and +priors, and all parameters and augmented data are returned. Ancestries +(i.e. indices of the most recent ancestral case for a given case), are indicated +by `alpha_[index of the case]`, dates of infections by `t_inf_[index of the +case]`, and number of generations between cases and their infector / ancestor by +`kappa_[index of the case]`: + +```{r} + +names(res) + +``` + + + +<br> + +# Analysing the results + +## Graphics + +Results can be visualised using `plot`, which has several options and can be +used to derive various kinds of graphics (see `?plot.outbreaker_chains`). The +basic plot shows the trace of the log-posterior values, which is useful to +assess mixing: + +```{r, basic_trace} + +plot(res) + +``` + +The second argument of `plot` can be used to visualise traces of any +other column in `res`: + +```{r, traces} + +plot(res, "prior") +plot(res, "mu") +plot(res, "t_inf_15") + +``` + +`burnin` can be used to discard the first iterations prior to mixing: + +```{r, basic_trace_burn} + +## compare this to plot(res) +plot(res, burnin = 2000) + +``` + +`type` indicates the type of graphic to plot; roughly: + +- `trace` for traces of the MCMC (default) + +- `hist`, `density` to assess distributions of quantitative values + +- `alpha`, `network` to visualise ancestries / transmission tree; note that + `network` opens up an interactive plot and requires a web browser with + Javascript enabled; the argument `min_support` is useful to select only the + most supported ancestries and avoid displaying too many links + +- `kappa` to visualise the distributions generations between cases and their + ancestor / infector + +Here are a few examples: + +```{r, many_plots} + +plot(res, "mu", "hist", burnin = 2000) + +plot(res, "mu", "density", burnin = 2000) + +plot(res, type = "alpha", burnin = 2000) + +plot(res, type = "t_inf", burnin = 2000) + +plot(res, type = "kappa", burnin = 2000) + +plot(res, type = "network", burnin = 2000, min_support = 0.01) + +``` + + + +## Using `summary` + +The summary of results derives various distributional statistics for posterior, +likelihood and prior densities, as well as for the quantitative parameters. It +also builds a consensus tree, by finding for each case the most frequent +infector / ancestor in the posterior samples. The corresponding frequencies are +reported as 'support'. The most frequent value of kappa is also reported as 'generations': + +```{r, summary} + +summary(res) + +``` + + + +<br> + +# Customising settings and priors + +As said before, most customisation can be achieved via `create_config`. +In the following, we make the following changes to the defaults: + +- increase the number of iterations to 30,000 + +- set the sampling rate to 20 + +- use a star-like initial tree + +- disable to movement of `kappa`, so that we assume that all cases have + observed + +- set a lower rate for the exponential prior of `mu` (10 instead of 1000) + + +```{r, config2, cache = TRUE} + +config2 <- create_config(n_iter = 3e4, + sample_every = 20, + init_tree ="star", + move_kappa = FALSE, + prior_mu = 10) + +set.seed(1) + +res2 <- outbreaker(data, config2) +plot(res2) +plot(res2, burnin = 2000) + +``` + +We can see that the burnin is around 2,500 iterations (i.e. after the initial +step corresponding to a local optimum). We get the consensus tree from the new +results, and compare the inferred tree to the actual ancestries stored in the +dataset (`fake_outbreak$ances`): +```{r, res2} + +summary(res2, burnin = 3000) +tree2 <- summary(res2, burnin = 3000)$tree + +comparison <- data.frame(case = 1:30, + inferred = paste(tree2$from), + true = paste(fake_outbreak$ances), + stringsAsFactors = FALSE) + +comparison$correct <- comparison$inferred == comparison$true +comparison +mean(comparison$correct) + +``` + +Let's visualise the posterior trees: + +```{r, net2} + +plot(res2, type = "network", burnin = 3000, min_support = 0.01) + +```