--- a +++ b/vignettes/customisation.Rmd @@ -0,0 +1,685 @@ +--- +title: "Using custom priors, likelihood, or movements in outbreaker2" +author: "Thibaut Jombart" +date: "`r Sys.Date()`" +output: + rmarkdown::html_vignette: + toc: true + toc_depth: 2 +vignette: > + %\VignetteIndexEntry{Using custom priors, likelihood, or movements in outbreaker2} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + + + +```{r, echo = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>", + fig.width=8, + fig.height=5, + fig.path="figs-customisation/" +) +``` + + +In this vignette, we show how custom functions for priors, likelihood, or +movement of parameters and augmented data can be used in *outbreaker2*. In all +these functions, the process will be similar: + +1. write your own function with the right arguments +2. pass this function as an argument to a `custom...` function +3. pass the result to *outbreaker2* + +Note that 2-3 can be a single step if passing the function to the arguments of +*outbreaker2* directly. Also note that **all priors and likelihoods are expected +on a log scale**. Finally, also note that while the various `custom...` +functions will try to some extent to check that the provided functions are +valid, such tests are very difficult to implement. In short: you are using these +custom features at your own risks - make sure these functions work before +passing them to *outbreaker2*. + + + + +<br> + +# Customising priors + +Priors of *outbreaker2* must be a function of an `outbreaker_param` list (see +`?outbreaker_param`). Here, we decide to use a step function rather than the +default Beta function as a prior for *pi*, the reporting probability, and a flat +prior between 0 and 1 for the mutation rate (which is technically a probability +in the basic genetic model used in *outbreaker2*). + + +We start by defining two functions: an auxiliary function `f` which returns +values on the natural scale, and which we can use for plotting the prior +distribution, and then a function `f_pi` which will be used for the +customisation. + +```{r, f_pi} +f <- function(pi) { + ifelse(pi < 0.8, 0, 5) +} + +f_pi <- function(param) { + log(f(param$pi)) +} + +plot(f, type = "s", col = "blue", + xlab = expression(pi), ylab = expression(p(pi)), + main = expression(paste("New prior for ", pi))) + +``` + + +While `f` is a useful function to visualise the prior, `f_pi` is the function +which will be passed to `outbreaker`. To do so, we pass it to `custom_priors`: + +```{r, custom_prior} +library(outbreaker2) + +f_mu <- function(param) { + if (param$mu < 0 || param$mu > 1) { + return(-Inf) + } else { + return(0.0) + } + +} + +priors <- custom_priors(pi = f_pi, mu = f_mu) +priors + +``` + +Note that `custom_priors` does more than just adding the custom function to a +list. For instance, the following customisations are all wrong, and rightfully +rejected: + +```{r, wrong_prior, error = TRUE} + +## wrong: not a function +## should be pi = function(x){0.0} +custom_priors(pi = 0.0) + +## wrong: two arguments +custom_priors(pi = function(x, y){0.0}) + +``` + +We can now use the new priors to run `outbreaker` on the `fake_outbreak` data +(see [*introduction vignette*](introduction.html)): + +```{r, run_custom_priors} + +dna <- fake_outbreak$dna +dates <- fake_outbreak$sample +w <- fake_outbreak$w +data <- outbreaker_data(dna = dna, dates = dates, w_dens = w) + +## we set the seed to ensure results won't change +set.seed(1) + + +res <- outbreaker(data = data, priors = priors) + +``` + +We can check the results first by looking at the traces, and then by plotting +the posterior distributions of `pi` and `mu`, respectively: + +```{r, traces_custom_priors} + +plot(res) +plot(res, "pi", burnin = 500) +plot(res, "mu", burnin = 500) +plot(res, "pi", type = "density", burnin = 500) +plot(res, "mu", type = "hist", burnin = 500) + +``` + +Note that we are using density and histograms here for illustrative purposes, +but there is no reason to prefer one or the other for a specific +parameter. + + +Interestingly, the trace of `pi` suggests that the MCMC oscillates between two +different states, on either bound of the interval on which the prior is positive +(it is `-Inf` outside (0.8; 1)). This may be a consequence of the step function, +which causes sharp 'cliffs' in the posterior landscape. What shall one do to +derive good samples from the posterior distribution in this kind of situation? +There are several options, which in fact apply to typical cases of multi-modal +posterior distributions: + +- Avoid 'cliffs', i.e. sharp drops in the posterior landscape, typically created + by using step-functions in likelihoods and in priors. + +- Use larger samples, i.e. run more MCMC iterations. + +- Use a different sampler, better than Metropolis-Hasting at deriving samples + from multi-modal distributions. + + +Because we know what the real transmission tree is for this dataset, we can +assess how the new priors impacted the inference of the transmission tree. + +```{r, tree_custom_priors} + +summary(res, burnin = 500) +tree <- summary(res, burnin = 500)$tree + +comparison <- data.frame(case = 1:30, + inferred = paste(tree$from), + true = paste(fake_outbreak$ances), + stringsAsFactors = FALSE) + +comparison$correct <- comparison$inferred == comparison$true +comparison +mean(comparison$correct) + +``` + + + + + +<br> + +# Customizing likelihood + +Likelihood functions customisation works identically to prior functions. The +only difference is that custom functions will take two arguments (`data` and +`param`) instead of one in the prior functions. The function used to specify +custom likelihood is `custom_likelihoods`. Each custom function will correspond +to a specific likelihood component: + +```{r, likelihood_components} + +custom_likelihoods() + +``` + +see `?custom_likelihoods` for details of these components, and see the section +'Extending the model' for new, other components. As for `custom_priors`, a few +checks are performed by `custom_likelihoods`: + +```{r, wrong_likelihood, error = TRUE} + +## wrong: not a function +custom_likelihoods(genetic = "fubar") + +## wrong: only one argument +custom_likelihoods(genetic = function(x){ 0.0 }) + +``` + +A trivial customisation is to disable some or all of the likelihood components +of the model by returning a finite constant. Here, we apply this to two cases: +first, we will disable all likelihood components as a sanity check, making sure +that the transmission tree landscape is explored freely by the MCMC. Second, we +will recreate the [Wallinga & Teunis (1994)](http://dx.doi.org/10.1093/aje/kwh255) +model, by disabling specific components. + + + +## A null model + +```{r, null_model} + +f_null <- function(data, param) { + return(0.0) +} + +null_model <- custom_likelihoods(genetic = f_null, + timing_sampling = f_null, + timing_infections = f_null, + reporting = f_null, + contact = f_null) + +null_model + +``` + + +We also specify settings via the `config` argument to avoid detecting imported +cases, reduce the number of iterations and sampling each of them: + +```{r, run_null_model, cache = TRUE} + +null_config <- list(find_import = FALSE, +n_iter = 500, +sample_every = 1) + +set.seed(1) + +res_null <- outbreaker(data = data, +config = null_config, +likelihoods = null_model) + +``` + + +```{r, res_null_model} + +plot(res_null) +plot(res_null, "pi") +plot(res_null, "mu") + +``` + +By typical MCMC standards, these traces look appalling, as they haven't reach +stationarity (i.e. same mean and variance over time), and are grossly +autocorrelated in parts. Fair enough, as these are only the first 500 iterations +of the MCMC, so that autocorrelation is expected. In fact, what we observe here +literally is the random walk across the posterior landscape, which in this case +is only impacted by the priors. + + +We can check that transmission trees are indeed freely explored: + +```{r, null_trees} + +plot(res_null, type = "alpha") + +``` + +Do **not** try to render the corresponding network using `plot(..., type = +"network")` as the force-direction algorithm will go insane. However, this +network can be visualised using *igraph*, extracting the edges and nodes from +the plot (without displaying it): + +```{r, null_net} + +## extract nodes and edges from the visNetwork object +temp <- plot(res_null, type = "network", min_support = 0) +class(temp) +head(temp$x$edges) +head(temp$x$nodes) + +## make an igraph object +library(igraph) + +net_null <- graph.data.frame(temp$x$edges, + vertices = temp$x$nodes[1:4]) + +plot(net_null, layout = layout.circle, + main = "Null model, posterior trees") + +``` + +We can derive similar diagnostics for the number of generations between cases +(`kappa`), only constrained by default settings to be between 1 and 5, and for +the infection dates (*t_inf*): + +```{r, res_null_diag} + +plot(res_null, type = "kappa") +plot(res_null, type = "t_inf") + +``` + +Finally, we can verify that the distributions of `mu` and `pi` match their +priors, respectively an exponential distribution with rate 1000 and a beta with +parameters 10 and 1. Here, we get a qualitative assessment by comparing the +observed distribution (histograms) to the densities of similar sized random +samples from the priors: + +```{r, res_null_priors} + +par(xpd=TRUE) +hist(res_null$mu, prob = TRUE, col = "grey", + border = "white", + main = "Distribution of mu") + +invisible(replicate(30, + points(density(rexp(500, 1000)), type = "l", col = "blue"))) + + +hist(res_null$pi, prob = TRUE, col = "grey", + border = "white", main = "Distribution of pi") + +invisible(replicate(30, + points(density(rbeta(500, 10, 1)), type = "l", col = "blue"))) + +``` + + + + + + +<br> + +## A model using symptom onset only + +We can use data and likelihood customisation to change the default *outbreaker2* +model into a [Wallinga & Teunis (1994)](http://dx.doi.org/10.1093/aje/kwh255) +model. To do so, we need to: + +- Remove the DNA sequences from the data; alternatively we could also specify a + 'null' function (i.e. returning a finite constant, as above) for the genetic + likelihood. + +- Disable all likelihood components other than `timing_infections` using + `custom_likelihoods`. This means that the dates provided will be treated as + dates of symptom onset, and the timing distribution `w` will be taken as the + serial interval. + +- Disable the detection of imported cases, and forcing all `kappa` values to be + 1. + + + +While these are fairly major changes, they are straightforward to implement. We +first create the dataset and custom likelihood functions: + +```{r, wt_custom} + +onset_data <- outbreaker_data(dates = fake_outbreak$onset, + w_dens = fake_outbreak$w) + +wt_model <- custom_likelihoods(timing_sampling = f_null, + reporting = f_null) + +``` + +To fix parameters or augmented data (here, fix all `kappa` values to 1), we set +the initial states to the desired values and disable the corresponding moves: + +```{r, wt_config} + +wt_config <- create_config(init_kappa = 1, move_kappa = FALSE, + init_pi = 1, move_pi = FALSE, + move_mu = FALSE) + +``` + +We can now run the analyses for this new model: +```{r, run_wt, cache = TRUE} + +set.seed(1) +res_wt <- outbreaker(data = onset_data, + config = wt_config, + likelihoods = wt_model) + +``` + + +```{r, res_wt} + +plot(res_wt) +plot(res_wt, burnin = 500) +plot(res_wt, burnin = 500, type = "alpha") +summary(res_wt) + +``` + +As before for the 'null' model, the transmission tree is very poorly resolved in +this case. We use the same approach to visualise it: extract nodes and edges +from the `visNetork` object, use this information to create an `igraph` object, +and visualise the result using a circular layout: + +```{r, wt_net} + +## extract nodes and edges from the visNetwork object +temp <- plot(res_wt, type = "network", min_support = 0.05) +class(temp) +head(temp$x$edges) +head(temp$x$nodes) + +## make an igraph object + +net_wt <- graph.data.frame(temp$x$edges, + vertices = temp$x$nodes[1:4]) + +plot(net_wt, layout = layout.circle, + main = "WT model, posterior trees") + +``` + + + + + + +<br> + +# Customising movements + +Customising movements works in similar ways to priors and likelihoods. In +practice, this type of customisation is more complex as it most likely will +require evaluation of likelihoods and priors, and therefore require the user to +know which functions to all, and how. These are documented in the [API +vignette](Rcpp_API.html). In the following, we provide two examples: + +- a (fake) Gibbs sampler for the movement of the mutation rate `mu` + +- a new 'naive' movement of ancestries in which infectors are picked at random + from all cases + +But before getting into these, we need to explicit how movements are happening +in *outbreaker2*. + + + +## Movements in *outbreaker2* + +At the core of the `outbreaker` function, movements are implemented as a list of +functions, which are all evaluated in turn during every iteration of the +MCMC. All movement functions must obey two rules: + +- The first argument must be an `outbreaker_param` object (typically called + `param` in the original code); see `?create_param` for details. + +- All movement functions must return a valid, `outbreaker_param` object. + + +However, a new difficulty compared to prior or likelihood customisation is that +different movements may require different components of the model, and a +different set of arguments after `param`. In fact, this can be seen by examining +the arguments of all the default movement functions: + +```{r, move_defaults} + +lapply(custom_moves(), args) + +``` + +To handle this difficulty, *outbreaker2* transforms every movement function +before running the MCMC into a new function with a single parameter `param`, +attaching all other required argument to the function's environment. The +function achieving this transformation is called `bind_moves`. This function +'knows' what these components are for known moves listed above. For new, +unknown moves, it attaches the following components, provided they are used as +arguments of the new function: + +- `data`: the processed data; see `?outbreaker_data` + +- `config`: the configuration list; see `create_config` + +- `likelihoods`: a list of custom likelihood functions; see + `?custom_likelihoods` + +- `priors`: a list of custom prior functions; see `?custom_priors` + + +See examples in `?bind_moves` for details of how this process works. + + + + + +## A (fake) Gibbs sampler for `mu` + +A Gibbs sampler supposes that the conditional distribution of a parameter is +known and can directly be sampled from. Here, we use this principle to provide a +toy example of custom movement for `mu`, assuming that this conditional +distribution is always an Exponential distribution with a rate of 1000. This is +easy to implement; to make sure that the function is actually used, we set a +global variable changed when the function is called. + + +```{r, custom_move_mu} + +move_mu <- function(param, config) { + + NEW_MOVE_HAS_BEEN_USED <<- TRUE + param$mu <- rexp(1, 1000) + return(param) + +} + +moves <- custom_moves(mu = move_mu) + +quick_config <- list(n_iter = 500, sample_every = 1, find_import = FALSE) + +``` + +Note that the new movement function `move_mu` has two arguments, and that we do +not specify `config`. Internally, what happens is: + +```{r, bind_moves} + +## bind quick_config to function +move_mu_intern <- bind_to_function(move_mu, config = quick_config) + +## new function has just one argument +move_mu_intern + +## 'config' is in the function's environment +names(environment(move_mu_intern)) + +## 'config' is actually 'quick_config' +identical(environment(move_mu_intern)$config, quick_config) + +``` + + +We perform a quick run using this new movement: + +```{r, run_custom_move_mu} + +NEW_MOVE_HAS_BEEN_USED <- FALSE + +set.seed(1) +res_move_mu <- outbreaker(data, quick_config, moves = moves) +NEW_MOVE_HAS_BEEN_USED + +plot(res_move_mu) +plot(res_move_mu, "pi") +plot(res_move_mu, "mu") + +``` + +This short, full trace, clearly hasn't mixed well (as is to be expected). But +while we see the effect of accept/reject sampling (Metropolis algorithm) for +`pi` with a lot of autocorrelation, the trace of `mu` shows complete +independence between successive values. While the Gibbs sampler used here is not +correct, this result is: a Gibbs sampler will be more efficient than the +classical Metropolis(-Hasting) algorithm for a given number a iterations. + + + + +<br> + +## A new movement of ancestries + +Moves of ancestries are done in two ways in outbreaker: by picking ancestors at +random from any prior case, and by swapping cases from a transmission +link. Here, we implement a new move, which will propose infectors which have +been infected on the same day of the current infector. As before, we will use +global variables to keep track of the resulting movements (see `N_ACCEPT` and +`N_REJECT`). + + +```{r, new_move_ances} + +api <- get_cpp_api() + +new_move_ances <- function(param, data, custom_likelihoods = NULL) { + +for (i in 1:data$N) { + current_ances <- param$alpha[i] + if (!is.na(current_ances)) { + ## find cases infected on the same days + current_t_inf <- param$t_inf[current_ances] + pool <- which(param$t_inf == current_t_inf) + pool <- setdiff(pool, i) + + if (length(pool) > 0) { + ## propose new ancestor + current_ll <- api$cpp_ll_all(data, param, i = i, custom_likelihoods) + param$alpha[i] <- sample(pool, 1) + new_ll <- api$cpp_ll_all(data, param, i = i, custom_likelihoods) + + ## likelihood ratio - no correction, move is symmetric + ratio <- exp(new_ll - current_ll) + + ## accept / reject + if (runif(1) <= ratio) { # accept + N_ACCEPT <<- N_ACCEPT + 1 + } else { # reject + N_REJECT <<- N_REJECT + 1 + param$alpha[i] <- current_ances + } + } + } + } + return(param) +} + +moves <- custom_moves(new_move = new_move_ances) + +``` + +We can now use this new move in our transmission tree reconstruction. We will +use a shorter chain than the defaults as this new move is likely to be computer +intensive. + + + +```{r, run_new_move, cache = TRUE} + +N_ACCEPT <- 0 +N_REJECT <- 0 + +set.seed(1) +res_new_move <- outbreaker(data, list(move_kappa = FALSE), moves = moves) + +N_ACCEPT +N_REJECT + +``` + + +```{r, res_new_move} + +plot(res_new_move) +plot(res_new_move, type = "alpha") +summary(res_new_move) + +``` + +Results show a switch to a new mode at about 5000 iterations. Let us compare the +consensus tree to the actual one (store in `fake_outbreak$ances`): + +```{r, check_new_move} + +summary(res_new_move, burnin = 5000) +tree2 <- summary(res_new_move, burnin = 5000)$tree + +comparison <- data.frame(case = 1:30, + inferred = paste(tree2$from), + true = paste(fake_outbreak$ances), + stringsAsFactors = FALSE) + +comparison$correct <- comparison$inferred == comparison$true +comparison +mean(comparison$correct) + +```