outbreaker2 / Git / Diff of /vignettes/customisation.Rmd

Models:

MarcoTheBlack/

outbreaker2

Downloads: 1

Diff of /vignettes/customisation.Rmd [000000] .. [dfe06d]

Switch to unified view

 b/vignettes/customisation.Rmd
+---
+title: "Using custom priors, likelihood, or movements in outbreaker2"
+author: "Thibaut Jombart"
+date: "`r Sys.Date()`"
+output:
+   rmarkdown::html_vignette:
+     toc: true
+     toc_depth: 2
+vignette: >
+  %\VignetteIndexEntry{Using custom priors, likelihood, or movements in outbreaker2}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+```{r, echo = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>",
+  fig.width=8,
+  fig.height=5,
+  fig.path="figs-customisation/"
+)
+```
+In this vignette, we show how custom functions for priors, likelihood, or
+movement of parameters and augmented data can be used in *outbreaker2*. In all
+these functions, the process will be similar:
+. write your own function with the right arguments
+. pass this function as an argument to a `custom...` function
+. pass the result to *outbreaker2*
+Note that 2-3 can be a single step if passing the function to the arguments of
+*outbreaker2* directly. Also note that **all priors and likelihoods are expected
+on a log scale**. Finally, also note that while the various `custom...`
+functions will try to some extent to check that the provided functions are
+valid, such tests are very difficult to implement. In short: you are using these
+custom features at your own risks - make sure these functions work before
+passing them to *outbreaker2*.
+<br>
+# Customising priors
+Priors of *outbreaker2* must be a function of an `outbreaker_param` list (see
+`?outbreaker_param`). Here, we decide to use a step function rather than the
+default Beta function as a prior for *pi*, the reporting probability, and a flat
+prior between 0 and 1 for the mutation rate (which is technically a probability
+in the basic genetic model used in *outbreaker2*).
+We start by defining two functions: an auxiliary function `f` which returns
+values on the natural scale, and which we can use for plotting the prior
+distribution, and then a function `f_pi` which will be used for the
+customisation.
+```{r, f_pi}
+f <- function(pi) {
+    ifelse(pi < 0.8, 0, 5)
+}
+f_pi <- function(param) {
+    log(f(param$pi))
+}
+plot(f, type = "s", col = "blue",
+     xlab = expression(pi), ylab = expression(p(pi)),
+     main = expression(paste("New prior for ", pi)))
+```
+While `f` is a useful function to visualise the prior, `f_pi` is the function
+which will be passed to `outbreaker`. To do so, we pass it to `custom_priors`:
+```{r, custom_prior}
+library(outbreaker2)
+f_mu <- function(param) {
+  if (param$mu < 0 || param$mu > 1) {
+    return(-Inf)
+  } else {
+    return(0.0)
+  }
+}
+priors <- custom_priors(pi = f_pi, mu = f_mu)
+priors
+```
+Note that `custom_priors` does more than just adding the custom function to a
+list. For instance, the following customisations are all wrong, and rightfully
+rejected:
+```{r, wrong_prior, error = TRUE}
+## wrong: not a function
+## should be pi = function(x){0.0}
+custom_priors(pi = 0.0)
+## wrong: two arguments
+custom_priors(pi = function(x, y){0.0})
+```
+We can now use the new priors to run `outbreaker` on the `fake_outbreak` data
+(see [*introduction vignette*](introduction.html)):
+```{r, run_custom_priors}
+dna <- fake_outbreak$dna
+dates <- fake_outbreak$sample
+w <- fake_outbreak$w
+data <- outbreaker_data(dna = dna, dates = dates, w_dens = w)
+## we set the seed to ensure results won't change
+set.seed(1)
+res <- outbreaker(data = data, priors = priors)
+```
+We can check the results first by looking at the traces, and then by plotting
+the posterior distributions of `pi` and `mu`, respectively:
+```{r, traces_custom_priors}
+plot(res)
+plot(res, "pi", burnin = 500)
+plot(res, "mu", burnin = 500)
+plot(res, "pi", type = "density", burnin = 500)
+plot(res, "mu", type = "hist", burnin = 500)
+```
+Note that we are using density and histograms here for illustrative purposes,
+but there is no reason to prefer one or the other for a specific
+parameter.
+Interestingly, the trace of `pi` suggests that the MCMC oscillates between two
+different states, on either bound of the interval on which the prior is positive
+(it is `-Inf` outside (0.8; 1)). This may be a consequence of the step function,
+which causes sharp 'cliffs' in the posterior landscape. What shall one do to
+derive good samples from the posterior distribution in this kind of situation?
+There are several options, which in fact apply to typical cases of multi-modal
+posterior distributions:
+- Avoid 'cliffs', i.e. sharp drops in the posterior landscape, typically created
+  by using step-functions in likelihoods and in priors.
+- Use larger samples, i.e. run more MCMC iterations.
+- Use a different sampler, better than Metropolis-Hasting at deriving samples
+  from multi-modal distributions.
+Because we know what the real transmission tree is for this dataset, we can
+assess how the new priors impacted the inference of the transmission tree.
+```{r, tree_custom_priors}
+summary(res, burnin = 500)
+tree <- summary(res, burnin = 500)$tree
+comparison <- data.frame(case = 1:30,
+                         inferred = paste(tree$from),
+             true = paste(fake_outbreak$ances),
+             stringsAsFactors = FALSE)
+comparison$correct <- comparison$inferred == comparison$true
+comparison
+mean(comparison$correct)
+```
+<br>
+# Customizing likelihood
+Likelihood functions customisation works identically to prior functions. The
+only difference is that custom functions will take two arguments (`data` and
+`param`) instead of one in the prior functions. The function used to specify
+custom likelihood is `custom_likelihoods`. Each custom function will correspond
+to a specific likelihood component:
+```{r, likelihood_components}
+custom_likelihoods()
+```
+see `?custom_likelihoods` for details of these components, and see the section
+'Extending the model' for new, other components. As for `custom_priors`, a few
+checks are performed by `custom_likelihoods`:
+```{r, wrong_likelihood, error = TRUE}
+## wrong: not a function
+custom_likelihoods(genetic = "fubar")
+## wrong: only one argument
+custom_likelihoods(genetic = function(x){ 0.0 })
+```
+A trivial customisation is to disable some or all of the likelihood components
+of the model by returning a finite constant. Here, we apply this to two cases:
+first, we will disable all likelihood components as a sanity check, making sure
+that the transmission tree landscape is explored freely by the MCMC. Second, we
+will recreate the [Wallinga & Teunis (1994)](http://dx.doi.org/10.1093/aje/kwh255)
+model, by disabling specific components.
+## A null model
+```{r, null_model}
+f_null <- function(data, param) {
+   return(0.0)
+}
+null_model <- custom_likelihoods(genetic = f_null,
+                                 timing_sampling = f_null,
+                                 timing_infections = f_null,
+                                 reporting = f_null,
+                                 contact = f_null)
+null_model
+```
+We also specify settings via the `config` argument to avoid detecting imported
+cases, reduce the number of iterations and sampling each of them:
+```{r, run_null_model, cache = TRUE}
+null_config <- list(find_import = FALSE,
+n_iter = 500,
+sample_every = 1)
+set.seed(1)
+res_null <- outbreaker(data = data,
+config = null_config,
+likelihoods = null_model)
+```
+```{r, res_null_model}
+plot(res_null)
+plot(res_null, "pi")
+plot(res_null, "mu")
+```
+By typical MCMC standards, these traces look appalling, as they haven't reach
+stationarity (i.e. same mean and variance over time), and are grossly
+autocorrelated in parts. Fair enough, as these are only the first 500 iterations
+of the MCMC, so that autocorrelation is expected. In fact, what we observe here
+literally is the random walk across the posterior landscape, which in this case
+is only impacted by the priors.
+We can check that transmission trees are indeed freely explored:
+```{r, null_trees}
+plot(res_null, type = "alpha")
+```
+Do **not** try to render the corresponding network using `plot(..., type =
+"network")` as the force-direction algorithm will go insane. However, this
+network can be visualised using *igraph*, extracting the edges and nodes from
+the plot (without displaying it):
+```{r, null_net}
+## extract nodes and edges from the visNetwork object
+temp <- plot(res_null, type = "network", min_support = 0)
+class(temp)
+head(temp$x$edges)
+head(temp$x$nodes)
+## make an igraph object
+library(igraph)
+net_null <- graph.data.frame(temp$x$edges,
+                             vertices = temp$x$nodes[1:4])
+plot(net_null, layout = layout.circle,
+     main = "Null model, posterior trees")
+```
+We can derive similar diagnostics for the number of generations between cases
+(`kappa`), only constrained by default settings to be between 1 and 5, and for
+the infection dates (*t_inf*):
+```{r, res_null_diag}
+plot(res_null, type = "kappa")
+plot(res_null, type = "t_inf")
+```
+Finally, we can verify that the distributions of `mu` and `pi` match their
+priors, respectively an exponential distribution with rate 1000 and a beta with
+parameters 10 and 1. Here, we get a qualitative assessment by comparing the
+observed distribution (histograms) to the densities of similar sized random
+samples from the priors:
+```{r, res_null_priors}
+par(xpd=TRUE)
+hist(res_null$mu, prob = TRUE, col = "grey",
+     border = "white",
+     main = "Distribution of mu")
+invisible(replicate(30,
+     points(density(rexp(500, 1000)), type = "l", col = "blue")))
+hist(res_null$pi, prob = TRUE, col = "grey",
+     border = "white", main = "Distribution of pi")
+invisible(replicate(30,
+     points(density(rbeta(500, 10, 1)), type = "l", col = "blue")))
+```
+<br>
+## A model using symptom onset only
+We can use data and likelihood customisation to change the default *outbreaker2*
+model into a [Wallinga & Teunis (1994)](http://dx.doi.org/10.1093/aje/kwh255)
+model. To do so, we need to:
+- Remove the DNA sequences from the data; alternatively we could also specify a
+  'null' function (i.e. returning a finite constant, as above) for the genetic
+  likelihood.
+- Disable all likelihood components other than `timing_infections` using
+  `custom_likelihoods`. This means that the dates provided will be treated as
+  dates of symptom onset, and the timing distribution `w` will be taken as the
+  serial interval.
+- Disable the detection of imported cases, and forcing all `kappa` values to be
+.
+While these are fairly major changes, they are straightforward to implement. We
+first create the dataset and custom likelihood functions:
+```{r, wt_custom}
+onset_data <- outbreaker_data(dates = fake_outbreak$onset,
+                          w_dens = fake_outbreak$w)
+wt_model <- custom_likelihoods(timing_sampling = f_null,
+                               reporting = f_null)
+```
+To fix parameters or augmented data (here, fix all `kappa` values to 1), we set
+the initial states to the desired values and disable the corresponding moves:
+```{r, wt_config}
+wt_config <- create_config(init_kappa = 1, move_kappa = FALSE,
+                           init_pi = 1, move_pi = FALSE,
+                           move_mu = FALSE)
+```
+We can now run the analyses for this new model:
+```{r, run_wt, cache = TRUE}
+set.seed(1)
+res_wt <- outbreaker(data = onset_data,
+                     config = wt_config,
+                     likelihoods = wt_model)
+```
+```{r, res_wt}
+plot(res_wt)
+plot(res_wt, burnin = 500)
+plot(res_wt, burnin = 500, type = "alpha")
+summary(res_wt)
+```
+As before for the 'null' model, the transmission tree is very poorly resolved in
+this case. We use the same approach to visualise it: extract nodes and edges
+from the `visNetork` object, use this information to create an `igraph` object,
+and visualise the result using a circular layout:
+```{r, wt_net}
+## extract nodes and edges from the visNetwork object
+temp <- plot(res_wt, type = "network", min_support = 0.05)
+class(temp)
+head(temp$x$edges)
+head(temp$x$nodes)
+## make an igraph object
+net_wt <- graph.data.frame(temp$x$edges,
+                             vertices = temp$x$nodes[1:4])
+plot(net_wt, layout = layout.circle,
+     main = "WT model, posterior trees")
+```
+<br>
+# Customising movements
+Customising movements works in similar ways to priors and likelihoods. In
+practice, this type of customisation is more complex as it most likely will
+require evaluation of likelihoods and priors, and therefore require the user to
+know which functions to all, and how. These are documented in the [API
+vignette](Rcpp_API.html). In the following, we provide two examples:
+- a (fake) Gibbs sampler for the movement of the mutation rate `mu`
+- a new 'naive' movement of ancestries in which infectors are picked at random
+  from all cases
+But before getting into these, we need to explicit how movements are happening
+in *outbreaker2*.
+## Movements in *outbreaker2*
+At the core of the `outbreaker` function, movements are implemented as a list of
+functions, which are all evaluated in turn during every iteration of the
+MCMC. All movement functions must obey two rules:
+- The first argument must be an `outbreaker_param` object (typically called
+  `param` in the original code); see `?create_param` for details.
+- All movement functions must return a valid, `outbreaker_param` object.
+However, a new difficulty compared to prior or likelihood customisation is that
+different movements may require different components of the model, and a
+different set of arguments after `param`. In fact, this can be seen by examining
+the arguments of all the default movement functions:
+```{r, move_defaults}
+lapply(custom_moves(),  args)
+```
+To handle this difficulty, *outbreaker2* transforms every movement function
+before running the MCMC into a new function with a single parameter `param`,
+attaching all other required argument to the function's environment. The
+function achieving this transformation is called `bind_moves`. This function
+'knows' what these components are for known moves listed above. For new,
+unknown moves, it attaches the following components, provided they are used as
+arguments of the new function:
+- `data`: the processed data; see `?outbreaker_data`
+- `config`: the configuration list; see `create_config`
+- `likelihoods`: a list of custom likelihood functions; see
+  `?custom_likelihoods`
+- `priors`: a list of custom prior functions; see `?custom_priors`
+See examples in `?bind_moves` for details of how this process works.
+## A (fake) Gibbs sampler for `mu`
+A Gibbs sampler supposes that the conditional distribution of a parameter is
+known and can directly be sampled from. Here, we use this principle to provide a
+toy example of custom movement for `mu`, assuming that this conditional
+distribution is always an Exponential distribution with a rate of 1000. This is
+easy to implement; to make sure that the function is actually used, we set a
+global variable changed when the function is called.
+```{r, custom_move_mu}
+move_mu <- function(param, config) {
+  NEW_MOVE_HAS_BEEN_USED <<- TRUE
+  param$mu <- rexp(1, 1000)
+  return(param)
+}
+moves <- custom_moves(mu = move_mu)
+quick_config <- list(n_iter = 500, sample_every = 1, find_import = FALSE)
+```
+Note that the new movement function `move_mu` has two arguments, and that we do
+not specify `config`. Internally, what happens is:
+```{r, bind_moves}
+## bind quick_config to function
+move_mu_intern <- bind_to_function(move_mu, config = quick_config)
+## new function has just one argument
+move_mu_intern
+## 'config' is in the function's environment
+names(environment(move_mu_intern))
+## 'config' is actually 'quick_config'
+identical(environment(move_mu_intern)$config, quick_config)
+```
+We perform a quick run using this new movement:
+```{r, run_custom_move_mu}
+NEW_MOVE_HAS_BEEN_USED <- FALSE
+set.seed(1)
+res_move_mu <- outbreaker(data, quick_config, moves = moves)
+NEW_MOVE_HAS_BEEN_USED
+plot(res_move_mu)
+plot(res_move_mu, "pi")
+plot(res_move_mu, "mu")
+```
+This short, full trace, clearly hasn't mixed well (as is to be expected). But
+while we see the effect of accept/reject sampling (Metropolis algorithm) for
+`pi` with a lot of autocorrelation, the trace of `mu` shows complete
+independence between successive values. While the Gibbs sampler used here is not
+correct, this result is: a Gibbs sampler will be more efficient than the
+classical Metropolis(-Hasting) algorithm for a given number a iterations.
+<br>
+## A new movement of ancestries
+Moves of ancestries are done in two ways in outbreaker: by picking ancestors at
+random from any prior case, and by swapping cases from a transmission
+link. Here, we implement a new move, which will propose infectors which have
+been infected on the same day of the current infector. As before, we will use
+global variables to keep track of the resulting movements (see `N_ACCEPT` and
+`N_REJECT`).
+```{r, new_move_ances}
+api <- get_cpp_api()
+new_move_ances <- function(param, data, custom_likelihoods = NULL) {
+for (i in 1:data$N) {
+    current_ances <- param$alpha[i]
+    if (!is.na(current_ances)) {
+      ## find cases infected on the same days
+      current_t_inf <- param$t_inf[current_ances]
+      pool <- which(param$t_inf == current_t_inf)
+      pool <- setdiff(pool, i)
+      if (length(pool) > 0) {
+        ## propose new ancestor
+        current_ll <- api$cpp_ll_all(data, param, i = i, custom_likelihoods)
+        param$alpha[i] <- sample(pool, 1)
+        new_ll <- api$cpp_ll_all(data, param, i = i, custom_likelihoods)
+        ## likelihood ratio - no correction, move is symmetric
+        ratio <- exp(new_ll - current_ll)
+        ## accept / reject
+        if (runif(1) <= ratio) { # accept
+          N_ACCEPT <<- N_ACCEPT + 1
+        } else { # reject
+          N_REJECT <<- N_REJECT + 1
+          param$alpha[i] <- current_ances
+        }
+      }
+    }
+  }
+  return(param)
+}
+moves <- custom_moves(new_move = new_move_ances)
+```
+We can now use this new move in our transmission tree reconstruction. We will
+use a shorter chain than the defaults as this new move is likely to be computer
+intensive.
+```{r, run_new_move, cache = TRUE}
+N_ACCEPT <- 0
+N_REJECT <- 0
+set.seed(1)
+res_new_move <- outbreaker(data, list(move_kappa = FALSE), moves = moves)
+N_ACCEPT
+N_REJECT
+```
+```{r, res_new_move}
+plot(res_new_move)
+plot(res_new_move, type = "alpha")
+summary(res_new_move)
+```
+Results show a switch to a new mode at about 5000 iterations. Let us compare the
+consensus tree to the actual one (store in `fake_outbreak$ances`):
+```{r, check_new_move}
+summary(res_new_move, burnin = 5000)
+tree2 <- summary(res_new_move, burnin = 5000)$tree
+comparison <- data.frame(case = 1:30,
+                         inferred = paste(tree2$from),
+             true = paste(fake_outbreak$ances),
+             stringsAsFactors = FALSE)
+comparison$correct <- comparison$inferred == comparison$true
+comparison
+mean(comparison$correct)
+```