--- title: "1. Overview" output: bookdown::html_document2: base_format: rmarkdown::html_vignette fig_caption: yes toc: true toc_depth: 2 number_sections: true vignette: > %\VignetteIndexEntry{1. Overview} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4 ) ``` # Introduction **bage** (= "Bayesian" + "age") implements Bayesian hierarchical models for rates, probabilities, and means, where the rates, probabilities and means are classified by variables such as age, sex, and region. Models in **bage** can be used for estimation and for forecasting. The models in **bage** are hierarchical in that they are built out of smaller submodels. A model for mortality rates, for instance, might contain submodels describing how rates vary over age, sex, and time. Internally, **bage** draws on package [TMB](https://CRAN.R-project.org/package=TMB) to fit the models. **TMB** is fast and can handle large datasets -- characteristics that are inherited by **bage**. This vignette introduces the main features of **bage**, using data on injuries. # Preliminaries ## Packages We begin by loading the packages that we will need for the analysis of the injuries data. Loading **bage** automatically loads package **rvec**, which contains functions for working with draws from probability distributions. Package **poputils** contains functions for working with demographic data. Packages **dplyr** and **tidyr** are core [tidyverse](https://www.tidyverse.org) packages for manipulating data. We use **ggplot2** to graph inputs and outputs. ```{r setup} library(bage) library(poputils) library(dplyr) library(tidyr) library(ggplot2) ``` ## Data We analyse a dataset called `nzl_injuries` that is included in the **bage** package. The dataset contains counts of fatal injuries and population in New Zealand, classified by age, sex, ethnicity, and year. ```{r} head(nzl_injuries) ``` ```{r} nzl_injuries |> filter(year %in% c(2000, 2006, 2012, 2018)) |> ggplot(aes(x = age_mid(age), y = injuries / popn, color = sex)) + facet_grid(vars(ethnicity), vars(year)) + geom_line() + xlab("age") + theme(legend.position = "top", legend.title = element_blank()) ``` # Specifying a model ## Functions for specifying models We specify a model in which, within each combination of age, sex, ethnicity, and year, injuries are treated as a random draw from a Poisson distribution. The function used to specify a Poisson model is `mod_pois()`. Binomial models can be specified with function `mod_binom()` and normal models with function `mod_norm()`. The first argument to `mod_pois()` is an R-style formula identifying the outcome variable and the combinations of predictor variables to be used in the model. ```{r} mod <- mod_pois(injuries ~ age * sex + age * ethnicity + year, data = nzl_injuries, exposure = popn) ``` ## Model structure The resulting model has the following structure: ```{r, echo=FALSE, out.width="90%", fig.cap = "Structure of model. Rectangles denote data, ellipses denote unknown quantities that are inferred within the model, solid arrows denote probabilistic relationships, and dashed arrows denote deterministic relationships."} knitr::include_graphics("vig1_dag.png") ``` The number of injuries occurring within each combination of age, sex, ethnicity, and time reflects (i) the population at risk and (ii) an underlying rate that in **bage** is referred to as `.fitted`. The expected value for `.fitted` is obtained by summing up values for the intercept, age effect, sex effect, and so forth. The actual value of `.fitted` can diverge from `.expected`: the amount of divergence is governed by the `disp` (dispersion) parameter. The model terms are all given "prior distributions". A prior distribution is a submodel capturing features of the unknown quantity or quantities being estimated. Possible features include the range within which the quantity is likely to fall, or the amount of smoothness expected in series of values. Priors distributions are a distinctive feature of Bayesian methods. ## Printing the model object Printing a model object provides information on its structure: ```{r} mod ``` The table in the the middle of the printout above shows the default prior distribution assigned to each term. We return to priors in Section \@ref(sec:priors). The bottom row of the printed object shows various model settings. `n_draw` is the number of random draws produced by extractor functions, which we discuss in Section \@ref(sec:outputs). `var_time`, `var_age`, and `var_sexgender` are the names of the variables in `nzl_injuries` that represent time, age, and sex or gender. If **bage** fails to correctly identify these variables, they can be identified using functions such as `set_var_time()`. # Fitting a model {#sec:fitting} ## fit() Function `mod_pois()` specifies a model, but does not actually estimate any of the unknown quantities in the model. For that, we need function `fit()`. ```{r} mod <- mod |> fit() ``` ## Reprinting the model object The printout for a fitted model differs from that of an unfitted model. ```{r} mod ``` Among other things, a new row appears at the bottom of the printout, providing information about the fitting process. The most important value piece of information is whether the calculations have converged. # Extracting outputs {#sec:outputs} ## Extractor functions Function `fit()` derives values for the unknown quantities, but does not return these quantities. For that, we need an extractor function. The two most important extractor functions in **bage** are `augment()` and `components()`. Both of these are generic functions that work on many sorts of R objects (see [here](https://generics.r-lib.org/index.html) and [here](https://www.tidymodels.org/learn/develop/broom/#glossary)). ## augment() `augment()` returns the original dataset plus some additional columns of estimated values, ```{r} aug <- mod |> augment() aug ``` The columns of estimated values are `.observed`, `.fitted`, and `.expected`. ```{r} aug |> select(.observed, .fitted, .expected) ``` In a Poisson model with exposures, - `.observed` is a "direct" estimate of the rate, obtained by dividing the the outcome variable by the population at risk; - `.expected` is a model-based estimate of the rate, based purely on model predictors (eg age, sex, time); and - `.fitted` is a model-based estimate of the rate that is a compromise between `.observed` and `.expected`. ## components() `components()` is used to extract values for higher-level parameters. It returns values for all the parameters. ```{r} comp <- mod |> components() comp ``` A little bit of manipulation is required to extract and graph the parameter of interest, ```{r} age_effect <- comp |> filter(term == "age", component == "effect") |> select(age = level, .fitted) age_effect ``` ## Posterior samples The output from a Bayesian model typically consists of draws from the "posterior distribution" for unknown quantities. The posterior distribution is a probability distribution that describes what we know about the unknown quantities in the model after applying our combining model assumptions (including priors) and data. ## Rvecs Draws from probability distributions can be awkward to work with, so **bage** uses a special type of vector, called an "rvec", implemented by package **rvec**. An rvec contains multiple draws, but tries to behave as much as possible like a standard numerical vector. The printout of `.fitted` and `.expected` in the `aug` object above shows medians and 95\% credible intervals. ## Graphing outputs The best way to understand output from a fitted model it to graph it. We first need to prepare data for the graph. We select values for 2018 and use the `draws_ci()` function from **rvec** to create 95\% credible intervals. ```{r} data_plot <- aug |> filter(year == 2018) |> mutate(draws_ci(.fitted)) data_plot |> select(starts_with(".fitted")) ``` We then use `ggplot()` to visualise the outputs. In the figure below we use use `geom_ribbon()` to plot credible intervals, `geom_line()` to plot model-based point estimates, and `geom_point()` to plot direct estimates. ```{r} ggplot(data_plot, aes(x = age_mid(age))) + facet_grid(vars(sex), vars(ethnicity)) + geom_ribbon(aes(ymin = .fitted.lower, ymax = .fitted.upper), fill = "lightblue") + geom_line(aes(y = .fitted.mid), color = "darkblue") + geom_point(aes(y = .observed), color = "red") + xlab("age") ``` # Priors {#sec:priors} ## Priors in Bayesian models In a Bayesian model, every unknown quantity needs a prior distribution. When the data provides lots of information on the value of an unknown quantity, the particular choice of prior distribution for that quantity does not normally have much effect on the ultimate estimates for the quantity. But when the data provides only limited information, different priors can lead to very different estimates. The choice of prior is particularly important when forecasting, since the data provides provides only indirect information about events in the future. ## Current priors in bage `help(priors)` produces a list of the priors that have been implemented so far in **bage**. Examples of priors are: - `NFix()`. Each element of the term being modeled is drawn from a normal distribution with mean `0`, and standard deviation `sd`. By default, `sd` is 1. - `N()`. Like `NFix()`, but `sd` is estimated from the data. - `RW()`. The elements of the term being modeled follow a random walk. This is appropriate for terms involving time and age, where neighboring elements are strongly correlated. - `AR1()`. First-order autoregressive process. Suitable for time series that revert to a mean value. ## Defaults **bage** uses the following rules to assign default priors to a model term: - if the term has less than 3 elements, use `NFix()`; - otherwise, if the term involves time, use `RW()`, with time as the `along' dimension; - otherwise, if the term involves age, use `RW()`, with age as the `along' dimension; - otherwise, use `N()`. Priors can be over-ridden using `set_prior()`: ```{r} mod <- mod |> set_prior(year ~ AR1()) ``` Replacing a prior deletes any existing estimates and returns a model to an 'unfitted' state. ```{r} is_fitted(mod) ``` So we re-fit the model. ```{r} mod <- mod |> fit() ``` ## SVD-based priors **bage** implements a special type of prior based on applying a singular value decomposition (SVD) to data from international databases. These SVD-based priors represent the age-sex patterns in demographic processes such as fertility, mortality, and labor force participation in a parsimonious way. The model below, for instance, uses SVD-based priors for age effects and age-time interactions in a model of regional fertility rates in Korea. Note that the number of free parameters (denoted `n_par_free`) for `age` and `age:time` in the printout below is much less than the total number of parameters (denoted `n_par`) for `age` and `age:time`. ```{r} mod_births <- mod_pois(births ~ age * region + age * time, data = kor_births, exposure = popn) |> set_prior(age ~ SVD(HFD)) |> set_prior(age:time ~ SVD_RW(HFD)) |> fit() mod_births ``` # Forecasting Forecasts can be constructed by calling function `forecast()` on a model object. When `output` is `"augment"` (the default), `forecast()` produces values like those produced by `augment()`. ```{r} aug_forecast <- mod |> forecast(labels = 2019:2028) names(aug_forecast) ``` When `output` is `"components"`, `forecast()` produces values like those produced by `components()`. ```{r} comp_forecast <- mod |> forecast(labels = 2019:2028, output = "components") comp_forecast ``` When `include_estimates` is `TRUE`, the return value includes historical estimates. This is useful for plotting. ```{r} data_forecast <- mod |> fit() |> forecast(labels = 2019:2028, include_estimates = TRUE) |> filter(sex == "Female", age %in% c("10-14", "25-29", "40-44")) |> mutate(draws_ci(.fitted)) ggplot(data_forecast, aes(x = year)) + facet_grid(vars(age), vars(ethnicity)) + geom_ribbon(aes(ymin = .fitted.lower, ymax = .fitted.upper), fill = "lightblue") + geom_line(aes(y = .fitted.mid), color = "darkblue") + geom_point(aes(y = .observed), color = "red") ``` # Imputation **bage** automatically accommodates missing values in the outcome variables. We illustrate with a version of the injuries dataset where values for 2010--2014 are set to `NA`. ```{r} years_mis <- 2010:2014 injuries_mis <- nzl_injuries |> mutate(injuries = if_else(year %in% years_mis, NA, injuries)) ``` We fit our exactly the same model that we use for the complete dataset. ```{r} mod_mis <- mod_pois(injuries ~ age * sex + age * ethnicity + year, data = injuries_mis, exposure = popn) |> fit() ``` **bage** creates a new variable, called `.injuries` containing imputed values for the missing outcomes. ```{r} mod_mis |> augment() |> filter(year %in% years_mis) ``` Rates estimates in years where the outcome is missing have wider credible intervals than rates estimates in years where the outcome is observed. ```{r} data_plot_mis <- mod_mis |> augment() |> filter(age == "20-24") |> mutate(draws_ci(.fitted)) ggplot(data_plot_mis, aes(x = year)) + facet_grid(vars(sex), vars(ethnicity)) + geom_ribbon(aes(ymin = .fitted.lower, ymax = .fitted.upper), fill = "lightblue") + geom_line(aes(y = .fitted.mid), color = "darkblue") + geom_point(aes(y = .observed), color = "red") + xlab("age") ``` # Data models So far we have ignored the possibility that the input data for our models might be subject to measurement errors. However, `injuries` variable in the `nzl_injuries` dataset is subject to a special sort of error: Statistics New Zealand has randomly-rounded the values to base 3 to protect confidentiality. To deal with measurement error, we add a 'data model' to our model. ```{r} mod <- mod |> set_datamod_outcome_rr3() |> fit() ``` The results from calling `augment()` now include a variable called `.injuries` with estimated values for the true, unrounded injury counts. ```{r} mod |> augment() ``` # Model checking ## Replicate data A standard Bayesian approach to checking a model is to use the model to generate simulate data and see if the simulated data looks like the actual data. Function `replicate_data()` creates multiple sets of simulated data. ```{r} rep_data <- mod |> replicate_data() rep_data ``` Comparing full datasets is difficult, so the usual strategy is to calculate summary measures that capture some important feature of the data, and compare those instead. Here we see if the model is properly capturing male-female differences in injury rates. ```{r} sex_ratio <- rep_data |> count(.replicate, year, sex, wt = injuries) |> pivot_wider(names_from = sex, values_from = n) |> mutate(ratio = Male / Female) sex_ratio ``` We graph the results and see if the original data looks like it was drawn from the same underlying distribution as the simulated data. ```{r} ggplot(sex_ratio, aes(x = year, y = ratio)) + facet_wrap(vars(.replicate)) + geom_line() ``` ## Simulation studies A simulation study, where we create the data ourselves, and hence know what the underlying true values are, can be a useful way of assessing model performance. In the example below, we use function `report_sim()` to perform a simple simulation study where the true population is generated using a first-order random walk, but the estimation model assumes that the population is generated using a second-order random walk. ```{r} set.seed(0) ## Create simulated data fake_data <- data.frame(year = 2001:2010, population = NA) ## Define the true data-generating model mod_rw <- mod_pois(population ~ year, data = fake_data, exposure = 1) |> set_prior(`(Intercept)` ~ NFix(sd = 0.1)) |> set_prior(year ~ RW(s = 0.1, sd = 0.1)) ## Define the estimation model mod_rw2 <- mod_pois(population ~ year, data = fake_data, exposure = 1) |> set_prior(year ~ RW2()) ## Run the simulation report_sim(mod_est = mod_rw2, mod_sim = mod_rw) ``` ## Prior predictive checks Another way of gaining insights about a model is to look at estimates based purely on the priors, without using data on the outcome variable. In **bage**, this can be done by calling `augment()`, `components()` or `forecast()` on an unfitted version of the model. ```{r} mod |> unfit() |> components() ``` # Future development of bage ## Future features **bage** is a new package, and still under very active development. Some features that we intend to add are: - **Covariates** The ability to add covariates as predictors in a model, eg data on regional incomes in a model of regional mortality. - **Data models** Additional data models, eg models that allow for under-reporting of events, or errors in the exposure variable. - **Priors** More options for priors, eg a damped linear trend. - **Sets of priors** Pre-specified collections of priors for specific purposes such as modelling fertility rates - **Documentation** More vignettes and examples. - **Model choice** Tools for model comparison and model choice ## Experimental status **bage** currently has an [![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental) life cycle badge, to warn users that some features of the **bage** interface, such as function arguments, are still evolving. From version 0.9.0 onwards, however, we will use a formal deprecation process to manage any changes to existing function arguments or defaults. ## Bug reports and feature requests We would be grateful for bug reports or suggestions for features. The best way to do so is to raise an issue on the [bage GitHub repository](https://github.com/bayesiandemography/bage/issues).