---
title: "9. Covariates"
output:
  bookdown::html_document2:
    base_format: rmarkdown::html_vignette
    fig_caption: yes
    toc: true
    toc_depth: 2
    number_sections: true
pkgdown:
  as_is: true
vignette: >
  %\VignetteIndexEntry{9. Covariates}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
set.seed(0)
```

# Introduction

Covariates can be used to extend standard **bage** models in two ways:

1. incorporating information beyond what is contained in classifying variables, such as age, sex, or region; and 
2. dealing with subsets of the data with unusual behavior.

This vignette gives a brief description of covariates in **bage** and then illustrates their use with a case study of births in South Korea.


# Mathematical details

A model in **bage** typically has a structure that is something like:
\begin{align}
  y_{ast} & \sim \text{Poisson}(\gamma_{ast} w_{ast}) \\
  \log \gamma_{ast} & \sim \text{Gamma}(\xi^{-1}, (\mu_{ast} \xi)^{-1}) \\
  \log \mu_{ast} & = \beta^{(0)} + \beta_{as}^{\text{age:sex}} + \beta_t^{\text{time}} (\#eq:priormod)
\end{align}
where

- $y_{ast}$ is the outcome for people in age group $a$ and sex $s$ during period $t$;
- $w_{ast}$ is exposure;
- $\gamma_{ast}$ is a rate;
- $\xi$ governs overall dispersion;
- $\beta^{(0)}$ is an intercept;
- $\pmb{\beta}^{\text{age:sex}}$ and $\pmb{\beta}^{\text{time}}$ are terms formed the classifying variables age, sex, and time;
- $\beta_{as}^{\text{age:sex}}$ and $\pmb{\beta}_t^{\text{time}}$ are elements of these terms; and
- the intercept and the terms formed from the classifying variables all have prior distributions.

In a model with covariates, \@ref(eq:priormod) changes to
\begin{equation}
  \log \mu_{ast} = \beta^{(0)} + \beta_{as}^{\text{age:sex}} + \beta_t^{\text{time}} + (\pmb{Z} \pmb{\zeta})_{ast}
\end{equation}

where

- $\pmb{Z}$ is a $N \times P$ covariate matrix;
- $\pmb{\zeta}$ is a $P$-element vector of coefficients; and
- $(\pmb{Z} \pmb{\zeta})_{ast}$ is an element from the vector obtained by multiplying $\pmb{Z}$ and $\pmb{\zeta}$.


Change \@ref(eq:prior-mod-no-cov) to
\begin{equation}
  \mu_i = \sum_{m=0}^M \beta_{j_i^m}^{(m)} + (\pmb{Z} \pmb{\eta})_i (\#eq:prior-mod-no-cov)
\end{equation}

where 
- $\pmb{Z}$ is an $I \times P$ matrix of covariates; and
- $\pmb{\eta}$ is a vector of coefficients. 

The covariate matrix $\pmb{Z}$ is derived from the raw covariate data by scaling any numeric variables to have mean 0 and standard deviation 1, and by converting any categorical variables to sets of indicator variables. The conversion to indicator variables follows the rules that R uses for "treatment" contrasts. If the categorical has $C$ categories, then $C-1$ indicator variabls are constructed, with the first category being omitted.

Each element of $\pmb{\eta}$ has prior
\begin{equation}
  \eta_p \sim \text{N}(0, 1)
\end{equation}


# Example: Births in South Korea

To illustrate the use of covariates, we will analyse data on births in South Korea.

## Preliminaries

Besides **bage** itself, we use **dplyr** and **vctrs** for data manipulation, and **ggplot2** for plotting results.

```{r}
suppressPackageStartupMessages({
  library(bage)
  library(dplyr)
  library(ggplot2)
})

```

Our data is a subset of the the data frame `kor_births`, which is part of **bage**.

```{r}
births <- kor_births |>
  filter(region %in% levels(region)[1:5])
```

The data frame gives numbers of births disaggregated by age of mother, region, and year. It also contains a numeric variable called `gdp_pc_2023` that gives GDP per capita (in thousands of US dollars) in 2023, and a categorical variable (with levels `"Low"`, `"Medium"`, and `"High"`) describing population density in 2020.

```{r}
births
```

## Covariates that bring in extra information

The variables `gdp_pc_2023` and `dens_2020` are both examples of covariates that contribute extra information to the model, beyond what is contained in the outcome, exposure, and classifying variables. 

We use function `set_covariates()` to instruct `mod_pois()` to treat these variables as covariates.
```{r}
mod_gdp_dens <- mod_pois(births ~ (age + region + time)^2,
                         data = births,
                         exposure = popn) |>
  set_covariates(~ gdp_pc_2023 + dens_2020) |>
  fit()
mod_gdp_dens
```

To obtain estimates of the coefficients (ie estimates of the $\zeta_p$) we call function `components()` and filter out rows for the `"covariates"` term.

```{r}
mod_gdp_dens |>
  components() |>
  filter(term == "covariates")
```


## Covariates that allow for unusual subsets

In East Asia, years with the Dragon zodiac sign sometimes have larger-than-usual numbers of births. To allow for this possibility, we create a dragon-year covariate, and incorporate it into a new model.

```{r}
births <- births |>
  mutate(is_dragon_year = time == 2012)
mod_dragon <- mod_pois(births ~ (age + region + time)^2,
                      data = births,
                      exposure = popn) |>
  set_covariates(~ is_dragon_year) |>
  fit()

mod_dragon |>
  components() |>
  filter(term == "covariates")
```

There is some evidence for extra births, though there is substantial uncertainty about the size of the effect.

Next we expand the model to allow the dragon-year effect to differ across age groups. We create a variable that takes the values `"baseline"` in all years, except in dragon years, when it takes the name of the age group. We turn this variable into a factor with `"baseline"` as its first level.

```{r}
births <- births |>
  mutate(is_dragon_year_age = if_else(time == 2012, age, "baseline"),
         is_dragon_year_age = factor(is_dragon_year_age, 
                                     levels = c("baseline", unique(age))))
births |>
  filter(time %in% 2011:2013)
```

We create a new model with the age-sepcific dragon-year indicator.

```{r}
mod_dragon_age <- mod_pois(births ~ (age + region + time)^2,
                         data = births,
                         exposure = popn) |>
  set_covariates(~ is_dragon_year_age) |>
  fit()
mod_dragon_age
```

Rather than a single dragon-year coefficient, we have a coefficient for each age group. We extract them and tidy up the labels.

```{r}
mod_dragon_age |>
  components() |>
  filter(term == "covariates") |>
  mutate(age = sub("is_dragon_year_age", "", level)) |>
  select(age, .fitted)
```


## Forecasting

If all the covariates in a model are fixed, then the model can be forecasted as normal. 

```{r}
mod_gdp_dens |>
  forecast(labels = 2024:2025)
```

If, however, a covariate varies over time, forecasting only works if values for future periods are provided. The following code will result in an error:
```
mod_dragon |>
  forecast(labels = 2024:2025)
```

Instead we need to create a `newdata` data frame...

```{r}
newdata <- expand.grid(age = unique(kor_births$age),
                       region = unique(kor_births$region),
                       time = 2024:2025) |>
  mutate(is_dragon_year = FALSE)
head(newdata)
```

...and supply it to `forecast()`.

```{r}
mod_dragon |>
  forecast(newdata = newdata)
```