---
title: 'Using "memoria" and "virtualPollen" together'
subtitle: 'Quantifying ecological memory of virtual pollen curves'
author: "Blas M. Benito"
fig_width: 6
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using memoria}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

<!-- --- -->
<!-- title: 'Using "memoria" and "virtualPollen" together' -->
<!-- subtitle: 'Quantifying ecological memory of virtual pollen curves' -->
<!-- author: "Blas M. Benito" -->
<!-- output: -->
<!--   pdf_document:  -->
<!--     df_print: kable -->
<!--     fig_caption: yes -->
<!--     fig_width: 9 -->
<!--     highlight: tango -->
<!--     includes: -->
<!--       in_header: header.tex -->
<!--     keep_tex: yes -->
<!--     latex_engine: xelatex -->
<!--     number_sections: yes -->
<!--     toc: yes -->
<!--     toc_depth: 2 -->
<!-- code_folding: show -->
<!-- --- -->



<!-- knit_hooks$set(document  =  function(x) {sub('\\usepackage[]{color}', '\\usepackage{xcolor}', x, fixed = TRUE)}) -->

<!-- main info on pdf and notebooks -->
<!-- http://rmarkdown.rstudio.com/pdf_document_format.html#latex_engine -->

<!-- check xelatex fonts -->
<!-- fc-list : family -->
<!-- fc-list :fontformat = TrueType -->

<!-- Equations cheatsheet -->
<!-- $a_{b}$ -->
<!-- $a^{b}$ -->
<!-- $\alpha, \beta, \ldots$ -->
<!-- $\sum_{n = 1}^N$ -->
<!-- $\times$ -->
<!-- $\prod_{n = 1}^N$ -->
<!-- $<, \leq, \geq$ -->
<!-- $\sim$ -->
<!-- $\widehat{\alpha}$ -->
<!-- $\bar{x}$ -->
<!-- $\frac{a}{b}$ -->
<!-- $\left(\frac{a}{b}\right)$ -->

```{r, echo = FALSE, warning = FALSE, message = FALSE, eval=FALSE}

#checking if required packages are installed, and installing them if not
list.of.packages <- c("ggplot2", "cowplot", "knitr", "viridis", "tidyr", "formatR", "grid", "zoo", "ranger", "rpart", "rpart.plot", "HH", "kableExtra", "magrittr", "stringr", "dplyr", "devtools")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages, dep = TRUE ,repos = "http://cran.us.r-project.org")

#install virtualPollen if not installed
if(!("virtualPollen" %in% installed.packages())){
  library(devtools)
  install_github("blasbenito/virtualPollen")
}

#install memoria if not installed
if(!("memoria" %in% installed.packages())){
  library(devtools)
  install_github("blasbenito/memoria")
}

# source("ecological_memory_functions.R")
library(virtualPollen)
library(memoria)
library(ggplot2) #plotting library
library(cowplot) #plotting library
library(viridis) #pretty plotting colors
library(grid)    #plotting 
library(tidyr)
library(formatR)
library(zoo)     #time series analysis
library(HH)      #variance inflation factor (multicollinearity analysis)
library(kableExtra) #to fit tables to pdf page size
library(magrittr) #kableExtra requires pipes
library(pdp)     #partial dependence plots
library(ranger)  #fast Random Forest implementation
library(rpart)   #recursive partitions trees
library(rpart.plot) #fancy plotting of rpart models
library(stringr) #to parse variable names
library(knitr)

options(scipen = 999)

# setting code font size in output pdf, from https://stackoverflow.com/a/46526740
def.chunk.hook  <- knitr::knit_hooks$get("chunk")
knitr::knit_hooks$set(chunk = function(x, options) {
  x <- def.chunk.hook(x, options)
  ifelse(options$size != "normalsize", paste0("\\", options$size,"\n\n", x, "\n\n \\normalsize"), x)
})

#trying to line-wrap code in pdf output
#from https://github.com/yihui/knitr-examples/blob/master/077-wrap-output.Rmd
knitr::opts_chunk$set(echo = TRUE, fig.pos = "h")
  opts_chunk$set(tidy.opts = list(width.cutoff = 80), tidy = FALSE)

rm(list.of.packages, new.packages)

#colors
colorexo <- viridis(3)[2]
colorendo <- viridis(3)[1]

```



**Summary**

This document describes in detail how to analyze ecological memory patterns present in simulated pollen curves generated with the *virtualPollen* and *memoria* packages. First, we describe the complex statistical properties of the virtual pollen curves produced by *virtualPollen* and how these may impact ecological memory analyses; Second we explain how Random Forest works, from its basic components (regression trees) to the way in which it computes variable importance;  Third, we explain how to analyze ecological memory patterns on the simulation outputs.

   
\pagebreak

# The model

Analyzing ecological memory requires to fit a model of the form: 

**Equation 1** (simplified from the one in the paper): \Large $$p_{t} = p_{t-1} +...+ p_{t-n} + d_{t} + d_{t-1} +...+ d_{t-n}$$ \normalsize

Where:

+  $p$ is *Pollen*.
+  $d$ is *Driver*.
+  $t$ is the time of any given value of the response $p$.
+  $t-1$ is the lag 1.
+  $p_{t-1} +...+ p_{t-n}$ represents the endogenous component of ecological memory.
+  $d_{t-1} +...+ d_{t-n}$ represents the exogenous component of ecological memory.
+  $d_{t}$ represents the concurrent effect of the driver over the response.

The function *prepareLaggedData* shown below organizes the input data in *lags*. It requires to identify what columns in the original data should act as response, drivers, and time, and what lags are to be generated.

```{r, size="small", eval=FALSE}
#loading data
data(simulation) #from virtualPollen
sim <- simulation[[1]]

#generating vector of lags (same as in paper)
lags <- seq(20, 240, by = 20)

#organizing data in lags
sim.lags <- prepareLaggedData(
  input.data = sim,
  response = "Pollen",
  drivers = c("Driver.A", "Suitability"),
  time = "Time",
  lags = lags,
  scale = FALSE
  )
```

This function returns the data shown in **Table 1**. This kind of data structure is known as *lagged data* or *time delayed data*. Note that the function can use a *scale* argument (set to FALSE above) to standardize the data before generating the lags. Random Forest does not generally require standardization to fit accurate models of the data, but computing variable importance with variables having large differences in range (i.e. [1, 10] vs. [1, 10000]) might yield biased results, making standardization a recommended step in data preparation. In this appendix all data are shown without any standardization to let the reader to keep track of the different variables across analyses and have a sense of their magnitude, but note that all analyses presented in the paper were based on standardized data.


```{r, echo=FALSE, eval=FALSE}
#copy to be modified for the table
sim.lags.table <- sim.lags

#printing table
row.names(sim.lags.table)<-NULL
sim.lags.table$time<-NULL
sim.lags.table <- round(sim.lags.table, 1)
kable(round(sim.lags.table[1:25, 1:26], 2), col.names = c(paste("p", c(0, lags), sep = ""), paste("d", c(0, lags), sep = "")), caption="First rows of the lagged data. Numbers represent lag in years, letter p represents pollen, and letter d represents driver. Column p0 (in bold) indicates the response variable", booktabs = T, format="latex") %>% kable_styling(latex_options = c("scale_down", "hold_position", "striped")) %>% column_spec(1, bold=T)
 #fits table to page width

rm(sim.lags.table)
```

The data in **Table 1** are organized to fit the model described by **Equation 1**, but to select a proper method to fit the model, three main features of the data have to be considered first: **temporal autocorrelation**, **multicollinearity**, and **non-linearity**.

# The data

The simulations produced by *virtualPollen* have some key properties that justify the use of Random Forest as analytical tool.

## Temporal autocorrelation

Temporal autocorrelation (also *serial correlation*) refers to the relationship between successive values of the same variable present in most time series. Temporal autocorrelation generates autocorrelated residuals in regression analysis, violating the assumption of "independence of errors" required to correctly interpret regression coefficients. Several methods can be used to address temporal autocorrelation in regression analysis, such as increasing time intervals between consecutive samples, or incorporating an auto-regressive structure into the model.

Every variable used in our study presents this characteristic. The driver was generated with a temporal autocorrelation significant for periods of 600 years. The suitability produced by the niche function of the virtual taxa based on the values of the driver also presents temporal autocorrelation, but generally lower than the one of the driver. Finally, the response, since it is the result of a dynamic model in which every data-point depends on the previous one, also shows a temporal structure, which varies depending on the taxa's traits, as so does the suitability (see **Figure 2**). 



```{r, fig.height = 3, fig.width = 9, echo = FALSE, fig.cap = "Temporal autocorrelation of the variables in the example data.", eval=FALSE}

#plotting autocorrelation of the driver
p.driver <- ggplot(data = acfToDf(sim.lags[,"Driver.A_0"], 400, 40), aes(x = lag, y = acf)) +
  geom_hline(aes(yintercept = 0)) + 
  geom_hline(aes(yintercept = ci.max), color = "red", linetype = "dashed") +
  geom_hline(aes(yintercept = ci.min), color = "red", linetype = "dashed") +
  geom_segment(mapping = aes(xend = lag, yend = 0)) +
  ggtitle("Driver") +
  xlab("") +
  ylab("Pearson correlation") +
  scale_y_continuous(limits = c(-0.3, 1))

#plotting autocorrelation of the suitability
p.suitability <- ggplot(data = acfToDf(sim.lags[,"Suitability_0"], 400, 40), aes(x = lag, y = acf)) +
  geom_hline(aes(yintercept = 0)) + 
  geom_hline(aes(yintercept = ci.max), color = "red", linetype = "dashed") +
  geom_hline(aes(yintercept = ci.min), color = "red", linetype = "dashed") +
  geom_segment(mapping = aes(xend = lag, yend = 0)) +
  ggtitle("Suitability") +
  xlab("Lag (years)") +
  ylab("") +
  theme(axis.title.y = element_blank(), 
        axis.line.y = element_blank(), 
        axis.ticks.y = element_blank(), 
        axis.text.y = element_blank())+
  scale_y_continuous(limits = c(-0.3, 1))

#plotting autocorrelation of the response
p.response <- ggplot(data = acfToDf(sim.lags[,"Response_0"], 400, 40), aes(x = lag, y = acf)) +
  geom_hline(aes(yintercept = 0)) + 
  geom_hline(aes(yintercept = ci.max), color = "red", linetype = "dashed") +
  geom_hline(aes(yintercept = ci.min), color = "red", linetype = "dashed") +
  geom_segment(mapping = aes(xend = lag, yend = 0)) +
  ggtitle("Response") +
  xlab("") +
  ylab("")+
  scale_y_continuous(limits = c(-0.3, 1)) +
  theme(axis.title.y = element_blank(), 
        axis.line.y = element_blank(), 
        axis.ticks.y = element_blank(), 
        axis.text.y = element_blank())


plot_grid(p.driver, p.suitability, p.response, ncol = 3, rel_widths = c(1.2, 1, 1)) + theme(plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"))

rm(p.driver, p.response, p.suitability)

```

## Multicollinearity

Multicollinearity occurs when there is a high correlation between predictors in a model definition. It increases the standard error of the coefficients, meaning that their estimates for important predictors can become statistically insignificant, wildly impacting model interpretation. 

Adding consecutive time-lags of the same variables to the data, as required by the model expressed in **Equation 1** largely increases multicollinearity. 

```{r , echo = FALSE, message=FALSE, warning=FALSE, error=FALSE, cache=FALSE, eval=FALSE}

#generates a list to save vif results
vif.list <- list()

#iterates through datasets Annual, 1cm, 2cm, 6cm, and 10cm
for(i in 1:4){
  
  #getting simulation output
  sim.temp <- simulation[[i]]
  
  

  #generating lags
  sim.temp.lags <- prepareLaggedData(input.data = sim.temp,
                            response = "Pollen",
                            drivers = "Driver.A",
                            time = "Time",
                            lags = lags)
  
  #removing time
  sim.temp.lags$time <- NULL

  #computing varianca inflation factor (vif function from HH library)
  vif.list[[i]] <- data.frame(HH::vif(sim.temp.lags[,2:ncol(sim.temp.lags)]))
  
}

#list results into dataframe
vif.df <- do.call(cbind, vif.list)

#colnames
colnames(vif.df) <- c("Taxon 1", "Taxon 2", "Taxon 3", "Taxon 4")

#rownames
rownames(vif.df) <- c(paste("p", lags, sep = ""), paste("d", c(0, lags), sep = ""))

#rounding
vif.df <- round(vif.df, 1)
```


```{r , echo = FALSE, cache=FALSE, eval=FALSE}
kable(vif.df, caption = "Variance inflation factor (VIF) of the predictors used in the simulations. VIF values higher than 5 indicate that the given predictor is a linear combination of other predictors.", booktabs = T, format="latex")  %>% kable_styling(latex_options = c("hold_position", "striped"))

```

```{r , echo = FALSE, cache=FALSE, message=FALSE, warning=FALSE, error=FALSE, eval=FALSE}
rm(vif.df, vif.list, sim.temp, sim.temp.lags)
```

\newpage

## Non-linearity

The function *virtualPollen::simulatePopulation* has the ability to produce pollen abundances variably decoupled from environmental conditions depending on the life-traits and niche features of the virtual taxa considered. This model property increases the chance of finding non-linear relationships between time-lagged predictors and the response (see **Figure 3**), hindering the detection of meaningful relationships with methods not able to account for non-nonlinearity.



```{r, fig.height = 8, fig.width = 9, echo = FALSE, fig.cap = "Linear and non-linear relationships arising from lagged data in the annual dataset.", cache=FALSE, eval=FALSE}

#creating copy of sim.lags
sim.lags.plot <- sim.lags

#shorter colnames for sim.lags to simplify header of output table
colnames(sim.lags.plot) <- c(paste("Pollen_", c(0, lags), sep = ""), paste("Driver_", c(0, lags), sep = ""), paste("Suitability_", c(0, lags), sep = ""), "time")

#getting a few lags only for simpler plotting
sim.lags.plot <- sim.lags.plot[, c("Pollen_0", "Pollen_20", "Pollen_120", "Pollen_220", "Driver_20", "Driver_120", "Driver_220", "Suitability_20", "Suitability_120", "Suitability_220", "time")]

#to long format for easier plotting
sim.lags.plot.long <- gather(sim.lags.plot, variable, value, 2:10)

#keeping levels in order
sim.lags.plot.long$variable = factor(sim.lags.plot.long$variable, levels = c("Pollen_20", "Pollen_120", "Pollen_220", "Driver_20", "Driver_120", "Driver_220", "Suitability_20", "Suitability_120", "Suitability_220"))

#plot
ggplot(data = sim.lags.plot.long, aes(y = Pollen_0, x = value, group = variable, color = time)) + 
  geom_point(alpha = 0.4, shape = 16, size = 2) +
  facet_wrap("variable", scales = "free_x", ncol = 3) +
  scale_color_viridis() +
  theme(legend.position = "bottom", 
        legend.key.width = unit(2.5,"cm"), 
        panel.spacing = unit(1, "lines"),
        plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm")) +
  ylab("Response (Pollen_0)") +
  labs(color = "Time (years)")

rm(sim.lags.plot, sim.lags.plot.long)
```

# The logics behind Random Forest 

## The trees

The fundamental units of a Random Forest model are **regression trees**. A regression tree grows through **binary recursive partition**. Considering a *response* variable, and a set of *predictive variables* (also named *features* in the machine learning language), the following steps grow a regression tree:


+ For each variable, the point in their ranges that optimizes the separation (partition) of the response in two groups of cases is searched for. The selected point minimizes the sum of the squared deviations from the mean in the two separated partitions.
+ The variable with the lower sum of the squared deviations from the means is selected as the *root node* of the tree, and the data are separated in two partitions, one on each side of the *split value* of the given variable.
+ The steps above are recursively applied to each partition to create new partitions, until all cases are in partitions that can be no longer separated in smaller partitions because they are too homogeneous, or because they have reached a minimum sample size. These final partitions are named *terminal nodes*.
  
The code below shows how to fit a recursive partition tree with the *rpart* library on the first lag (20 years) of pollen and driver of the data shown in **Figure 2**.


   
```{r, fig.height = 3, fig.width = 9, fig.cap = "Recursive partition tree (also regression tree) of pollen abundance (noted as Response_0 in the model) as a function of Response_20 (antecedent pollen abundance at lag 20) and Driver_20 (antecedent driver values at lag 20). Numbers in branches represent split values, while numbers in terminal nodes represent the predicted average for that particular node. Percentages represent the relative number of cases included in each terminal node.", message = FALSE, error = FALSE, warning = FALSE, cache=FALSE, size="small", eval=FALSE}

#fitting model (only two predictors)
rpart.model <- rpart(
  formula = Response_0 ~ Response_20 + Driver.A_20,
  data = sim.lags,
  control = rpart.control(minbucket = 5)
  )

#plotting tree
rpart.plot(
  rpart.model, 
  type = 0, 
  box.palette = viridis(10, alpha = 0.2)
  )
```


```{r , fig.height = 5, fig.width = 9, size = "footnotesize", fig.cap = "Recursive partition surface generated by the model fitted above. Dots represent observed data, and colours identify partitions shown in the recursive partition tree. Note that this partition surface can be generalized to any number of predictors/dimensions.", message = FALSE, error = FALSE, warning = FALSE, echo=FALSE, cache=FALSE, eval=FALSE}

#plotting binary partition
plotInteraction(model = rpart.model, 
                data = sim.lags, 
                x = "Driver.A_20", 
                y = "Response_20", 
                z = "Response_0",
                point.size.range = c(0.1, 5)) 
``` 
 
 
```{r, echo=FALSE, eval=FALSE}
rm(rpart.model)
```
 
 
**Figure 4** shows the recursive partition tree fitted on Pollen_0 as a function of the first lag of pollen (Pollen_20) and the driver (Driver_20), while **Figure 5** shows the partitions in the space of both variables. As shown in both figures, the recursive partition tree is, in essence, separating the cases into regions in which given combinations of the predictors lead to certain average values of the response. The tree also shows the hierarchy in importance between both predictors, with *Pollen_20* defining all splits but one. Only when *Pollen_20* is higher than 3772, the variable *Driver_20* becomes important, indicating that maximum abundances are only reached after that point, and only if *Driver_20* has a value lower than 71. This is how partial interactions among predictors are expressed in recursive partition trees.

The tree has grown until data in the terminal nodes cannot be separated further into additional partitions, or has reached the minimum number of cases defined by the variable *minbucket*. The minimum amount of cases in a terminal node defines the overall resolution of the model. Smaller numbers lead to a higher amount of terminal nodes, and therefore to more partitions in the data space. This can be confirmed by changing the *minbucket* value in the code above, and assessing subsequent changes in tree structure and number of partitions.

As a drawback, the splits of a recursive partition trees are highly sensitive to small changes in the input data, specially when sample size is small. This instability has led to the development of more sophisticated methods to fit recursive partition trees, such as *conditional inference trees* (see function *ctree* in library *partykit*), or ensemble methods such as Random Forest.

## The forest

A Random Forest model is composed by a large number of individual regression trees (500 or more) generated on random subsamples of the predictors and the cases. For a random set of cases, each tree is fitted as follows:


+ A random subset of predictors of size *mtry* is selected. The default *mtry* is the rounded-up squared root of the total number of predictors, but the user can modify it.
+ The predictor from the random subset that better separates the data into two partitions is selected as *root node*, an the data are separated in two partitions, one at each side of the *split value*.
+ On each partition, a new random subset of predictors of size *mtry* is selected (and this is the main difference between a recursive partition tree and a Random Forest tree, the former uses all variables on each split), and again the predictor that better separates the partition into two new partitions is selected, and new partitions are defined.
+ The tree is grown until minimum node size is reached in all terminal nodes, or no further partitions can be defined.
+ The tree is evaluated by computing its mean squared error (mse) on the ~37% of the data not used to train it (named *out-of-bag data*).
+ For each variable in the tree the algorithm performs a permutation test as follows:
  + The column with the given variable is randomly permuted.
  + A new tree is fitted with the permuted variable.
  + Mean squared error is computed again on the *out-of-bag* data.
  + Difference in mse between the tree fitted with the original variable and the tree fitted with the permuted one is computed and stored.
          
Once all trees have been fitted, for every given case, the prediction is computed as the mode of the individual predictions of every tree (but not the ones fitted with permuted variables). The importance of every variable is computed as the average of the differences in mean squared error between trees computed with the variable and trees computed with the permuted variable, normalized by the standard deviation of the differences.

Random Forest does not require any assumptions to be fulfilled by the data or the model outcomes, and therefore it can be applied to a wide range of analytic cases where data are non-linear. As a drawback, the randomness in the selection of cases and predictors to fit individual regression trees turns it into a non-deterministic algorithm, and therefore, fine-scale variations in the outcomes are to be expected between different runs with the same data.

To fit Random Forest models on the simulated data we selected the package *ranger* over the more traditional *randomForest* because the former allows multithread computing (uses all available cores in a computer while fitting the forest), achieving a better performance for large datasets than the later. The code below shows how to use *ranger* to fit a Random Forest model.


```{r , fig.height = 4, fig.cap = "Random Forest output for the interaction between Pollen_20 and Driver_20, to allow the comparison with the recursive partition tree shown in Figure 4. To compute this interaction surface, all other variables in the model are centered in their mean.", message = FALSE, error = FALSE, warning = FALSE, results="hide", cache=FALSE, size="small", eval=FALSE}

#getting columns containing "Response" or "Driver"
sim.lags.rf <- sim.lags[, grepl("Driver|Response", colnames(sim.lags))]

#fitting a Random Forest model
rf.model <- ranger(
  data = sim.lags.rf,
  dependent.variable.name = "Response_0",
  num.trees = 500,
  min.node.size = 5, 
  mtry = 2,
  importance = "permutation", 
  scale.permutation.importance = TRUE)

#model summary
print(rf.model)

#R-squared (computed on out-of-bag data)
rf.model$r.squared

#variable importance
rf.model$variable.importance

#obtain case predictions
rf.model$predictions

#getting information of the first tree
treeInfo(rf.model, tree=1)
```

The function *ranger* has the following key arguments:

   
+ **data**: dataframe with the variables to be included in the model.
+ **dependent.variable.name**: model definition can be done in two ways, either through a formula, or through the argument *dependent.variable.name*, that names the response variable, and uses the remaining variables in the dataset as predictors, which forces us to be careful with what variables are available in the dataset. 
+ **num.trees**: controls number of trees generated (the default value is 500).
+ **mtry**: controls the number of variables randomly selected to fit each tree. In the code above this argument is set to 2, indicating that the model only considers interactions among two predictors only on each tree. This allows to compute variable importance as independently as possible from other variables. 
+ **min.node.size**: minimum number of cases in a terminal node, which determines the overall resolution of the model.
+ **importance**: when set to "permutation" it triggers the computation of variable importance through permutation tests.
+ **scale.permutation.importance**: scales importance values computed through the permutation tests by the overall standard error.
    
The relationship between the response variable and the predictors can be examined through *partial dependence plots* (see **Figure 6**). A partial dependence plot is a simplified view of the inner structure of the model. Since regression trees consider interactions among variables, the prediction for any given case depends on the values of all predictors considered at the same time. Since it is not possible to generate such a representation in 2D or 3D, partial dependence plots set all variables not represented in the plot to their respective means. Therefore, partial dependence plots must be interpreted as simple approximations to the true shape of the model surface.


    
```{r , fig.width = 9, fig.height = 3, size = "footnotesize", fig.cap = "Partial dependence plots of the lags 20, of Pollen (A) and Driver (B), and the concurrent effect (Driver_0, panel C).", message = FALSE, error = FALSE, warning = FALSE, echo = FALSE, cache=FALSE, eval=FALSE}
#partial dependence plots

#some things for the plots
gg.scale <- scale_y_continuous(limits = c(2000, 3500))
no.y.axis <- theme(axis.line.y = element_blank(), 
                   axis.ticks.y = element_blank(), 
                   axis.text.y = element_blank(),
                   axis.title.y = element_blank())

#plotting several partial dependence plots (with "pdp" library)
plot.Response_20 <- autoplot(partial(rf.model, pred.var = "Response_20"), ylab = "Response_0", col = colorendo, size = 2) + gg.scale

plot.Driver_20 <- autoplot(partial(rf.model, pred.var = "Driver.A_20"), ylab = "Response_0", col = colorexo, size = 2) + gg.scale + no.y.axis

plot.Driver_0 <- autoplot(partial(rf.model, pred.var = "Driver.A_0"), ylab = "Response_0", col = colorexo, size = 2) + gg.scale + no.y.axis


plot_grid(plot.Response_20,plot.Driver_20, plot.Driver_0, ncol = 3, labels = c("A", "B", "C"), rel_widths = c(1.2, 1, 1))

rm(plot.Driver_20, plot.Response_20, plot.Driver_0, no.y.axis, gg.scale)
```
    
Interactions among predictors can be represented in the same way done before for recursive partition trees (see **Figure 7**). Again, all variables not represented in the plot are set to their average to generate the prediction.

    
```{r, fig.height=5, fig.width=9, size = "footnotesize" , fig.cap = "Interaction between Pollen_20 (first lag of the endogenous memory) and Suitability_0 (concurrent effect).", message = FALSE, error = FALSE, warning = FALSE, echo = FALSE, cache=FALSE, eval=FALSE}
plotInteraction(model = rf.model,
                data = sim.lags.rf,
                x = "Driver.A_20",
                y = "Response_20",
                z = "Response_0",
                point.size.range = c(0.1, 5))
```
    
## Variable importance
    
Random forest variable importance computation works under the assumption that if a given variable is not important, then permuting its values does not degrade the prediction accuracy. Variable importance scores are extracted with the *importance* function (see code below and **Table 4**), and are interpreted as "how much model fit degrades when the given variable is removed from the model".

    
```{r, eval = FALSE, cache=FALSE, eval=FALSE}
importance(rf.model)
```
    
```{r table4, echo = FALSE, cache=FALSE, eval=FALSE}

#importance to dataframe
rf.importance <- data.frame(importance(rf.model))

#name for the main column
colnames(rf.importance) <- "Importance"

#rownames as variable
rf.importance$Variable <- rownames(rf.importance)

#ordering by importance
rf.importance <- rf.importance[order(rf.importance[,1], decreasing = TRUE), c("Variable", "Importance")]

#rounding
rf.importance$Importance <- round(rf.importance$Importance, 2)

#removing rownames
rownames(rf.importance)<-NULL

#to kable
kable(rf.importance, caption = "Importance scores of a Random Forest model ordered from higher to lower importance. Importance scores are interpreted as increase in model error when the given variable is removed from the model.", booktabs = T, format="latex")  %>%
  kableExtra::kable_styling(latex_options = c("striped"))
```

Values shown in **Table 4** are the result of one particular Random Forest run. For variables with small differences in importance, the ranking shown in the table could change in a different model run. This situation can be addressed by running the model several times, and computing the average and confidence intervals of the importance scores of each predictor across runs. This is shown in the code below (see output in **Figure 8**).

```{r, cache = FALSE, message = FALSE, error = FALSE, warning = FALSE , cache=FALSE, size="footnotesize", eval=FALSE}

#number of repetitions
repetitions <- 10

#list to save importance results
importance.list <- list()

#repetitions
for(i in 1:repetitions){
  
  #fitting a Random Forest model
  rf.model <- ranger(
    data = sim.lags.rf,
    dependent.variable.name = "Response_0",
    mtry = 2,
    importance = "permutation", 
    scale.permutation.importance = TRUE
    )

  #extracting importance
  importance.list[[i]] <- data.frame(t(importance(rf.model)))
}

#into a single dataframe
importance.df <- do.call("rbind", importance.list)
```


\newpage

```{r, fig.height = 9, fig.width=9, fig.cap = "Importance scores of predictors in Random Forest model after 100 repetitions. Note that Response_X predictor refers to the endogenous memory, and Driver_X predictors from lag 20 refer to exogenous memory. Driver_0 represents the concurrent effect.", eval=FALSE, message = FALSE, error = FALSE, warning = FALSE, echo = FALSE, cache=FALSE}
#boxplot
par(mar = c(5, 10, 4, 2) + 0.1)
boxplot(importance.df, 
        horizontal = TRUE, 
        las = 1, 
        cex.axis = 1, 
        cex.lab = 1.2,
        xlab = "Importance (% increment in mse)", 
        notch = TRUE, 
        col = c(rep(colorendo, 12), rep(colorexo, 13)))
```

\newpage

## Testing the significance of variable importance scores

Random Forest does not provide any tool to assess the significance of these importance scores, and it is therefore impossible to know at what point they become irrelevant. A simple solution is to add a random variable as an additional predictor to the model and compute its importance. If the importance of other variables is equal or lower than the importance of the random variable, it can be assumed that these variables do not have a meaningful effect on the response, and can therefore be considered irrelevant. 

Two types of random variables can be considered to be used as benchmarks to test variable importance scores provided by Random Forest: **white noise** (without any temporal structure), and **random walk with temporal structure** (as explained in Appendix I). In both cases the idea is to generate a *null model* providing a baseline to assess to what extent importance scores are higher than what is expected by chance. To test the suitability of both methods, the code below generates 10 Random Forest models, each one with two additional random variables: *random.white* representing white noise, and *random.autocor* representing an autocorrelated random walk. The length of the autocorrelation period of *random.autocor* is changed for every iteration.

```{r, cache=FALSE, size="footnotesize", eval=FALSE}

#number of repetitions
repetitions <- 10

#list to save importance results
importance.list <- list()

#rows of the input dataset
n.rows <- nrow(sim.lags.rf)

#repetitions
for(i in 1:repetitions){
  
  #adding/replacing random.white column
  sim.lags.rf$random.white <- rnorm(n.rows)
  
  #adding/replacing random.autocor column
  #different filter length on each run = different temporal structure
  sim.lags.rf$random.autocor <- as.vector(
    filter(rnorm(n.rows),
           filter = rep(1, sample(1:floor(n.rows/4), 1)),
           method = "convolution",
           circular = TRUE))

  #fitting a Random Forest model
  rf.model <- ranger(
    data = sim.lags.rf,
    dependent.variable.name = "Response_0",
    mtry = 2,
    importance = "permutation", 
    scale.permutation.importance = TRUE)

  #extracting importance
  importance.list[[i]] <- data.frame(t(importance(rf.model)))
  
}

#into a single dataframe
importance.df <- do.call("rbind", importance.list)
```


```{r, fig.height = 9, fig.width=9, fig.cap = "Importance of predictors in relation to the importance of a white noise variable (gray), and a temporally structured random variable (yellow). Solid lines represent the medians of the random variables, while dashed lines represent their maximum importance across 100 model runs.", message = FALSE, error = FALSE, warning = FALSE, echo=FALSE, cache=FALSE, eval=FALSE} 

#boxplot
par(mar = c(5, 10, 4, 2) + 0.1)
boxplot(importance.df, 
        horizontal = TRUE, 
        las = 1, 
        cex.axis = 1, 
        cex.lab = 1.2,
        xlab = "Importance (% increment in mse)", 
        notch = TRUE, 
        col = c(rep(colorendo, 12), rep(colorexo, 13), "gray80", "#FEEE62"))
abline(v = quantile(importance.df$random.autocor, probs = c(0.5, 1)), col = "#FEEE62", lwd = c(3,2), lty = c(1,2))
abline(v = quantile(importance.df$random.white, probs = c(0.5, 1)), col = "gray80", lwd = c(3,3), lty = c(1,2))

```

```{r, echo=FALSE, message=FALSE, warning=FALSE, error=FALSE, results="hide", cache=FALSE, eval=FALSE}
rm(importance.df, importance.list, rf.importance, rf.model, i, n.rows, repetitions, sim.lags.rf)
# gc()
```


The boxplot in **Figure 9** shows the relative importance of the random variables, and suggests that the variable representing random noise is not useful to identify importance scores arising by chance. On the other hand, the variable based on autocorrelated random walks (marked in yellow in the plot) tells a different story. Importance values below the yellow solid line have a probability higher than 0.5 of being the result of chance. Importance values between the yellow solid and dashed lines have probabilities between 0.5 and 0 and are the result of a random association between a predictor and the response, while beyond the dashed line the results can be confidently defined as non-random. Note that **Figure 8**, when compared with **Figure 7**, also shows that adding random variables to a Random Forest model does not change the importance scores of other variables in the model.

# Analyzing ecological memory with Random Forest and the *memoria* library

So far we have explained how to organize the simulated pollen curves in lags, and how to fit Random Forest models on the lagged data to evaluate variable importance. However, further steps are required to quantify ecological memory patterns:


+  Extract and aggregate variable importance scores, and organize them in ecological memory components (endogenous, exogenous, and concurrent).
+  Plot the pattern to facilitate interpretation.
+  Extract ecological memory features from these components, namely **memory strength** (maximum difference in relative importance between each component (endogenous, exogenous, and concurrent) and the median of the random component), **memory length** (proportion of lags over which the importance of a memory component is above the median of the random component), and **dominance** (proportion of the lags above the median of the random term over which a memory component has a higher importance than the other component).

The function **computeMemory** fits as many Random Forest models as indicated by the argument *repetitions* on a lagged dataset, and on each iteration includes a random variable in the model. The function **plotMemory** gets the output of **computeMemory** and plots it, while the function *extractMemoryFeatures* computes the features of each ecological memory component. The simplified workflow is shown below.

```{r, fig.height = 4, fig.width=9, fig.cap = "Example of ecological memory pattern. Note that the model was only run for 30 repetitions, and therefore the position of the median of the Random component is not reliable.", message = FALSE, error = FALSE, warning = FALSE, results="asis", cache=FALSE, eval=FALSE} 

#computes ecological memory pattern
memory.pattern <- computeMemory(
  lagged.data = sim.lags, 
  drivers = "Driver.A", 
  random.mode="autocorrelated",
  repetitions=30, 
  response="Response"
  )

#computing memory features
memory.features <- extractMemoryFeatures(
  memory.pattern=memory.pattern,
  exogenous.component="Driver.A",
  endogenous.component="Response"
  )

#plotting the ecological memory pattern
plotMemory(memory.pattern)
```

```{r, echo=FALSE, cache=FALSE, eval=FALSE}
#Table of memory features
memory.features.t <- round(t(memory.features[, 2:8]), 2)
kable(memory.features.t, caption = "Features of the ecological memory pattern shown in Figure 10 by using the extractMemoryFeatures function.", booktabs = T, format="latex")  %>%
  kableExtra::kable_styling(latex_options = c("hold_position", "striped"))
```

In order to analyze the ecological memory patterns of 16 virtual taxa across the 5 levels of data quality (Annual, 1cm, 2cm, 6cm, and 10cm), we integrated the functions above into a larger function named *runExperiment*. The code below runs an artificial simple example with only two virtual taxa (1 and 6 in *parameters*), and two dataset types ("1cm" and "10cm"). Only 30 repetitions are generated to quicken execution, which is not nearly enough to achieve consistent results.

```{r, results="hide", cache=FALSE, warning=FALSE, message=FALSE, error=FALSE, cache=FALSE, eval=FALSE}

#running experiment
E1 <- runExperiment(
  simulations.file = simulation,
  selected.rows = 1:4,
  selected.columns = 1,
  parameters.file = parameters,
  parameters.names = c("maximum.age",
                       "fecundity",
                       "niche.A.mean",
                       "niche.A.sd"),
  sampling.names = "1cm",
  driver.column = "Driver.A",
  response.column = "Pollen",
  time.column = "Time",
  lags = lags,
  repetitions = 30
  )

#E1 is a list of lists
#first list: names of experiment output
E1$names 

#second list, first element
i <- 1 #change to see other elements
#ecological memory pattern
E1$output[[i]]$memory

#pseudo R-squared across repetitions
E1$output[[i]]$R2

#predicted pollen across repetitions
E1$output[[i]]$prediction

#variance inflation factor of input data
E1$output[[i]]$multicollinearity

```

Experiment results can be plotted with the function *plotExperiment* shown below.

```{r, fig.height = 8, fig.width=9, cache = FALSE, fig.cap = "Ecological memory patterns of four virtual taxa. Note that the number of repetitions (30) is too low to obtain a reliable median of the Random component. Abbreviations in plot title: ma - maximum age, f - fecundity, Am - niche mean, Asd - niche breadth, smp - sampling resolution, R2 - pseudo R-squared.", message = FALSE, error = FALSE, warning = FALSE, cache=FALSE, eval=FALSE}
plotExperiment(
  experiment.output=E1,
  parameters.file=parameters,
  experiment.title="Toy experiment",
  sampling.names=c("1cm", "10cm"),
  legend.position="bottom",
  R2=TRUE
  )
```

The experiment data can be organizes as a single table, joined with the data available in the *parameters* dataframe to facilitate further analyses.

```{r, cache=FALSE, eval=FALSE}
E1.df <- experimentToTable(
  experiment.output=E1,
  parameters.file=parameters,
  sampling.names=c("1cm", "10cm"),
  R2=TRUE
  )
```



```{r , echo = FALSE, cache=FALSE, eval=FALSE}
#removing a useless column
E1.df$name <- NULL
E1.df$autocorrelation.length.A <- NULL
E1.df$autocorrelation.length.B <- NULL
E1.df$pollen.control <- NULL
E1.df$maximum.biomass <- NULL
E1.df$carrying.capacity <- NULL

#rounding some columns
E1.df$R2mean <- round(E1.df$R2mean, 2)
E1.df$R2sd <- round(E1.df$R2sd, 3)
E1.df$median <- round(E1.df$median, 1)
E1.df$sd <- round(E1.df$sd, 2)
E1.df$min <- round(E1.df$R2sd, 1)
E1.df$max <- round(E1.df$R2sd, 1)

rownames(E1.df) <- NULL

#printing table

kable(E1.df[1:40, ], caption="First rows of the experiments table.", booktabs = T, format="latex") %>% 
  kable_styling(latex_options = c("scale_down", "hold_position", "striped"))  %>%  
  column_spec(1:ncol(E1.df), color="#38598C") %>%
  row_spec(0, col="#585858")

```




Finally, ecological memory features can be extracted from the experiment with *extractMemoryFeatures* in order to facilitate further analyses, as shown below.


```{r, cache=FALSE, eval=FALSE}
E1.features <- extractMemoryFeatures(
  memory.pattern = E1.df,
  exogenous.component = "Driver.A",
  endogenous.component = "Response"
  )
```

```{r, echo=FALSE, cache=FALSE, eval=FALSE}
#rounding some columns
E1.features$length.endogenous <- round(E1.features$length.endogenous, 3)
E1.features$length.exogenous <- round(E1.features$length.exogenous, 3)
E1.features$dominance.endogenous <- round(E1.features$dominance.endogenous, 3)
E1.features$dominance.exogenous <- round(E1.features$dominance.exogenous, 3)
  
  
E1.features.t <- t(E1.features)
kable(E1.features.t, caption = "Features of the ecological memory patterns produced by the example experiment. Features with value NA result from ecological memory components that fall below the median of the random component across all lags.", booktabs = T, format="latex")  %>%
  kableExtra::kable_styling(latex_options = c("hold_position", "striped"))
```

