---
title: "Site selection"
author: "Housen Chu"
output: rmarkdown::html_vignette
vignette: >
 %\VignetteIndexEntry{Site selection}
 %\VignetteEncoding{UTF-8}
 %\VignetteEngine{knitr::rmarkdown}
---

```{r, echo = FALSE, include = FALSE, message = FALSE}
knitr::opts_chunk$set(
  warning = FALSE,
  collapse = TRUE,
  dpi = 120,
  comment = "#>"
)

```

```{r setup, warning = FALSE}
library(amerifluxr)
library(data.table)
library(pander)
```

amerifluxr is a programmatic interface to the 
[AmeriFlux](https://ameriflux.lbl.gov/). This vignette
demonstrates examples to query a list of target sites 
based on sites' general information and availability 
of metadata and data. A companion vignette for 
[Data import](data_import.html) is available as well.     

## Get a site list with general info

AmeriFlux data are organized by individual sites. 
Typically, data query begins with site search and selection.
A full list of AmeriFlux sites with general info can be
obtained using the amf_site_info() function. 

Convert the site list to a data.table for easier 
manipulation. Also see 
[link](https://ameriflux.lbl.gov/data/badm/badm-standards/) 
for variable definition. 

```{r echo=TRUE, results = "asis"}
# get a full list of sites with general info
sites <- amf_site_info()
sites_dt <- data.table::as.data.table(sites)

pander::pandoc.table(sites_dt[c(1:3), ])

```

The site list provides a quick summary of all 
registered sites and sites with available data.

It's often important to understand the data use 
policy under which the data are shared. In 2021, 
the AmeriFlux community moved to the AmeriFlux 
CC-BY-4.0 License. Most site PIs now share their
sites’ data under the CC-BY-4.0 license. Data for
some sites are shared under the historical AmeriFlux
data-sharing policy, now called the AmeriFlux Legacy
Data Policy. 

Check [link](https://ameriflux.lbl.gov/data/data-policy/#data-use)
for data use policy and attribution guidelines.

```{r results = "asis"}
# total number of registered sites
pander::pandoc.table(sites_dt[, .N])

# total number of sites with available data
pander::pandoc.table(sites_dt[!is.na(DATA_START), .N])

# get number of sites with available data, grouped by data use policy
pander::pandoc.table(sites_dt[!is.na(DATA_START), .N, by = .(DATA_POLICY)])

```

Further group sites based on 
[IGBP](https://ameriflux.lbl.gov/data/badm/badm-standards/IGBP). 

```{r results = "asis"}
# get a summary table of sites grouped by IGBP
pander::pandoc.table(sites_dt[, .N, by = "IGBP"])

# get a summary table of sites with available data, & grouped by IGBP
pander::pandoc.table(sites_dt[!is.na(DATA_START), .N, by = "IGBP"])

# get a summary table of sites with available data, 
#  & grouped by data use policy & IGBP
pander::pandoc.table(sites_dt[!is.na(DATA_START), .N, by = .(IGBP, DATA_POLICY)][order(IGBP)])

```

Once decided, users can query a target site list 
based on the desired criteria, e.g., IGBP, data 
availability, data policy, geolocation. 

```{r results = "asis"}

# get a list of cropland and grassland sites with available data,
#  shared under CC-BY-4.0 data policy,
#  located within 30-50 degree N in latitude,
# returned a site list with site ID, name, data starting/ending year
crop_ls <- sites_dt[IGBP %in% c("CRO", "GRA") &
                      !is.na(DATA_START) &
                      LOCATION_LAT > 30 &
                      LOCATION_LAT < 50 &
                      DATA_POLICY == "CCBY4.0",
                    .(SITE_ID, SITE_NAME, DATA_START, DATA_END)]
pander::pandoc.table(crop_ls[c(1:10),])

```

## Get metadata availability

In some cases, users may want to know if certain 
types of metadata are available for the selected 
sites. The amf_list_metadata() function provides 
a quick summary of metadata availability before 
actually downloading the data and metadata. 

By default, amf_list_metadata() returns a full site
list with the available entries (i.e., counts) 
for all BADM groups. Check 
[AmeriFlux webpage](https://ameriflux.lbl.gov/data/badm/badm-basics/)
for definitions of all BADM groups.     

```{r results = "asis"}
# get data availability for selected sites
metadata_aval <- data.table::as.data.table(amf_list_metadata())
pander::pandoc.table(metadata_aval[c(1:3), c(1:10)])
```

The site_set parameter of the amf_list_metadata()
can be used to subset the sites of interest.

```{r results = "asis"}
metadata_aval_sub <- as.data.table(amf_list_metadata(site_set = crop_ls$SITE_ID))

# down-select cropland & grassland sites by interested BADM group,
#  e.g., canopy height (GRP_HEIGHTC)
crop_ls2 <- metadata_aval_sub[GRP_HEIGHTC > 0, .(SITE_ID, GRP_HEIGHTC)][order(-GRP_HEIGHTC)]
pander::pandoc.table(crop_ls2[c(1:10), ])
```

## Get data availability

Users can use amf_list_data() to query the 
availability of specific variables in the data
(i.e., flux/met data, so-called BASE data product).
The amf_list_data() provides a quick summary 
of variable availability (per site/year)
before downloading the data.

By default, amf_list_data() returns a full site 
list of variable availability (data percentages 
per year) for all variables. The site_set parameter 
of amf_list_data() can be used to subset the sites 
of interest. 

```{r results = "asis"}
# get data availability for selected sites
data_aval <- data.table::as.data.table(amf_list_data(site_set = crop_ls2$SITE_ID))
pander::pandoc.table(data_aval[c(1:10), ])
```

The variable availability can be used to subset
sites that have certain variables in specific 
years. The BASENAME column indicates the variable's
base name (i.e., ignoring position qualifier), and
can be used to get a coarse-level variable availability. 

See [AmeriFlux website](https://ameriflux.lbl.gov/data/aboutdata/data-variables/)
for definitions of base names and qualifiers.

```{r results = "asis"}
# down-select cropland & grassland sites based on the available wind speed (WS) and 
# friction velocity (USTAR) data in 2015-2018, regardless their qualifiers
data_aval_sub <- data_aval[data_aval$BASENAME %in% c("WS","USTAR"),
                           .(SITE_ID, BASENAME, Y2015, Y2016, Y2017, Y2018)]

# calculate mean availability of WS and USTAR in each site and each year
data_aval_sub <- data_aval_sub[, lapply(.SD, mean), 
                               by = .(SITE_ID),
                               .SDcols = c("Y2015", "Y2016", "Y2017", "Y2018")]

# sub-select sites that have WS and USTAR data for > 75%
#  during 2015-2018
crop_ls3 <- data_aval_sub[(Y2015 + Y2016 + Y2017 + Y2018) / 4 > 0.75]
pander::pandoc.table(crop_ls3)

```

Last, sometimes users would look for sites with 
multiple measurements of similar variables (e.g.,
multilevel wind speed, soil temperature). The 
VARIABLE column in the variable availability can
be used to get a fine-level variable availability.

```{r results = "asis"}

# down-select cropland & grassland sites by available wind speed (WS) data,
#  mean availability of WS during 2015-2018
data_aval_sub2 <- data_aval[data_aval$BASENAME %in% c("WS"),
                            .(SITE_ID, VARIABLE, Y2015_2018 = (Y2015 + Y2016 + Y2017 + Y2018)/4)]

# calculate number of WS variables per site, for sites that 
#  have any WS data during 2015-2018
data_aval_sub2 <- data_aval_sub2[Y2015_2018 > 0, .(.N, Y2015_2018 = mean(Y2015_2018)), .(SITE_ID)]
pander::pandoc.table(crop_ls4 <- data_aval_sub2[N > 1, ])

```

A companion function amf_plot_datayear() can be used 
for visualizing the data availability in an interactive
figure. However, it is strongly advised to subset
the sites, variables, and/or years for faster processing 
and better visualization.

```{r eval = FALSE}
#### not evaluated so to reduce vignette size
# plot data availability for WS & USTAR
#  for selected sites in 2015-2018
amf_plot_datayear(
  site_set = crop_ls4$SITE_ID,
  var_set = c("WS", "USTAR"),
  nonfilled_only = TRUE,
  year_set = c(2015:2018)
)

```

## Get data summary

In addition, users can use amf_summarize_data() to 
query the summary statistics of specific variables 
in the BASE data. The amf_summarize_data() provides
summary statistics for each variable (e.g., percentiles)
before downloading the data.

By default, amf_summarize_data() returns variable 
summary (selected percentiles) for all variables
and sites. The site_set and var_set parameters 
can be used to subset the sites or variables of 
interest. 

```{r results = "asis"}
## get data summary for selected sites & variables
data_sum <- amf_summarize_data(site_set = crop_ls3$SITE_ID,
                     var_set = c("WS", "USTAR"))
pander::pandoc.table(data_sum[c(1:10), ])

```

Alternatively, a companion function 
amf_plot_datasummary() provides interactive 
visualization to the data summary.

```{r eval = FALSE}
#### not evaluated so to reduce vignette size
## plot data summary of USTAR for selected sites, 
amf_plot_datasummary(
  site_set = crop_ls3$SITE_ID,
  var_set = c("USTAR")
)

```


```{r eval = FALSE}
#### not evaluated so to reduce vignette size
## plot data summary of WS for selected sites, 
#  including clustering information
amf_plot_datasummary(
  site_set = crop_ls3$SITE_ID,
  var_set = c("WS"),
  show_cluster = TRUE
)

```

Once having a target site list, users can download
these sites' data and metadata using the site IDs. 
See [Data import](data_import.html) for data 
download and import examples.