---
title: Design
vignette: >
  %\VignetteIndexEntry{Design}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
---

This page documents the general design of fastreg. It covers the core
requirements, the public-facing interface, diagrams showing the general
flow of the main functions, and the conversion log.

::: callout-note
Reading SAS files with R can't guarantee that the SAS values are
perfectly preserved, because R relies on
[haven](https://haven.tidyverse.org/index.html), which is built on
[ReadStat](https://github.com/WizardMac/ReadStat), a reverse-engineered
effort to read the proprietary SAS file format.

However, haven and the underlying ReadStat are mature packages and
explicitly support reading `sas7bdat` files, which is the register
format used by Statistics Denmark (DST).
:::

## Requirements

The core requirements of fastreg are to:

1. Convert Danish register data from SAS files to the modern and
   efficient Parquet format.
2. Read register Parquet files into R as a DuckDB table.
3. Provide a [targets](https://docs.ropensci.org/targets/) pipeline
   template to convert multiple registers in parallel.
4. Provide helper functions to list available SAS or Parquet register
   files directly from R.

## Interface

The interface (the functions and objects that are exposed to users)
follows a naming convention: functions are generally named after the
**action** they perform and the **object(s)** they perform it on, in the
format `{action}_{object}()`. **Actions** are verbs that describe what a
function does, while **objects** are nouns that represent what the
function operates on. Below is an overview of the main actions, objects,
and settings within fastreg.

The actions are:

- `get`: Get or guess some information, e.g., the project ID and the
  workdata or rawdata directory from the current working directory, or a
  register name or year from a file name.
- `list`: List files in a directory, e.g., SAS or Parquet files.
- `convert`: Convert a register SAS file (or multiple) to Parquet.
- `read`: Read a Parquet register into R as a DuckDB table.
- `use`: Set up `_targets.R` and a Quarto log template.

The objects are:

- `chunk_size`: Number of rows to read per chunk during conversion.
- `path`: A character vector of one or more paths.
- `project_id`: A number indicating the project ID on DST.
- `output_dir`: The directory to save the Parquet output to.

The settings are:

- `fastreg.project_rawdata_dir`: The directory where either the SAS or
  Parquet files are stored. The `rawdata/` directory is read-only on the
  DST server and contains the original SAS files. A project manager with
  the correct permissions can move (or request to move) Parquet files
  into this directory.
- `fastreg.project_workdata_dir`: The directory where Parquet files are
  stored when a project has no project manager and its users don't have
  permission to save converted files into `rawdata/`. The `workdata/`
  directory is usually used to store and edit R scripts, documents, and
  other files, but it can also store data files (e.g., SAS or Parquet
  files).

These two settings tell fastreg where a project's register files are
kept, so that the list and read functions can find registers without
being given full paths.
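
As a minimal sketch, and assuming the settings are regular R options (as
the `options()` node in the diagrams below suggests), they could be set
and inspected like this. The project ID `123456` and both paths are
hypothetical placeholders, not real DST locations.

```r
# Hypothetical project directories; replace with the real locations on DST.
options(
  fastreg.project_rawdata_dir = "E:/rawdata/123456",
  fastreg.project_workdata_dir = "E:/workdata/123456"
)

# Check what is currently set. `getOption()` returns NULL for unset options.
rawdata_dir <- getOption("fastreg.project_rawdata_dir")
workdata_dir <- getOption("fastreg.project_workdata_dir")
```

Placing a call like this in a project-level `.Rprofile` would make the
settings available in every session, though whether that suits a given
project is a judgment call.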

::: callout-tip
For a list of all the public functions, see the
[Reference](https://dp-next.github.io/fastreg/reference/index.html)
page.
:::

### Converting one SAS file

```{mermaid}
%%| label: fig-flow
%%| fig-cap: "Expected workflow for converting one SAS file from a single register using `convert()`."
%%| fig-alt: "A flowchart showing the expected flow of converting one SAS file to a Parquet file."
flowchart TD
    opts_project_dir("options()")
    list_sas_files("list_sas_files()")
    path[/"path<br>[Character scalar]"/]
    output_dir[/"output_dir<br>[Character scalar]"/]
    chunk_size[/"chunk_size<br>[Integer scalar]"/]
    convert("convert()")
    output[/"Parquet file(s)<br>written to output_dir"/]

    %% Edges
    opts_project_dir --> list_sas_files -->|Select one path| path --> convert
    output_dir & chunk_size --> convert
    convert --> output
```

### Converting multiple registers in parallel

```{mermaid}
%%| label: fig-targets-flow
%%| fig-cap: "Expected workflow for converting multiple registers using the targets pipeline."
%%| fig-alt: "A flowchart showing the expected flow of converting register SAS files to Parquet files using the provided targets pipeline template."
flowchart TD
    copy_pipeline("use_template()")
    edit["Edit _targets.R as needed"]
    run_pipeline("targets::tar_make()")
    output[/"Parquet file(s)<br>written to directory<br>specified in _targets.R"/]

    %% Edges
    copy_pipeline --> edit --> run_pipeline --> output

    %% Style
    style edit fill:#FFFFFF, color:#000000, stroke-dasharray: 5 5
```

### Reading Parquet files

fastreg provides three ways to read Parquet registers depending on the
use case.

`read_register()` is the main read function. We wanted a function that
makes it easy to read in a particular register (with data from all
available years, if it is stored as a partitioned Parquet dataset). For
example, reading in `bef` (the population register) as a DuckDB table
should be as simple as `read_register("bef")`. The function should
automatically find the relevant partitioned Parquet dataset and read it
in as a single DuckDB table.

```{mermaid}
%%| label: fig-flow-read-register
%%| fig-cap: "Expected workflow for reading a Parquet register as a DuckDB table using `read_register()`."
%%| fig-alt: "A flowchart showing the expected flow of reading a Parquet register created with the fastreg package."
flowchart LR
    path[/"name<br>[Character scalar]"/]
    read_register("read_register()")
    output[/"Output<br>[DuckDB table]"/]

    %% Edges
    path --> read_register --> output
```

However, we can't guarantee that the `read_register()` function will
correctly guess and/or find the register as a Parquet dataset. So we
also provide two more flexible functions: `read_parquet_dataset()` and
`read_parquet_file()`.

`read_parquet_dataset()` underlies `read_register()`, but it doesn't
guess the path, so it also works when the settings haven't been set. It
takes a direct path to the Parquet dataset (the directory containing the
Hive-partitioned Parquet files), applies some settings to read in the
dataset more smoothly, and reads it as a DuckDB table. Use this function
if `read_register()` fails to find or read the right dataset.

`read_parquet_file()` is the simplest read function. It takes a direct
path to a `.parquet` file (not a partitioned dataset) and reads it as a
DuckDB table. This can be used if the register isn't in a partitioned
format.

### List SAS and Parquet files

To help with management as well as discovery of available registers, we
also provide helper functions to list the available SAS and Parquet
files and partitioned datasets.

`list_parquet_files()` takes the directories given in the settings and
lists all Parquet files within those directories that follow the
`part-*.parquet` pattern. If no setting is given, the project ID is
guessed from the working directory path and the function looks in the
default `rawdata/` and `workdata/` locations (e.g.,
`E:/rawdata/<project-id>/` on DST). If the files are kept somewhere
other than these defaults, the settings must be set. That way, users can
call `list_parquet_files()` without any arguments and it will
automatically find and list all the Parquet files within the project. We
look in both `rawdata/` (where the original SAS files are also kept) and
`workdata/` because some projects have managers who can save files (like
Parquet files) to `rawdata/`, while other projects don't and need to
save files in `workdata/`.
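
Guessing the project ID from a path could be sketched as below. This is
an illustration, not fastreg's implementation: `guess_project_id()` is a
hypothetical helper, and it assumes the project ID is the folder
directly below `rawdata/` or `workdata/`. The ID `123456` is made up.

```r
# Hypothetical helper (not part of fastreg): pull the project ID out of a
# path, assuming it is the folder directly below `rawdata/` or `workdata/`.
guess_project_id <- function(path = getwd()) {
  # Normalise Windows backslashes so one pattern covers both styles.
  path <- gsub("\\\\", "/", path)
  match <- regmatches(
    path,
    regexec("(?:^|/)(?:rawdata|workdata)/([^/]+)", path, perl = TRUE)
  )[[1]]
  if (length(match) == 0) {
    stop("Can't guess the project ID from: ", path)
  }
  # Element 1 is the full match, element 2 is the captured folder name.
  match[2]
}

project_id <- guess_project_id("E:/workdata/123456/analysis/scripts")

# Build the default rawdata location from the guessed ID.
default_rawdata_dir <- file.path("E:/rawdata", project_id)
```

Failing with an error when nothing matches mirrors the rule above that
the settings must be set when the defaults don't apply.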

`list_parquet_datasets()` builds on top of `list_parquet_files()`. It
takes the output of `list_parquet_files()`, goes up to the Parquet
partition root (hard-coded as two levels up from each file, i.e., the
directory that contains the `year=` folders), and lists all the
datasets. We use this function internally in `read_register()` to check
whether the register name provided by the user matches any of the
available Parquet datasets. It is also useful for interactively
discovering the Parquet datasets that are available within the project.
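
The two-step lookup described above can be sketched in base R. This is
not fastreg's code: the directory layout
(`<register>/year=<year>/part-0.parquet`) and the register names are
assumptions for illustration, and the files are empty placeholders
because only their paths matter here.

```r
# Build a small, fake Hive-partitioned layout in a temporary directory.
root <- file.path(tempdir(), "fastreg-sketch")
parts <- file.path(
  root,
  c("bef/year=2020", "bef/year=2021", "lpr/year=2020"),
  "part-0.parquet"
)
for (part in parts) {
  dir.create(dirname(part), recursive = TRUE, showWarnings = FALSE)
  file.create(part)
}

# Step 1: find every file that follows the `part-*.parquet` pattern.
files <- list.files(
  root,
  pattern = "^part-.*\\.parquet$",
  recursive = TRUE,
  full.names = TRUE
)

# Step 2: go two levels up (file -> `year=` folder -> dataset root) and
# keep each dataset only once.
datasets <- sort(unique(dirname(dirname(files))))
dataset_names <- basename(datasets)
```

Because the number of levels is fixed, a dataset partitioned by more
than one key would resolve to the wrong directory, which is the
trade-off of hard-coding the depth.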

`list_sas_files()` lists all SAS files found within the `rawdata/`
directory set in the settings. We only look in `rawdata/` because DST
stores the original SAS files there. Like `list_parquet_files()`, if the
setting isn't set, it guesses the project ID and looks in the `rawdata/`
directory of that project for SAS files.

## Conversion log

The purpose of the conversion log is to describe the details of the
conversion to provide an audit trail. Since we can't be sure that the
SAS files within the same register contain exactly the same columns and
data types, the conversion log helps identify any differences between
these files. It also includes any warnings produced by the targets
pipeline.

::: callout-note
Discrepancies (different columns or incompatible data types) between
files within the same register do not stop the conversion, but they are
included in the log.
:::

`convert()` returns a metadata tibble with one row per written chunk.
The returned tibble can be queried with dplyr directly or rendered into
a Quarto log.

The tibble has the following format:

| Column        | Description                                  |
|---------------|----------------------------------------------|
| `input_path`  | Path to the source SAS file                  |
| `output_path` | Path to the written Parquet part file        |
| `row_count`   | Number of rows in the chunk                  |
| `schema`      | Nested tibble with columns `name` and `type` |

This information is derived from the chunk already in memory, not by
reading the converted Parquet file, so the schema reflects the types as
read by haven rather than as stored in Parquet.
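
To illustrate how metadata in this shape can be queried, the sketch
below builds a mock version and finds the most common schema. It is not
the code used by the log template: a base data frame stands in for the
tibble, and the file names, column names, and types are made up.

```r
# Mock metadata with the four documented columns (all values are made up).
schema <- function(value_type) {
  data.frame(name = c("id", "value"), type = c("character", value_type))
}
metadata <- data.frame(
  input_path = c(
    "bef2020.sas7bdat", "bef2020.sas7bdat",
    "bef2021.sas7bdat", "bef2022.sas7bdat"
  ),
  output_path = file.path(
    "bef",
    c("year=2020", "year=2020", "year=2021", "year=2022"),
    c("part-0.parquet", "part-1.parquet", "part-0.parquet", "part-0.parquet")
  ),
  row_count = c(1000L, 250L, 1100L, 1200L)
)
metadata$schema <- list(
  schema("double"), schema("double"), schema("double"), schema("character")
)

# Rows per source file, summed over its chunks.
rows_per_file <- tapply(metadata$row_count, metadata$input_path, sum)

# Collapse each schema to one string so that schemas can be compared.
schema_key <- vapply(
  metadata$schema,
  function(x) paste(x$name, x$type, sep = ":", collapse = ", "),
  character(1)
)

# Keep one schema per source file (taken from its first chunk).
first_chunk <- !duplicated(metadata$input_path)
file_schema <- setNames(
  schema_key[first_chunk],
  metadata$input_path[first_chunk]
)

# The most common schema and the files that deviate from it.
schema_counts <- sort(table(file_schema), decreasing = TRUE)
most_common <- names(schema_counts)[1]
differing_files <- names(file_schema)[file_schema != most_common]
```

Taking the schema from the first chunk assumes that all chunks of one
SAS file share a schema, which seems reasonable since they come from the
same file.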

`use_template()` copies both a targets pipeline, `_targets.R`, and a
conversion log template, `conversion-log.qmd`, into the current working
directory. The conversion log (a PDF) is created as the final step of
the targets pipeline.

The conversion log has the following sections:

- A table of contents providing an overview of the converted registers.
- A warnings section, if the targets pipeline produced any warnings.
- One section per converted register, consisting of:
  - A subsection listing each Parquet chunk and its row count.
  - A subsection showing the most common schema and how many converted
    files share it.
  - A subsection showing schema differences, if any occur.

To customise the log, e.g., its output format or sections, edit
`conversion-log.qmd`.
