---
title: "Making parsers with higher order functions"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Making parsers with higher order functions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: parcr.bib
csl: molecular-microbiology.csl
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Goal of the package

The goal of this package is to simplify the creation of transparent parsers for
structured text files generated by machines like laboratory instruments. For 
example, we use the package to construct parsers for files generated by 
[plate readers](https://en.wikipedia.org/wiki/Plate_reader). The data generated 
by these instruments can usually be exported to text or spreadsheet files. Such
files consist of lines of text organized in higher-order structures like 
headers with metadata and blocks of measured values, _etc._. It's often 
convenient to analyze the data in a program like R. To be able to do that you
have to have a parser that processes these files and creates R-objects as 
output. The `parcr`package simplifies the task of creating such parsers.

## Higher order functions in R

The parsers that are created with this package make extensive use of
functional programming. If this topic is new to you then please read about
[Functional Programming](https://adv-r.hadley.nz/fp.html), in particular the 
chapters *Function Factories* and *Function Operators*, in Hadley Wickham's
book [**Advanced R**](https://adv-r.hadley.nz).

## Creating parser combinators in R to parse text files

The `parcr` package contains a set of functions that allow you to create simple
parsers with higher order functions, functions that can take functions as input 
and have functions as output. These are sometimes called combinators. The 
ideas behind the package are described in a paper by @Hutton1992. A number of 
the functions described in this paper are implemented with modifications in the
current package. The package was also heavily inspired by the 
[`Ramble`](https://github.com/NoRaincheck/Ramble) package which is written in 
the same vein, but without the explicit parsing of structured text files in
mind.

### The output of the parsers: a `list`

The parsers constructed with the functions from this package generate a `list`
as an output. The parsers read the input vector from left to right.  When the
parser fails the output will be the empty list `list()`. However, if the 
parser is successful then it produces a `list` with two elements. An element 
called `L` (the *left* part) contains the output generated by that part of the
input vector which was successfully parsed, and the element called `R` 
(the *right* part) which contains the remainder that was not parsed. When an
entire  character vector is parsed the content of the `R` element equals 
`character(0)`. The content of the `L` part can be shaped to your desire. This
is demonstrated in the example of the *fasta* file parser later in this 
document.

## A simple example of using parser combinators

Please realize that every function described below is a higher order function:
their output is a function. In its turn, this function can take a character
vector as its input. For example, `literal("a")` yields a function. To
use that function as a parser you have to provide it with a character vector, 
the object that needs to be parsed, as input:

```
literal("a")(c("a","att"))
```

This parser tests whether the next element in its input is literally the string
"a". It will succeed in the example above, but will only consume the first 
element of its input and then stop. However, you can also use a higher order 
function like `literal("a")` as input to other higher order functions to create 
more complex parsers. For example, the function `then` takes two parsers `p1`
and `p2` as arguments, `then(p1, p2)` and applies them in sequence to the input.
In the `parcr` package the function is implemented in the infix form `%then%` 
which makes parser constructs better readable. The composite parser:

```
literal("a") %then% literal("att")
```

looks for an element with string "a" followed by an element with string "att".
Its application to the same vector:

```
(literal("a") %then% literal("att"))(c("a","att"))
```

will completely consume the input. In this way, using a number of standard 
parsers defined in the package, you can quickly construct flexible parsers 
taking complex input. Furthermore, the functions also allow you to construct
a desired R object as output while parsing.

## The functions in the `parcr` package

We will now discuss all of the parser combinator functions present in the 
package. You should also study their help pages. In particular the 
**Pseudocode** listed for each of them should help you to understand their
properties.

## The fundamental parsers

The six fundamental parsers allow you to construct a parser that will 
completely consume input, or to fail when the input does not satisfy the 
specifications of the parser.

### Succeeding and failing

```{r setup, echo=FALSE}
library(parcr)
```

- `succeed(o)`: where `o` is any kind of R-object
- `fail()`

The `succeed` and `fail` parsers are the nuts and bolts of a parser 
construction. The `succeed` parser always succeeds, without consuming any 
input, whereas the `fail` parser always fails.

The `succeed` parser constructs a `list` object with a 'left' or `L`-element
that contains the parsed result of the consumed part of the input vector and
the 'right' or `R`-element that contains the unconsumed part of the vector. The
`L`-element can contain any R-object that is constructed during parsing.

While `succeed` never fails, `fail` always does, regardless of the input
vector. To signal failure it returns a special form of the empty list `list()`,
namely a `marker` object printed as the icon `[]`.

**Important**: It is unlikely that you will ever use these two functions to 
construct parsers.

**Examples:**

```{r}
succeed("A")("abc")
succeed(data.frame(title="Keisri hull", author="Jaan Kross"))(c("Unconsumed","text"))
```

```{r}
fail()("abc")
```

### Parsers for the current element

The basic functions for recognizing the content of the current element, the 
left-most element of the input vector.

- `literal(c)`: tests whether the current element equals the string `c`.
- `satisfy(b)`: tests whether the current element satisfies function `b()` where
                `b()` is a logical function: it takes a string as input
                and returns `TRUE` or `FALSE`.
- `eof()`     : tests whether the input is at its end.

`eof()` is a special function that detects the end of a character vector, or if 
that character vector represents the lines of text file, the end of the file 
(EOF). In fact, it detects `character(0)` in the input vector and when 
successful it turns the `R`-side of the output into an empty list (`list()`) 
to signal that the end of the vector was detected.

**Examples:**

```{r}
literal('abc')(c('abc','def'))
```

```{r}
starts_with_a <- function(x) {grepl("^a", x)}
satisfy(starts_with_a)(c('abc','def'))
```

And here is an example of an unsuccessful parser:

```{r}
literal('a')(c('ab','a'))
```

An application of `eof()` to detect that we parsed the input completely.

```{r}
(literal("a") %then% literal("att") %then% eof())(c("a","att"))
```

Notice how the `R`-element differs from just

```{r}
(literal("a") %then% literal("att"))(c("a","att"))
```

### The fundamental combinators

- `p1 %or% p2`: applies alternative parsers `p1` and `p2` on the current element
   and returns the result of the first successful parser, or failure when both
   fail.
- `p1 %then% p2`: applies parser `p1` on the current element and `p2` on the 
   next element.

The `%or%` combinator enables us to try alternative parsers on the current 
element, whereas the `%then%` combinator enables us to test sequences of 
elements in a character vector.

Note that `%or%` uses lazy evaluation which means that the output of `%or%`
depends on the order of `p1` and `p2`: if both would in principle succeed then
only the result of `p1` is returned.

We also have two variations of the `%then%` combinator, `%xthen%` and `%thenx%` 
which do test but then discard the result from the first or second argument:

- `p1 %xthen% p2`: where `p1` and `p2` are parsers discards the result from
  `p2`
- `p1 %thenx% p2`: where `p1` and `p2` are parsers discards the result from
  `p1`

**Examples:**

```{r}
(literal('A') %or% satisfy(starts_with_a))(c('abc','def'))
```

```{r}
(literal('A') %then% satisfy(starts_with_a))(c('A', 'abc'))
```

```{r}
(literal('>') %thenx% satisfy(starts_with_a))(c('>', 'abc'))
```

## Modifying the output of a parser

As said, the six fundamental parsers allow you to construct a parser that will 
completely consume input. However, when this parser succeeds its output will, 
apart from the fact that every element is put in a list, be
equal to the input. In general, this is not very useful if you want to use
the output in other code. Therefore, we have two functions that allow you to 
modify the output of  successful parser. The basic functions for modifying 
output of a parser are:

- `p %ret% c` : when parser `p` is successful it returns the object `c` (a 
                string or `NULL`).
- `p %using% f` : when parser `p` is successful function `f()` is applied to the
                  input and its output is stored as the result.

**Examples**

```{r}
(literal('a') %ret% "We have an 'a'")(c('a','b'))
```

```{r}
(satisfy(starts_with_a) %using% toupper)(c('abc','d'))
```


## Derived parsers

Derived parsers are constructed from the six fundamental parsers.

### Iterators

- `zero_or_one(p)`: where `p` is a parser.
- `zero_or_more(p)`: where `p` is a parser.
- `one_or_more(p)`: where `p` is a parser.
- `exactly(n,p)`: where `n` is an integer and `p` is a parser.
- `match_n(n,p)`: where `n` is an integer and `p` is a parser.

`zero_or_one`, `zero_or_more` and `one_or_more` do exactly what their names 
suggest. You should realize that these are greedy parsers: they consume as many
as possible strings that can be successfully parsed by `p`. Similarly, 
`exactly` is a greedy parser, and it fails when there are less or more than `n`
consecutive strings that can be successfully parsed by `p`. On the other hand
`match_n` is not greedy. It consumes `n` but no more strings that can be 
successfully parsed with `p`.

**Examples:**

This parser will fail on its input, too many strings starting with "a":
```{r}
zero_or_one(satisfy(starts_with_a))(c('acc','aat','cgg'))
```


The following is a successful parse. Note that its result is not merely
`[]` which would have indicated failure, but an `L,R`-list with an empty list
in the `L`-element.
```{r}
zero_or_more(satisfy(starts_with_a))(c('cat','gac','cct'))
```

```{r}
one_or_more(satisfy(starts_with_a))(c('att','aac','cct'))
```

```{r}
exactly(2, satisfy(starts_with_a))(c('att','aac','cct'))
```

```{r}
match_n(1, satisfy(starts_with_a))(c('att','aac','cct'))
```

### Recognizing and processing strings with `match_s`

When constructing a parser you will often need to recognize as well as process
strings. For example, you want to recognize multiple integers in a line, 
extract these and then return them as a numeric vector. You're not interested
in other elements like comments in these strings. This could be achieved by 
combining `satisfy()` and subsequently `%using%`, like:

`satisfy(has_integers) %using% process_integers`

where `has_integers` is a boolean function and `process_integers` is a function
that both recognizes , extracts and rearranges numbers to a numeric vector. You
will often find that `has_integers` and `process_integers` use the same regular 
expressions. Then it may be more efficient to combine these, which is what the
`match_s()` function does.

The `match_s()` parser takes a simple (not higher-order) function `s` to process
the string from the current element and returns the result from that function. 
The function `s` has to be constructed in such a way that it returns the empty 
`list()` when the string does not satisfy the criteria that the user sets.

**Example:**

Here `numbers` is a function hat recognizes and returns numbers (in fact, 
positive integers) in a string:

```{r}
numbers <- function(x) {
  m <- gregexpr("[[:digit:]]+", x)
  matches <- as.numeric(regmatches(x,m)[[1]])
  if (length(matches)==0) {
    return(list()) # we signal parser failure when no numbers were found
  } else {
    return(matches)
  }
}

match_s(numbers)(" 101 12 187 # a comment on these numbers")
```


### Functions that split a string and parse the substrings

- `by_split(p, split, finish = TRUE, fixed = FALSE, perl = FALSE)`: where `p` 
   is a parser
- `by_symbol(p, finish = TRUE)`: where `p` is a parser

Although you can use the string processing functions from the `base` or 
`stringr` packages to parse and process individual elements of a character 
vector it is also possible to parse substrings by first splitting a string.
`by_split` uses a `split` pattern to first split the incoming string and then
applies the parser `p` to it. `by_symbol` splits the incoming string to
individual symbols and then applies the parser `p`. The `finish` boolean
indicates whether the parser should completely consume the split string.

Under the hood these functions use the function `strsplit()` and its `split`,
`fixed` and `perl` arguments are passed on.

**Examples:**

```{r}
starts_with_a <- function(x) grepl("^a",x[1])
# don's forget to use satisfy(), it turns starts_with into a parser
by_split(one_or_more(satisfy(starts_with_a)), ",", fixed = TRUE)("atggc,acggg,acttg")
```
```{r}
by_symbol(literal(">") %thenx% one_or_more(literal("b")), finish = FALSE)(">bb")
```

**Note**: Parsers become **slow** when using these two functions extensively.
If that bothers you then you should use the `match_s` or `satisfy()`and 
`%using%` parsers together with string processing functions like `grepl` and 
`grep` or the ones from `stringr` to process strings. Those parsers will be 
much faster.

### Derived functions to recognize and modify empty lines

- `EmptyLine()`
- `Spacer()`
- `MaybeEmpty()`

The function `EmptyLine()` detects and returns empty line. Empty lines are 
either the string `""` or strings consisting entirely of space-like characters
as identified by the regular expression `\\s`. `Spacer()` detects one or more
consecutive empty lines and discards these whereas `MaybeEmpty()` detects zero
or more empty lines and discards these.

An additional function `Ignore()` ignores all lines, whether empty or not, until
the end of the file. This is sometimes useful when the interesting part of a 
file has been parsed and all else can be ignored until the end of the file.

Note that I write these functions with capital letters. I use this convention 
here and in the example below to indicate that these functions parse 
higher-order structures (higher than the individual strings) in the input.

**Examples:**

```{r}
EmptyLine()("")
```

```{r}
Spacer()(c(" ","\t\t\t", "atgcc"))
```

```{r}
MaybeEmpty()(c("ggacc","gatccg", "atgcc"))
```

```{r}
(literal("Interesting") %then% 
  Ignore() %then% 
  eof())(c("Interesting", LETTERS))
```

## Example application: a parser for *fasta* sequence files

As an example of a somewhat realistic application let's try to write a parser
for fasta-formatted files for mixed nucleotide and protein sequences.

Such a fasta file could look like the example below

```{r, echo=FALSE, comment = NA}
data("fastafile")
cat(paste0(fastafile, collapse="\n"))
```

where the first two are nucleotide sequences and the last is a protein 
sequence[^1].

[^1]: It is not clear to me whether mixing of sequence types is allowed in the 
  fasta format. I guess not, because a protein sequence consisting entirely of
  glutamate (G), alanine (A), threonine (T) and cysteine (C) would not be 
  distinguishable from a nucleotide sequence. Such protein sequences would be 
  extremely rare. Anyway I demonstrate here that apart from this ambiguous case
  it is easy to parse them from a single file.

Since fasta files are text files we could read such a file using `readLines()`.
Below we simulate the result of reading the file above by loading the 
`nuclfasta` and `protfasta` data sets present in the package. The consist of 
character vectors.

```{r,eval=FALSE}
data("fastafile")
```

We can distinguish the following higher order components in a fasta file:
 
- A **fasta** file: consists of one or more **sequence blocks** until the 
  **end of the file**.
- A **sequence block**: consist of a **header**[^2] and a 
  **nucleotide sequence** or a **protein sequence**. A sequence block could be
  preceded by zero or more **empty lines**.
- A **nucleotide sequence**: consists of one or more 
  **nucleotide sequence strings**.
- A **protein sequence**: consists of one or more 
  **protein sequence strings**.
- A **header** is a *string* that starts with a ">" immediately followed by
  a **title** without spaces.
- A **nucleotide sequence string** is a *string* without spaces that consists
  *entirely* of symbols from the set `{G,A,T,C}`.
- A **protein sequence string** is a *string* without spaces that consists
  *entirely* of symbols from the set `{A,R,N,D,B,C,E,Q,Z,G,H,I,L,K,M,F,P,S,T,W,Y,V}`.

[^2]: Note that real fasta headers and sequences can have more complicated
      formats than I pretend here.

It now becomes clear what I mean when I say that the package allows us to write
transparent parsers: the description above of the structure of fasta files can
be translated straight into code for a `Fasta()` parser:

```{r}
Fasta <- function() {
  one_or_more(SequenceBlock()) %then%
    eof()
}

SequenceBlock <- function() {
  MaybeEmpty() %then% 
    Header() %then% 
    (NuclSequence() %or% ProtSequence())
}

NuclSequence <- function() {
  one_or_more(NuclSequenceString())
}

ProtSequence <- function() {
  one_or_more(ProtSequenceString())
}

```

Notice that these elements are functions taking no input, hence the empty 
argument brackets `()` behind their names. They can take input when needed,
for example to change their behavior (like `match_n()`, or see the other 
example below).

Now we need to define the string-parsers `Header()`, `NuclSequenceString()` 
and `ProtSequenceString()` that recognize and process these elements in the 
character vector `fastafile`. We use functions from `stringr` to do this in 
three helper functions, and we use `match_s()` to create parsers from these.

```{r}
# returns the title after the ">" in the sequence header
parse_header <- function(line) {
  # Study stringr::str_match() to understand what we do here
  m <- stringr::str_match(line, "^>(\\w+)")
  if (is.na(m[1])) {
    return(list()) # signal failure: no title found
  } else {
    return(m[2])
  }
}

# returns a nucleotide sequence string
parse_nucl_sequence_line <- function(line) {
  # The line must consist of GATC from the start (^) until the end ($)
  m <- stringr::str_match(line, "^([GATC]+)$")
  if (is.na(m[1])) {
    return(list()) # signal failure: not a valid nucleotide sequence string
  } else {
    return(m[2])
  }
}

# returns a protein sequence string
parse_prot_sequence_line <- function(line) {
  # The line must consist of ARNDBCEQZGHILKMFPSTWYV from the start (^) until the
  # end ($)
  m <- stringr::str_match(line, "^([ARNDBCEQZGHILKMFPSTWYV]+)$")
  if (is.na(m[1])) {
    return(list()) # signal failure: not a valid protein sequence string
  } else {
    return(m[2])
  }
}
```

Then we define the parsers.

```{r}
Header <- function() {
  match_s(parse_header)
}

NuclSequenceString <- function() {
  match_s(parse_nucl_sequence_line)
}

ProtSequenceString <- function() {
  match_s(parse_prot_sequence_line)
}
```

Now we have all the elements that we need to apply the `Fasta()` parser.

```{r}
Fasta()(fastafile)
```
Apart from `match_s()`, we have used only the six fundamental parsers. Therefore,
the output is almost the same as the parsed input. This is not very useful 
because it is difficult to extract the individual sequences and titles from it;
we would have to write sort of parser again to process this output. To mend 
this, we have to modify the output of the parsers. The first thing that we will
do is to let every sequence block be returned as an element of a list. To 
achieve this we extend the `SequenceBlock` parser by changing its output with
the `%using%` operator:

```{r}
SequenceBlock <- function() {
  MaybeEmpty() %then% 
    Header() %then% 
    (NuclSequence() %or% ProtSequence()) %using%
    function(x) list(x)
}
```

Now the result is a list of three lists, one for each sequence block.

```{r}
Fasta()(fastafile)[["L"]]
```
In principle, this output is easier to extract information from, but we can
improve on it. First, we want the sequences to appear as one long string, not
as separate character vectors corresponding to the lines in the sequence block. 
Therefore, we extend the `NuclSequence` and `ProtSequence` parsers collapsing
their output:

```{r}
NuclSequence <- function() {
  one_or_more(NuclSequenceString()) %using% 
    function(x) paste0(x, collapse = "")
}

ProtSequence <- function() {
  one_or_more(ProtSequenceString()) %using% 
  function(x) paste0(x, collapse="")
}
```

Then we get

```{r}
Fasta()(fastafile)[["L"]]
```

This looks much better: we know that the first element in each of these lists
is the title and the second element is the complete sequence. Then why not 
just attach a name to these elements? This would make extracting the 
information even easier. Furthermore, we also report whether the sequence is a
nucleotide or a protein sequence by adding a `type` tag.

```{r}
Header <- function() {
  match_s(parse_header) %using% 
    function(x) list(title = unlist(x))
}

NuclSequence <- function() {
  one_or_more(NuclSequenceString()) %using% 
    function(x) list(type = "Nucl", sequence = paste0(x, collapse=""))
}

ProtSequence <- function() {
  one_or_more(ProtSequenceString()) %using% 
    function(x) list(type = "Prot", sequence = paste0(x, collapse=""))
}
```

Finally, we have our desired output.

```{r}
d <- Fasta()(fastafile)[["L"]]
d
```

Let's present the result more concisely using the names of these elements:

```{r}
invisible(lapply(d, function(x) {cat(x$type, x$title, x$sequence, "\n")}))
```

## Example application: parsers with parameters

In the examples above we showed how to create parsers without parameters. It is easy and useful to sometimes create parsers with parameters. The parameters are used to change the behavior of the parsers. For example, when writing online course material I use a simple structured question template that is converted to html when the syllabus is generated. It consists mostly of markdown content. Its parser makes use of parametrized parsers. The structure of such a question template document is as follows[^3]:

[^3]: I simplified the template and code for this example. In fact the content is processed differently depending on the type of element, meaning that `Content()` is a function of `type`. Furthermore, questions are automatically numbered.

```{r, echo=FALSE}
qtemp <- c(
  "#### INTRO",
  "## Title about a set of questions",
  "",
  "This is optional introductory text to a set of questions.",
  "Titles preceded by four hashes are not allowed in a question template.",
  "",
  "#### QUESTION",
  "This is the first question",
  "",
  "#### TIP",
  "This would be a tip. tips are optional, and multiple tips can be given. Tips are",
  "wrapped in hide-reveal style html elements.",
  "",
  "#### TIP",
  "This would be a second tip.",
  "",
  "#### ANSWER",
  "The answer to the question is optional and is wrapped in a hide-reveal html element.",
  "",
  "#### QUESTION",
  "This is the second question. No tips for this one",
  "",
  "#### ANSWER",
  "Answer to the second question"
)
```

```{r, echo=FALSE, comment=NA}
cat(paste0(c(qtemp,"","<optionally more questions>"), collapse="\n"))
```

I stored this example content in a vector `qtemp` to parse it later.

You notice the recurring structure of a header with four hashes `####` and some text following it. These headers represent four types of elements: intro, question, tip and answer. Instead of writing separate parsers we could 
create a generic parser for such elements as:

```{r}
HeaderAndContent <- function(type) {
    (Header(type) %then% Content()) %using% 
    function(x) list(list(type=type, content=unlist(x)))
}
```

Then we define each of the four parsers as:

```{r}
Intro <- function() HeaderAndContent("intro")
Question <- function() HeaderAndContent("question")
Tip <- function() HeaderAndContent("tip")
Answer <- function() HeaderAndContent("answer")
```

The function `Header(type)` is defined as

```{r}
Header <- function(type) satisfy(header(type)) %ret% NULL

# This must also be a generic function: a function that generates a function to 
# recognize a header of type 'type'
header <- function(type) {
  function(x) grepl(paste0("^####\\s+", toupper(type), "\\s*"), x)
}
```

The content consists of one or more lines not starting with `####`, which 
includes empty lines. We discard trailing empty lines.

```{r}
Content <- function() {
  (one_or_more(match_s(content))) %using%
    function(x) stringr::str_trim(paste0(x,collapse="\n"), "right")
}

content <- function(x) {
  if (grepl("^####", x)) list()
  else x
}
```

The complete template is defined as follows

```{r}
Template <- function() {
  zero_or_more(Intro()) %then%
    one_or_more(QuestionBlock()) %then%
    eof()
}
```

where `QuestionBlock()` is defined using the previously defined elements as

```{r}
QuestionBlock <- function() {
    Question() %then%
    zero_or_more(Tip()) %then%
    zero_or_one(Answer()) %using%
    function(x) list(x)
}
```

We can now parse the input. We wrap the `Template()` parser in the `reporter()` 
function to have proper error messaging and warnings, if applicable. Furthermore
only the `L`-element, the parsed input, is returned.

```{r}
reporter(Template())(qtemp)
```

## Literature