---
title: "Using mdbplyr"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using mdbplyr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

This repository contains `mdbplyr`, an R package that provides a disciplined,
lazy `dplyr`-style interface for MongoDB aggregation pipelines.

`mdbplyr` sits between raw `mongolite` usage and broader compatibility layers.
Compared with writing aggregation JSON by hand, it lets you express supported
queries with tidy verbs and inspect the generated pipeline before execution.
Compared with approaches that try to hide MongoDB behind table-like semantics,
it stays explicit about scope, keeps translation conservative, and fails clearly
when a verb or expression is outside the supported subset.

The practical advantages are:

- lazy query construction until `collect()`,
- cursor-based streaming with `cursor()`,
- direct translation to MongoDB aggregation stages,
- `show_query()` for inspectable pipeline output,
- explicit unsupported-feature errors instead of silent fallback,
- a smaller, testable semantic surface aligned with MongoDB rather than SQL.

# Usage examples

## Loading data

Suppose there is a running MongoDB instance on `localhost` with default port
and no authentication. The code below loads `dplyr::starwars` into a collection
named `starwars`.

```{r load-starwars, eval = FALSE}
library(dplyr)
library(mongolite)
library(mdbplyr)

starwars_collection <- mongolite::mongo(
  collection = "starwars",
  db = "mdbplyr"
)

starwars_collection$drop()
starwars_collection$insert(dplyr::starwars)

starwars_tbl <- tbl_mongo(
  starwars_collection,
  schema = names(dplyr::starwars)
)
```

Once the collection is loaded, `starwars_tbl` is the lazy table used in the
examples below.

```{r create-lazy-table, eval = FALSE}
library(dplyr)
library(mdbplyr)

starwars_collection <- mongolite::mongo(
  collection = "starwars",
  db = "mdbplyr"
)

starwars_tbl <- tbl_mongo(starwars_collection) %>% 
  infer_schema()
```

## Schema

`mdbplyr` uses a schema to know which fields are available in a collection.
This matters especially for:

- dotted paths in nested documents such as `` `message.timestamp` `` or `` `message.measurements.Fx` ``;
- schema-first expression translation in verbs such as `filter()`, `mutate()`, and `summarise()`;
- explicit operations on nested data such as `select()`, `flatten_fields()`, and `unwind_array()`.

The most reliable approach is to pass `schema = ...` explicitly when creating
the lazy table:

```{r schema-explicit, eval = FALSE}
starwars_tbl <- tbl_mongo(
  starwars_collection,
  schema = c("name", "species", "height", "mass", "homeworld")
)
```

When writing the schema by hand is inconvenient, `infer_schema()` can populate
it from the first document in the collection:

```{r schema-infer, eval = FALSE}
starwars_tbl <- tbl_mongo(starwars_collection) |>
  infer_schema()
```

This is convenient for exploratory work, but it has an important limitation:
it only sees one document. If the collection is heterogeneous, fields that do
not appear in the first document may still need to be added manually.

You can inspect the currently known fields with:

```{r schema-fields, eval = FALSE}
schema_fields(starwars_tbl)
```

## Basic inspection

Inspect the known schema and the generated pipeline without executing the
query.

```{r inspect, eval = FALSE}
schema_fields(starwars_tbl)

starwars_tbl |>
  filter(species == "Human", height > 180) |>
  select(name, height, mass) |>
  show_query()
```

## Streaming with `cursor()`

When you want a MongoDB iterator instead of an eagerly materialized tibble,
open the lazy query with `cursor()`.

```{r inspect-cursor, eval = FALSE}
iter <- starwars_tbl |>
  filter(species == "Human", height > 180) |>
  select(name, height, mass) |>
  cursor()

iter$page(10)
```

# Supported verbs

Each subsection below shows one of the supported `dplyr`-like verbs on the
`starwars` collection.

## `filter()`

```{r verb-filter, eval = FALSE}
starwars_tbl |>
  filter(species == "Droid", height > 100) |>
  collect()
```

## `select()`

```{r verb-select, eval = FALSE}
starwars_tbl |>
  select(name, species, homeworld) |>
  collect()
```

Selecting dotted paths preserves nested MongoDB structure by default. It does
not flatten nested fields unless you explicitly ask for that:

```{r verb-select-nested, eval = FALSE}
sensor_tbl |>
  select(`message.timestamp`, `message.measurements`) |>
  collect()
```

## `rename()`

```{r verb-rename, eval = FALSE}
starwars_tbl |>
  rename(character_name = name, planet = homeworld) |>
  collect()
```

## `mutate()`

```{r verb-mutate, eval = FALSE}
starwars_tbl |>
  mutate(height_m = height / 100, bmi_like = mass / (height_m * height_m)) |>
  select(name, height, mass, height_m, bmi_like) |>
  collect()
```

## `transmute()`

```{r verb-transmute, eval = FALSE}
starwars_tbl |>
  transmute(name = name, height_m = height / 100) |>
  collect()
```

## `arrange()`

```{r verb-arrange, eval = FALSE}
starwars_tbl |>
  arrange(desc(height), name) |>
  select(name, height) |>
  slice_head(n = 10) |>
  collect()
```

## `group_by()`

```{r verb-group-by, eval = FALSE}
starwars_tbl |>
  group_by(species)
```

## `summarise()`

```{r verb-summarise, eval = FALSE}
starwars_tbl |>
  group_by(species) |>
  summarise(
    n = n(),
    avg_height = mean(height),
    max_mass = max(mass)
  ) |>
  arrange(desc(n)) |>
  collect()
```

## `slice_head()`

```{r verb-slice-head, eval = FALSE}
starwars_tbl |>
  select(name, species) |>
  slice_head(n = 5) |>
  collect()
```

## `slice_tail()`

```{r verb-slice-tail, eval = FALSE}
starwars_tbl |>
  select(name, species) |>
  slice_tail(n = 5) |>
  collect()
```

## `head()`

```{r verb-head, eval = FALSE}
head(starwars_tbl, 5) |>
  collect()
```

## `flatten_fields()`

Use `flatten_fields()` when you explicitly want nested object leaves to become
flat tibble columns. By default the output names are the schema dot paths.

```{r verb-flatten-fields, eval = FALSE}
sensor_tbl |>
  select(`message.timestamp`, `message.measurements`) |>
  flatten_fields() |>
  collect()
```

You can also target a specific nested root and optionally rename the flattened
output columns:

```{r verb-flatten-fields-names, eval = FALSE}
sensor_tbl |>
  flatten_fields(
    `message.measurements`,
    names_fn = function(x) gsub(".", "_", x, fixed = TRUE)
  ) |>
  collect()
```

## `unwind_array()`

Use `unwind_array()` when a document field contains an array and you want one
output row per array element.

```{r verb-unwind-array, eval = FALSE}
orders_tbl |>
  unwind_array(items) |>
  collect()
```

If array elements are nested objects, `unwind_array()` and `flatten_fields()`
can be chained:

```{r verb-unwind-array-flatten, eval = FALSE}
orders_tbl |>
  unwind_array(items) |>
  flatten_fields(items) |>
  collect()
```

# New in v0.4.0

The features in this section were added in `mdbplyr` v0.4.0. Each heading is
marked **(New from v0.4.0)** so it is easy to tell them apart from the
longer-standing subset above.

## Server version awareness (New from v0.4.0)

`mdbplyr` now records the MongoDB server version when it can probe it, and some
features are gated on a minimum version (for example `median()` needs 7.0+ and
window functions need 5.0+). When the version cannot be determined it is left
unknown and the feature is allowed through, so the server still has the final
say.

```{r new-server-version, eval = FALSE}
# Probed automatically from a live connection, or set it explicitly:
starwars_tbl <- tbl_mongo(
  starwars_collection,
  schema = names(dplyr::starwars),
  server_version = "7.0"
)

mongo_server_version(starwars_tbl)
```

## Computed `group_by()` keys (New from v0.4.0)

Grouping is no longer limited to bare field names. A **named** computed key is
translated into the MongoDB `$group._id`.

```{r new-group-computed, eval = FALSE}
starwars_tbl |>
  group_by(height_band = floor(height / 50) * 50) |>
  summarise(n = n(), avg_mass = mean(mass)) |>
  arrange(height_band) |>
  collect()
```

## More `summarise()` accumulators (New from v0.4.0)

`summarise()` gained `sd()`, `var()`, `first()`, `last()`, and `n_distinct()`,
alongside the original `n()`, `sum()`, `mean()`, `min()`, and `max()`.

```{r new-summaries, eval = FALSE}
starwars_tbl |>
  group_by(species) |>
  summarise(
    n = n(),
    sd_height = sd(height),
    var_mass = var(mass),
    first_name = first(name),
    distinct_worlds = n_distinct(homeworld)
  ) |>
  collect()
```

`median()` and `quantile(probs = ...)` are also supported, but compile to
`$percentile` and therefore require **MongoDB 7.0+**:

```{r new-median, eval = FALSE}
starwars_tbl |>
  group_by(species) |>
  summarise(
    median_height = median(height),
    p90_mass = quantile(mass, probs = 0.9)
  ) |>
  collect()
```

## tidyselect helpers in `select()` (New from v0.4.0)

When the schema is known, `select()` accepts name-based tidyselect helpers
(`starts_with()`, `ends_with()`, `contains()`, `matches()`, `everything()`,
`all_of()`, `any_of()`), ranges, and negation. (`where()` is intentionally not
supported, because column types are unknown without reading data.)

```{r new-select-tidyselect, eval = FALSE}
starwars_tbl |>
  select(name, starts_with("home"), ends_with("_color")) |>
  collect()

# Drop columns with negation, or select from a character vector:
starwars_tbl |>
  select(-ends_with("_color")) |>
  collect()

cols <- c("name", "height", "mass")
starwars_tbl |>
  select(all_of(cols)) |>
  collect()
```

Pure bare-name selections still work exactly as before, including nested root
paths such as `` `message.measurements` ``.

## `coalesce()` (New from v0.4.0)

`coalesce()` compiles to `$ifNull`, returning the first non-missing value.

```{r new-coalesce, eval = FALSE}
starwars_tbl |>
  mutate(world = coalesce(homeworld, "unknown")) |>
  select(name, world) |>
  collect()
```

## `across()` (New from v0.4.0)

`across()` works in both `mutate()` and `summarise()`. It expands a column
selection and a function spec into one assignment per (column, function) pair.
`.fns` accepts a bare function name, a `~` lambda (using `.x`), or a named list
of those, and `.names` glue (`{.col}`, `{.fn}`) controls the output names.

```{r new-across-summarise, eval = FALSE}
# One aggregate over several columns
starwars_tbl |>
  group_by(species) |>
  summarise(across(c(height, mass), mean, na.rm = TRUE)) |>
  collect()

# A named list of aggregates -> {.col}_{.fn} output names
starwars_tbl |>
  group_by(species) |>
  summarise(across(c(height, mass), list(avg = mean, max = max))) |>
  collect()
```

```{r new-across-mutate, eval = FALSE}
# A lambda applied to each selected column, with custom output names
starwars_tbl |>
  mutate(across(c(height, mass), ~ .x / 100, .names = "{.col}_scaled")) |>
  select(name, height_scaled, mass_scaled) |>
  collect()
```

## Window functions (New from v0.4.0)

Window functions in `mutate()` / `transmute()` compile to MongoDB
`$setWindowFields` and therefore require **MongoDB 5.0+**. They partition by the
current `group_by()` keys.

Ranking functions take the column to order by (optionally wrapped in `desc()`):

```{r new-window-rank, eval = FALSE}
starwars_tbl |>
  group_by(species) |>
  mutate(height_rank = min_rank(desc(height))) |>
  select(name, species, height, height_rank) |>
  collect()
```

Cumulative and offset windows take their order from a preceding `arrange()`
(and the stage reorders the output by that key):

```{r new-window-cumulative, eval = FALSE}
starwars_tbl |>
  arrange(height) |>
  mutate(cumulative_mass = cumsum(mass)) |>
  select(name, height, mass, cumulative_mass) |>
  collect()
```

```{r new-window-offset, eval = FALSE}
starwars_tbl |>
  arrange(height) |>
  mutate(
    prev_height = lag(height),
    next_height = lead(height, default = 0)
  ) |>
  select(name, height, prev_height, next_height) |>
  collect()
```

`row_number()` is available as an alias for the existing `1:n()` row-numbering.

## Joins (New from v0.4.0)

`inner_join()`, `left_join()`, `semi_join()`, and `anti_join()` compile to
MongoDB `$lookup` against another collection in the same database. The
right-hand side must be a plain `tbl_mongo` (a collection reference with a known
schema and no lazy operations).

```{r new-join-setup, eval = FALSE}
films_tbl <- tbl_mongo(
  mongolite::mongo(collection = "films", db = "mdbplyr"),
  schema = c("film_id", "title", "director")
)
```

A mutating join flattens the match by default (one row per matched pair, with
colliding columns suffixed):

```{r new-join-inner, eval = FALSE}
appearances_tbl |>
  inner_join(films_tbl, by = "film_id") |>
  collect()
```

`left_join()` keeps unmatched left rows. Filtering joins keep the left columns
and only test whether a match exists:

```{r new-join-semi, eval = FALSE}
appearances_tbl |>
  semi_join(films_tbl, by = c("movie_id" = "film_id")) |>
  collect()
```

Pass `unnest = FALSE` to keep the matched documents as a native nested array
column instead of flattening them:

```{r new-join-nested, eval = FALSE}
appearances_tbl |>
  left_join(films_tbl, by = "film_id", unnest = FALSE) |>
  collect()
```

# Notes

The examples above stay within the currently supported subset:

- explicit field references,
- schema-first expression resolution, with `.data$...` for explicit field references and `.env$...` for local values,
- scalar expressions for `mutate()` and `transmute()`, plus the special `1:n()` row-numbering case,
- conservative grouped summaries, including computed group keys and the extended accumulator set (v0.4.0),
- name-based tidyselect helpers and `across()` in `select()`/`mutate()`/`summarise()` (v0.4.0),
- window functions via `$setWindowFields` (v0.4.0),
- `inner_join()` / `left_join()` / `semi_join()` / `anti_join()` via `$lookup` against a plain right-hand `tbl_mongo` (v0.4.0),
- explicit flattening of nested object fields with `flatten_fields()`,
- array row expansion with `unwind_array()`,
- no reshaping, writes, or automatic client-side fallback.

If a query falls outside that subset, `mdbplyr` is designed to fail
explicitly rather than guess or silently change execution semantics.