---
title: "Correlation Subset Selection with corrselect"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Correlation Subset Selection with corrselect}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(corrselect)
```

## Overview

`corrselect` identifies all maximal subsets of variables whose pairwise correlations stay below a chosen threshold. This process reduces multicollinearity and redundancy before modeling, while preserving interpretability. Unlike greedy or stepwise approaches, `corrselect` exhaustively searches for all valid subsets using fast, exact algorithms. It is fully model-agnostic, making it suitable as a preprocessing step for regression, clustering, feature selection, and other analyses.

Given a threshold \( t \in (0,1) \), the functions `corrSelect()` (data-frame interface) and `MatSelect()` (matrix interface) enumerate all **maximal** subsets \( S \) of variables satisfying:

\[
\forall i, j \in S,\ i \neq j: \ |r_{ij}| < t
\]

where \(r_{ij}\) denotes the chosen correlation measure between variables \(i\) and \(j\). Enumeration relies on two exact graph-theoretic algorithms:

1. **Eppstein–Löffler–Strash (ELS)**, a degeneracy-ordered backtracking algorithm optimized for sparse graphs.  
2. **Bron–Kerbosch (BK)**, a classical recursive clique-finding method, with optional pivoting to reduce search space.

Results are returned as a `CorrCombo` S4 object containing each subset’s variable names and summary statistics (`avg_corr`, `min_corr`, `max_corr`). You can then extract subsets from the original data via `corrSubset()`. Because the procedure does not depend on any downstream model, it cleanly separates “feature curation” from “model fitting” and supports multiple correlation measures (`pearson`, `spearman`, `kendall`, `bicor`, `distance`, `maximal`).

---

## Quick Start (`CorrSelect`)

### Simulated numeric example

```{r}
set.seed(42)
n <- 100
df <- data.frame(
  A = rnorm(n),
  B = rnorm(n),
  C = rnorm(n),
  D = rnorm(n),
  E = rnorm(n)
)
df$F <- df$A * 0.9 + rnorm(n, sd = 0.1)  # strongly correlated with A
```


### Basic selection

```{r}
res <- corrSelect(df, threshold = 0.7)
res
as.data.frame(res)
```

```{r}
corrSubset(res, df, which = 1)[1:10,]
```


### Forcing variables into all subsets

```{r}
res2 <- corrSelect(df, threshold = 0.7, force_in = "A")
res2
```


### Using a different correlation method

```{r}
res3 <- corrSelect(df, threshold = 0.6, cor_method = "spearman")
res3
```


## Matrix Interface (`MatSelect`)

If you already computed a correlation matrix or want to apply the method to precomputed correlations:

```{r}
mat <- cor(df)
res4 <- MatSelect(mat, threshold = 0.7)
res4
```

Selecting subsets:

```{r}
MatSelect(mat, threshold = 0.5)
```

Force variable 1 into every subset:

```{r}
MatSelect(mat, threshold = 0.5, force_in = 1)
```

## Mixed Data Types (`assocSelect`)

```{r}
df_ass <- data.frame(
  height = rnorm(15, 170, 10),
  weight = rnorm(15, 70, 12),
  group  = factor(rep(LETTERS[1:3], each = 5)),
  score  = ordered(sample(c("low","med","high"), 15, TRUE))
)

# keep every subset whose internal associations ≤ 0.6
res5 <- assocSelect(df_ass, threshold = 0.6)
res5
```


---

## Changing Correlation Method

By default, `corrSelect()` uses Pearson correlation. You can choose alternatives with the `cor_method` argument:

- `"pearson"`: linear correlation (default)  
- `"spearman"`: rank-based monotonic association  
- `"kendall"`: Kendall’s tau  
- `"bicor"`: robust biweight midcorrelation (`WGCNA::bicor`)  
- `"distance"`: distance correlation (`energy::dcor`)  
- `"maximal"`: maximal information coefficient (`minerva::mine`)  

Example:

```{r}
res6 <- corrSelect(df, threshold = 0.7, cor_method = "spearman")
res6
```

---

## Handling Mixed Data Types

The function `assocSelect()` extends `corrSelect()` to support **mixed data types** — including numeric, factor, and ordered variables — by using appropriate association measures for each variable pair.

Instead of a single correlation matrix, it constructs a **generalized association matrix** using the following logic:

| Variable 1       | Variable 2       | Method Used    |
|------------------|------------------|----------------|
| numeric          | numeric          | `pearson` (default; customizable) |
| numeric          | factor           | `eta` |
| numeric          | ordered          | `spearman` (default; customizable) |
| factor           | factor           | `cramersv` |
| factor           | ordered          | `cramersv` |
| ordered          | ordered          | `spearman` (default; customizable) |

The defaults for numeric-numeric, numeric-ordered, and ordered-ordered associations can be changed via arguments:

```{r}
assocSelect(df_ass,
  method_num_num = "kendall",
  method_num_ord = "spearman",
  method_ord_ord = "kendall"
)
```


All other combinations use fixed methods (`eta` or `cramersv`) appropriate for measuring association strength.

### Example with Mixed Types

```{r}
df_ass <- data.frame(
  height = rnorm(10),
  weight = rnorm(10),
  group  = factor(sample(c("A", "B"), 10, replace = TRUE)),
  score  = ordered(sample(1:3, 10, replace = TRUE))
)

res7 <- assocSelect(df_ass, threshold = 1, method = "bron-kerbosch", use_pivot = TRUE)
res7
```


Each pairwise association is bounded to [0,1] and treated analogously to correlation.

---

## Theory

Given a symmetric correlation matrix \( R \in \mathbb{R}^{p \times p} \), we seek all **maximal subsets** \( S \subseteq \{1, \dots, p\} \) such that:

\[
\forall i, j \in S,\ i \neq j: \ |R_{ij}| < t
\]

for a fixed threshold \( t \in (0, 1) \).

This is equivalent to finding all **maximal cliques** in the **thresholded correlation graph**, where:

- Nodes represent variables  
- Edges connect nodes whose absolute correlation is **below** the threshold  

A maximal clique corresponds to a variable subset that cannot be extended without violating the correlation limit.

---

## Algorithms

### ELS (Eppstein–Löffler–Strash)

The ELS algorithm efficiently enumerates all maximal cliques in a sparse graph using **degeneracy ordering**:

1. Compute a degeneracy ordering \( v_1, \dots, v_p \).  
2. For each \( i \), extend current clique \( S \) from \( \{v_i\} \) within candidate set \( C = \{v_{i+1}, \dots, v_p\} \).  
3. Recursively build cliques, pruning when no further vertices can be added.

Formally, define:

\[
\text{extend}(S, C) =
\begin{cases}
S, & C = \emptyset, \\
\bigcup_{v \in C} \text{extend}(S \cup \{v\},\ C \setminus (N(v) \cup \{v\})), & \text{otherwise}.
\end{cases}
\]

ELS avoids redundant exploration, achieving good performance on typical correlation graphs.

### Bron–Kerbosch (with Pivoting)

The classical Bron–Kerbosch algorithm enumerates maximal cliques via recursive backtracking with optional pivoting:

Let \( R \) = current clique, \( P \) = prospective nodes, \( X \) = excluded nodes. Then:

\[
\text{BK}(R, P, X) =
\begin{cases}
\text{report}(R), & P = X = \emptyset, \\
\text{for each } v \in P \setminus N(u): \\
\quad \text{BK}(R \cup \{v\},\ P \cap N(v),\ X \cap N(v)), \
\quad P \leftarrow P \setminus \{v\},\ X \leftarrow X \cup \{v\}.
\end{cases}
\]

Choosing a pivot \( u \in P \cup X \) and iterating over \( P \setminus N(u) \) reduces recursive calls.

---

## Why corrselect?

Most existing R tools:

- Filter one variable at a time (e.g. `findCorrelation`)  
- Use greedy or backward-selection heuristics  
- Do not enumerate **all** valid subsets  

**corrselect** uniquely provides:

- **Exact** enumeration of maximal subsets  
- Support for multiple correlation measures  
- Optional forcing of variables  
- Full inspection via `CorrCombo` objects  
- Fast C++ implementations via Rcpp  

This makes it ideal for pipelines where **interpretability** and **completeness** are essential.

---

## Inspecting Results

Convert results for downstream use:

```{r}
df_res <- as.data.frame(res)
head(df_res)
```

Extract individual subsets:

```{r}
lapply(corrSubset(res, df, which = 1:2), function(x) head(x, 10))
```

Summarize correlation metrics:

```{r}
# Number and size of subsets
length(res@subset_list)
summary(lengths(res@subset_list))

# Summaries of within-subset correlations
summary(res@max_corr)
summary(res@avg_corr)
```

---

## CorrCombo Object Structure

A `CorrCombo` S4 object contains:

- `subset_list`: list of character vectors (variable names)  
- `avg_corr`, `min_corr`, `max_corr`: numeric vectors of correlation metrics  
- `threshold`, `forced_in`, `search_type`, `cor_method`, `n_rows_used`  
- Attribute `use_pivot` (if applicable)  

Inspect slots:

```{r}
str(res@subset_list)
```

---

## Session Info

```{r}
sessionInfo()
```
