---
title: "User guide for the noisysbmGGM package"
author: "Valentin Kilian & Fanny Villers"
date: "04/03/2024"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
    number_sections: true
vignette: >
  %\VignetteIndexEntry{User guide for the noisysbmGGM package}
  \usepackage[utf8]{inputenc}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: sentence
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
`noisysbmGGM` is a R package designed to perform graph inference from noisy data, and in particular to infer Gaussian Graphical Model (GGM). The package accompanies the article titled  *Enhancing the Power of Gaussian Graphical Model Inference by Modeling the Graph Structure* by Kilian, Rebafka and Villers available at <arXiv:2402.19021> . The main goal of the package is to help users to analyze complex networks and draw meaningful conclusions from noisy data.

# Introduction
The package offers two key functionalities:

1. Generation of data according to the Noisy Stochastic Block Model (NSBM) and NSBM inference:  The package  provides functions to generate data in the noisy stochastic block model, which is a statistical model that can be used when interactions between entities are described via noisy information. The package includes a greedy algorithm to fit parameters, cluster nodes, and a multiple testing procedure to detect significant interactions.

2. Gaussian Graphical Model (GGM) Inference: when observing a sample of a Gaussian vector, the package allows to infer the GGM that encodes the direct relationships between the variables.


## What kind of data ? 

### Graph inference in the NSBM : 
We observe a noisy version of a graph  between $p$ nodes, typically a $(p,p)$ matrix containing the values of a test statistic applied on all pairs of nodes. The aim is to detect the significant edges in the graph while controlling the number of false discoveries.

### GGM Inference:
In the GGM inference context, we observe a $n$-sample of a Gaussian vector of dimension $p$, and the goal is to infer the GGM that captures the direct relationships between the variables. The GGM inference function starts by computing a $(p,p)$ matrix composed of well chosen test statistics that test conditional dependencies between each pair of variables. Then the same procedure used for NSBM are employed for GGM inference purpose.

## Link with the `noisySBM` package :

`noisysbmGGM` is an improved version of certain aspects of the previous `noisySBM` package, as the new greedy algorithm is more efficient than the previous VEM algorithm to fit model parameters.

Moreover the package introduces the additional applications to GGM inference.  

To start with, we load the package:


```{r}
library(noisysbmGGM)


```

# Noisy stochastic block model

## The model

The **noisy stochastic block model (NSBM)**  is a random graph model suited for the problem of graph inference.
In this model, we assume that the observed matrix is a perturbation of an unobserved binary graph, which is the *true* graph to be inferred. The binary graph is chosen to be a stochastic block model (SBM) for its capacity to model graph heterogeneity.

The definition is stated for an undirected graph without self-loops, but extensions are straightforward. We also consider only the **Gaussian NSBM** but this model can be extended to any parametric model.

Let $p\geq 2$ be the number of nodes and $\mathcal{A}=\{(i,j): 1\leq i <j\leq p\}$ the set of all possible pairs of nodes.
Denote $Q$ the number of latent node blocks.
In the NSBM, we denote $X=(X_{i,j})_{1\leq i,j\leq p}\in\mathbb R^{p^2}$ the  symmetric, real-valued observation matrix.
The observations $X_{i,j}, (i,j)\in\mathcal{A}$, are generated by the following random layers:

1.  First, a vector $Z=(Z_1,\dots,Z_p)$ of block memberships of the nodes is generated, such that $Z_i$, $1\leq i\leq p$, are independent with common distribution given by the probabilities $\pi_q=\mathbb P(Z_1=q), q\in\{1,\dots,Q\},$ for some parameter $\pi=(\pi_q)_{q\in\{1,\dots,Q\}}\in (0,1)^Q$ such that $\sum_{q=1}^Q \pi_q=1$.

2.  Conditionally on $Z$, the variables $A_{i,j}$, $(i,j)\in\mathcal A$, are independent Bernoulli variables with parameter $w_{Z_i,Z_j}$, for some symmetric matrix $w=(w_{kl})_{k,l\in\{1,\dots,Q\}}\in (0,1)^{Q^2}$.   We denote $A=(A_{i,j})_{1\leq i,j\leq p}$ the symmetric adjacency matrix, which is a standard binary SBM.

3.  Conditionally on $(Z,A)$, the observations $X_{i,j}$, $(i,j)\in\mathcal A$ are independent and $X_{i,j}$ is sampled from the distribution
   $$X_{i,j} \sim (1-A_{i,j}) \mathcal N(0,\sigma_0^2) + A_{i,j}  \mathcal N(\mu_{Z_i,Z_j},\sigma_{Z_i,Z_j}^2)$$ 
   for some parameters $\mu=(\mu_{kl})_{k,l\in\{1,\dots,Q\}}\in \mathbb R^{Q^2}$ such that $\mu_{kl}=\mu_{lk}$,  $\sigma=(\sigma_{kl})_{k,l\in\{1,\dots,Q\}}\in (\mathbb R_+^*)^{Q^2}$ such that $\sigma_{kl}=\sigma_{lk}$ and $\sigma_0\in \mathbb R$. We suppose that $\sigma_0$ is known (often equal to $\sigma_0=1$).

The relation between $A$ and the observation $X$ is that missing edges ($A_{i,j}=0$) are replaced with pure random noise modeled by the density $\mathcal N(0,\sigma_0^2)$, also called the *null density*, whereas in place of present edges ($A_{i,j}=1$) there is a signal or effect.
The intensity of the signal depends on the block membership of the interacting nodes in the underlying SBM, modeled by the density $N(\mu_{kl},\sigma_{kl}^2)$, also called *alternative density*.

The Gaussian NSBM is particularly suitable for modeling situations where the observations $X_{i,j}$ correspond to test statistics which are known to be approximately Gaussian.
We often assume that  the variance of the null distribution is known, equal to $\sigma_0=1$.

For the alternative distribution, we often consider  two cases. First,  the case where all $\sigma_{kl}$ are equal to a single known parameter $\sigma_1$.  It is for instance the case when 
$X_{i,j}$ correspond to test statistics which are known to be approximately Gaussian with variance asymptotically equal to 1.
As this is not always the case, we secondly consider the case where the  $\sigma_{kl}$  are unknown, potentially different from each other, and have to be estimated.

We denote $\nu_0=(0, \sigma_0)$ the null parameter (that we suppose known) and $\nu=(\nu_{kl})_{1\leq k, l\leq Q}=(\mu_{kl}, \sigma_{kl})_{1\leq k, l\leq Q}$ the parameters of the effects (with the two mentioned above cases depending on whether the alternative variance is supposed to be known or not). 
The global model parameter is $\theta=(\pi,w,\nu_0, \nu)$, where $\pi$ and $w$ come from the SBM.

## Data generation of the NSBM 

The global parameter $\theta$ is a list and its elements are `pi`, `w`, `nu0` and `nu`.

```{r}
theta <- list(pi=NULL, w=NULL, nu0=NULL, nu=NULL)
```


### SBM parameters

Denote `Q` the number of latent blocks in the SBM.
Say

```{r}
 Q <- 2
```

The parameters `pi` and `w` are the parameters of the latent binary SBM.

The parameter `pi` is a `Q`-vector indicating the mean proportions of nodes per block.
The vector `pi` has to be normalized:

```{r}
theta$pi <- c(2,1)/3
```

The parameter `w` represents the symmetric matrix  $w=(w_{kl})_{k,l\in\{1,\dots,Q\}}$ such that $w_{kl}\in(0,1)$ is a Bernoulli parameter indicating the probability that there is an edge between a node in group $k$ and a node in group $l$. The machine representation of `w` is a vector of length $\frac{Q(Q+1)}{2}$ containing the coefficients of the upper triangle of the matrix from left to right and from top to bottom.
For our graph with two latent blocks, let us consider 

```{r}
theta$w <-c(.8,.1,.9)
```

This means that the two blocks are communities as nodes have a large probability to be connected to nodes in the same block ( $w_{11}=0.8$ and $w_{22}=0.9$), and a low probability to connect to nodes in the other group ($w_{12}=w_{21}=0.1$).

### Gaussian parameters

In the Gaussian model, the null parameter `nu0` is a vector of length 2, giving the mean and the standard deviation of the Gaussian null distrubtion ($\nu_0=(0,\sigma_{0})$).
Mostly, we choose the standard normal distribution:

```{r}
theta$nu0 <- c(0,1)
```

The parameter `nu` is a matrix of dimensions $(\frac{Q(Q+1)}{2},2)$ : the first column indicates the Gaussian means $\mu_{kl}$ and the second column the standard deviations $\sigma_{kl}$ of the alternative distributions for each block pairs $(k,l)$ (corresponding to the coefficients of the upper triangle of respectively the matrices $\mu$ and \sigma$ from left to right and from top to bottom).  

For our  model with two latent blocks, we choose

```{r}
theta$nu <- array(0, dim = c(Q*(Q+1)/2, 2))
theta$nu[,1] <- c(4,4,4)
theta$nu[,2]<-c(1,1,1)
theta$nu
```

### Generate data

To generate a data set with, say, $p=10$ nodes from the corresponding NSBM we use the function `rnsbm()`:

```{r}
p=10
obs <- rnsbm(p, theta)
```

The function `rnsbm()` provides the data generated matrix, which is symmetric.
```{r}
round(obs$dataMatrix, digits = 2)
```

The function `rnsbm()` also provides the latent binary graph, named `latentAdj`, which is also symmetric:

```{r}
obs$latentAdj
```

as well as the latent blocks, named `latentZ`:

```{r}
obs$latentZ
```

A more convenient visualization of the graph and the clustering is given by the function `plot.igraph()` from the package `igraph`:

```{r}
library(igraph)
G=igraph::graph_from_adjacency_matrix(obs$latentAdj)
igraph::plot.igraph(G, vertex.size=5, vertex.dist=4, vertex.label=NA, vertex.color=obs$latentZ, edge.arrow.mode=0) 
```

and a more convenient visualization of the data is given by the function `plotGraphs()`:

```{r}
plotGraphs(obs$dataMatrix, binaryTruth=obs$latentAdj)
```

# NSBM inference, `main_noisySBM()` and output

## The function with the options by default
The `main_noisySBM()` function is a core component of the `noisysbmGGM` package. 
This function implements the greedy algorithm to estimate model parameters and perform node clustering, and applies the multiple testing procedure based on the $l$-values approach to infer the underlying graph. 

The function applies to a symmetric real-valued square matrix `X`.
Let us see a basic example:

```{r}
p=30
obs <- rnsbm(p, theta)
X=obs$dataMatrix
```


```{r}
res1 <- main_noisySBM(X, Qup=10, alpha=0.1,nbCores=1)
```

By default, the algorithm 

- starts from a NSBM with `Qup` blocks. If `Qup` is not specified, the algorithm start with $Qup=10$ blocks by default. You can choose another value of `Qup` or choose $Qup=p$, but be aware that it will be time expensive if $p$ is large.

- applies the greedy algorithm with the variances under the null and the alternative fixed to 1 (`sigma0`$=$ `sigma1` $=1$) (if not desired, see the following Section)

- and infers a graph with a multiple testing procedure at level `alpha`$=0.1$.

The output of `main_noisySBM()` is a list containing the estimated parameters of the NSBM, the inferred clustering and the estimated number of clusters, and the adjacency matrix of the inferred graph.



Here, for instance, we obtain the estimates of the model parameters $\theta$ by

```{r}
res1$theta
```

A node clustering ans the estimated number of clusters are given by

```{r}
res1$Z
res1$Q
```

The inferred latent graph is given by his adjacency matrix

```{r}
res1$A
```

To visualize the result, we can use again the function `graphPlots()` that also returns the FDR and TDR:

```{r}
plotGraphs(dataMatrix= X, binaryTruth=obs$latentAdj, inferredGraph= res1$A)
```


Node clusterings can be compared with the adjusted Rand index.
The package provides a function `ARI()` to evaluate this indicator.
On simulated data we can compare to the true clustering.

```{r}
ARI(res1$Z, obs$latentZ)
```

## NSBM with unknow variances

When the variance $\sigma^2_{kl}$ of the alternative distributions are unknown, one can turn the parameter `NIG` to `TRUE` in order to use a procedure which also estimates the values of $\sigma_{kl}$. (`NIG` refers to the Normal Inverse Gaussian distribution used as prior when both the mean and the variance of a Gaussian distribution are unknown)



```{r}
res2 <- main_noisySBM(X,NIG=TRUE, Qup=10, alpha=0.1,nbCores=1)
```



The output are the same. Note that `res2$theta$nu[,2]` is no longer equal to `1 1 1` but is estimated.
```{r}
res2$Q
res2$Z
```


```{r}
res2$theta
```

```{r}
ARI(res2$Z, obs$latentZ)
```
```{r}
plotGraphs(dataMatrix= X, binaryTruth=obs$latentAdj, inferredGraph= res2$A)
```

## Other options of the `main_noisySBM()` function :

### Initialization

The argument `nbOfZ` is the number of clustering initializations (12 initializations done at random) and you may change default value to increase or decrease the number of initial points of the algorithm.

### Parallel computing

By default, computations are parallelized using the maximum number of available cores. However, the argument nbCores can be used to customize this choice.

NB : For Apple Silicon processors, the detection of the number of cores available does not work. In that case, set the number of cores with nbCores.

# GGM inference, `main_noisySBM_GGM()` and output

The `main_noisySBM_GGM()` function is a key feature of the `noisysbmGGM` package, dedicated to Gaussian Graphical Model (GGM) inference. This function takes an $n$-sample of a Gaussian vector of dimension $p$ and infers the GGM associated with the partial correlation structure of the vector. 
The GGM inference procedure allows to detect significant direct relationships between variables, helping users uncover meaningful interactions, while seeking to control the number of false discoveries.

The function applies to a $n$ by $p$ matrix `X` where $p$ is the number of variables and $n$ the number of observations. $n$ can be smaller or greater than $p$, except for the "zTransform" method where $n$ has to be larger than $p$.

The function `main_noisySBM_GGM()` starts by computing a $p$ by $p$ matrix composed of well chosen test statistics among the following 

```{r}
c("Ren","Jankova_NW","Jankova_GL","Liu_SL","Liu_L","zTransform")
```

and then apply `main_noisySBM()` to this matrix.
Therefore the same parameters/options as `main_noisySBM()` are available.
The value of `NIG` is automatically chosen according to the selected method (NIG=FALSE except for "Liu_SL" and "Liu_L" test statistics as input) but it can also be imposed by hand.
By default, the parameters `alpha` corresponding to the level of the multiple testing procedure is again taken to $0.1$.

Let us see a basic example:

```{r}
X=GGMtest$dataMatrix
```

```{r}
resGGM <- main_noisySBM_GGM(X,Meth="Ren",NIG=FALSE, Qup=10, alpha=0.1,nbCores=1)
```

The output of `main_noisySBM_GGM()` are the same that `main_noisySBM()` :
For instance, the inferred latent GGM is given by his inferred adjacency matrix :

```{r}
resGGM$A
```

A more convenient visualization of the graph and the clustering is given by the function `plot.igraph()` from the package `igraph`:

```{r}
Gggm=igraph::graph_from_adjacency_matrix(resGGM$A)
igraph::plot.igraph(Gggm, vertex.size=5, vertex.dist=4, vertex.label=NA, edge.arrow.mode=0) 
```

