---
title: "Description of the workflow for constructing a genetic map using *hsrecombi*"
output: rmarkdown::html_vignette
author: D. Wittenburg
vignette: >
  %\VignetteIndexEntry{Description of the workflow for constructing a genetic map using *hsrecombi*}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  tidy.opts = list(width.cutoff = 100), tidy = TRUE
)
```

## Introduction

*hsrecombi* can approximate the genetic-map positions of genetic markers from SNP genotypes in half-sib families. It is referred to [Hampel et al. (2018, Fron Gen)](https://doi.org/10.3389/fgene.2018.00186) and [Qanbari & Wittenburg (2020, Genet Sel Evol)](https://doi.org/10.1186/s12711-020-00593-z) for more information on the methodology. The workflow relies on  paternal half-sib families but it can be adapted to maternal half-sib families with minor modifications.

In **Part I**, genotypic data from progeny and their common parents are simulated with the R package [*AlphaSimR*](https://CRAN.R-project.org/package=AlphaSimR) and reshaped into the PLINK format. If genotypic data are already available in the required PLINK format, the workflow starts directly at **Part II**. 

Required input files for Part II:

* map files for each chromosome `map_chr<chr>.map`
* data files for each chromosome `hsphase_input_chr<chr>.raw`

**Note:** A make pipeline for the main analysis (Part II) is available at [GitHub](https://github.com/wittenburg/hsrecombi).

```{r setup}
library(hsrecombi) 
library(AlphaSimR)
library(doParallel)
library(ggplot2)
library(rlist)
Sys.time()
```

### Global variables
```{r global}
# number of chromosomes
nchr <- 2
# Number of simulated SNPs (e.g., p = 1500 resembles an average bovine chromosome)
p <- 150
# Number of simulated progeny in each half-sib family 
n <- 1000
# Directory for (simulated) data
path.dat <- 'data'
dir.create(path.dat, showWarnings = FALSE)
# Directory for output
path.res <- 'results'
dir.create(path.res, showWarnings = FALSE)
# Number of computing clusters allocated
nclust <- 2
```
For instance, executing the following workflow with $p=1500$ and $n\geq 1000$ requires about 120 Mb storage space and 10 min computing time (2.6 GHz processor). With $p=150$, it takes about 7 Mb and 20 sec.

## Part I: Data simulation and preparation

### 1: Simulation of genetic data with R package *AlphaSimR*

```{r alphasim}
founderPop <- runMacs2(nInd = 1000, nChr = nchr, segSites = p)
SP <- SimParam$new(founderPop)
SP$setSexes("yes_sys")
# Enable tracing location of recombination events
SP$setTrackRec(TRUE)  

pop <- newPop(founderPop)
N <- 10; ntotal <- N * n
my_pop <- selectCross(pop = pop, nFemale = 500, nMale = N, use = "rand", nCrosses = ntotal)

probRec <- list()
for(chr in 1:nchr){
  co.pat <- matrix(0, ncol = p, nrow = ntotal)
  for(i in 1:ntotal){
    if(nrow(SP$recHist[[1000 + i]][[chr]][[2]]) > 1){
      # 1. line contains 1 1 by default
      loci <- SP$recHist[[1000 + i]][[chr]][[2]][-1, 2] 
      co.pat[i, loci] <- 1 
    }
  }
  probRec[[chr]] <- colMeans(co.pat)
}

save(list = c('SP', 'founderPop', 'pop', 'my_pop', 'ntotal', 'probRec'), file = 'data/pop.Rdata')
```

Two chromosomes, each with $p$ SNPs and 1 Morgan length, are simulated. The founder population consists of 1\,000 animals with equal gender distribution. The follow-up generation consists of $N$ paternal-half sib families with equal number of progeny $n$. The study design with the equal number of progeny per family was preferred here for simplicity but this is not required in the subsequent steps.

### 2: Selection of sire haplotypes and progeny genotypes
```{r genetic-data}
PAT <- my_pop@father
rown <- paste(rep(unique(PAT), each = 2), c(1, 2), sep = '_')
H.pat <- pullSegSiteHaplo(pop)[rown, ]
X <- pullSegSiteGeno(my_pop)
# Physical position of markers in Mbp
map.snp <- lapply(founderPop@genMap, function(z) z * 100)
```
Note that `founderPop@genMap` includes genetic positions in Morgan units but physical positions are required in the next steps.

### 3: Reshaping data into PLINK format
```{r plink-format}
map <- data.frame(Chr = rep(1:length(map.snp), unlist(lapply(map.snp, length))), 
                  Name = paste0('SNP', 1:length(unlist(map.snp))),
                  locus_Mb = unlist(map.snp), 
                  locus_bp = unlist(map.snp) * 1e+6)

colnames(X) <- map$Name
FID <- 'FAM001'
IID <- my_pop@id
MAT <- my_pop@mother
SEX <- 2
PHENOTYPE <- -9

for(chr in 1:nchr){
  write.table(map[map$Chr == chr, ], file.path(path.dat, paste0('map_chr', chr, '.map')), 
              col.names = F, row.names = F, quote = F)
  write.table(cbind(FID, IID, PAT, MAT, SEX, PHENOTYPE, X[, map$Chr == chr]), 
              file.path(path.dat, paste0('hsphase_input_chr', chr, '.raw')), col.names = T, row.names = F, quote = F) 
}
```


## Part II: Main analysis 
```{r parallel-computing}
cl <- makeCluster(nclust)
registerDoParallel(cl)
```


### Steps 1-5: Calculation of pairwise recombination rates
```{r recombination-rate}
out <- foreach(chr = 1:nchr, .packages = 'hsrecombi') %dopar% {
  
  # 1: Physical  map
  map <- read.table(file.path(path.dat, paste0('map_chr', chr, '.map')), col.names = c('Chr', 'SNP', 'locus_Mb', 'locus_bp'))
 
  # 2: Genotype matrix
  genomatrix <- data.table::fread(file.path(path.dat, paste0('hsphase_input_chr', chr, '.raw')))
  X <- as.matrix(genomatrix[, -c(1:6)])
  X[is.na(X)] <- 9 # required for hsphase
  
  # 3: Assign daughters to sire IDs
  daughterSire <- genomatrix$PAT
  
  # 4: Estimate sire haplotypes and format data 
  hap <- makehappm(unique(daughterSire), daughterSire, X)
  save('hap', file = file.path(path.res, paste0('hsphase_output_chr', chr, '.Rdata')))
  
  # Check order and dimension
  io <- sapply(1:nrow(map), function(z){grepl(x = colnames(X)[z], pattern = map$SNP[z])})
  if(sum(io) != nrow(map)) stop("ERROR in dimension")
  
  # 5: Estimate recombination rates
  res <- hsrecombi(hap, X)
  final <- editraw(res, map)
  save(list = c('final', 'map'), file = file.path(path.res, paste0("Results_chr", chr, ".Rdata")))
  
  ifelse(nrow(final) > 0, 'OK', 'no result')
}
print(which(unlist(out) == 'OK'))
```

In step 4, function `makehap` would be sufficient but `makehappm` allows comparison with a deterministic approach shown later. Note that sire haplotypes can also be obtained by some other (external) software. Then sire haplotypes and sire-to-daughter information need to be reshaped with `makehaplist` in step 4.

### 6: Check for candidates of misplacement
```{r misplaced}
# 6a: Filter SNPs with unusually large recombination rate to neighbouring (30) SNPs
excl <- foreach(chr = 1:nchr, .packages = 'hsrecombi') %dopar% {
  load(file.path(path.res, paste0("Results_chr", chr, ".Rdata")))
  checkCandidates(final, map)
}

# 6b: Heatmap plot of recombination rates for visual verification, e.g.:
chr <- 2
load(file.path(path.res, paste0("Results_chr", chr, ".Rdata")))
cand <- excl[[chr]][1]
win <- match(cand, map$SNP) + (-100:100)
win <- win[(win >= 1) & (win <= nrow(map))]

target <- final[(final$SNP1 %in% map$SNP[win]) & (final$SNP2 %in% map$SNP[win]), ]
target$SNP1 <- match(target$SNP1, map$SNP)
target$SNP2 <- match(target$SNP2, map$SNP)

ggplot(data = target, aes(SNP2, SNP1, fill = theta)) + 
  geom_tile() +
  xlab("Locus 2") +
  ylab("Locus 1") +
  coord_equal() + 
  scale_y_continuous(trans = "reverse") +
  theme(panel.background = element_blank(), 
        panel.grid.major = element_line(colour = "grey", linewidth = 0.1),
        panel.grid.minor = element_line(colour = "grey")) +
  theme(text = element_text(size = 18)) +
  scale_fill_gradientn(colours = c('yellow', 'red'), limits = c(0, 1+1e-10), na.value = 'white')

# -> nothing conspicious
excl[[1]] <- excl[[2]] <- NA
```

Orange colour in the SNP neighbourhood implies an increased recombination rate. In case of a misplaced marker or a cluster of misplaced markers, the heatmap is characterised by an orange band.

### 7: Approximation of genetic positions

```{r genetic-position}
pos <- foreach(chr = 1:nchr, .packages = 'hsrecombi') %dopar% {
  load(file.path(path.res, paste0("Results_chr", chr, ".Rdata")))
  geneticPosition(final, map, exclude = excl[[chr]])
}
```

Genetic position is reported as NA if the corresponding SNP has not been considered in the analysis (e.g., when no sire was heterozygous at the corresponding SNP).

```{r stop-parallel}
stopCluster(cl)
```


### 8: Plot of physical versus genetic-map positions

```{r plot-1}
data <- data.frame(Chr = rep(1:length(pos), times = unlist(list.select(pos, length(pos.Mb)))), 
                   Mbp = unlist(list.select(pos, pos.Mb)), cM = unlist(list.select(pos, pos.cM)))
ggplot(data, aes(x = Mbp, y = cM)) + geom_point(na.rm = T) + facet_grid(Chr ~ .)
```


### 9: Plot of genetic mapping function

The genetic mapping function that fits best to the estimated data can be retrieved from a system of mapping functions [(Rao et al. 1977, Human Hered)](https://doi.org/10.1159/000152856); putatively misplaced markers need to be excluded.
Given a parameter $p$, recombination rate $\theta \in (0,0.5)$ is mapped to Morgan units by
$$
f(\theta; p) = \frac{1}{6}(p(2p - 1)(1 - 4p) log(1 - 2\theta) +
               16p(p - 1)(2p - 1) \text{atan}(2\theta) +
               2p(1 - p)(8p + 2)\text{atanh}(2\theta) +
               6(1 - p)(1 - 2p)(1 - 4p)\theta) 
$$
```{r plot-2}
for (chr in 1:nchr) {
  load(file.path(path.res, paste0("Results_chr", chr, ".Rdata")))
  final$theta[(final$SNP1 %in% excl[[chr]]) | (final$SNP2 %in% excl[[chr]])] <- NA
  final$dist_M <- (pos[[chr]]$pos.cM[final$SNP2] - pos[[chr]]$pos.cM[final$SNP1]) / 100
  out <- bestmapfun(theta = final$theta, dist_M = final$dist_M)
  plot(final$dist_M, final$theta, xlab = 'genetic distance (Morgan)', ylab = 'recombination rate', col = 8)
  x <- seq(0, 0.5, 0.01); y <- rao(out$mixing, x); points(y, x, type = 'l', lwd = 2)
  legend('bottomright', legend = paste0('map function (p=', round(out$mixing, 3), ')'), lwd = 2, lty = 1, bty = 'n')
}
```
The mixing parameter $p$ is estimated via quadratic optimisation in `bestmapfun`. A value of $p=0$ would match to Morgan's mapping function, $p=0.25$ to Carter, $p=0.5$ to Kosambi and $p=1$ resembles Haldane's mapping function.


## Part III: Visual comparison with simulated data and deterministic approach

```{r selection}
chr <- 2
```

### 1: Check with simulated data

```{r check-simulated}
load(file.path(path.dat, 'pop.Rdata'))
sim.cM <- cumsum(probRec[[chr]]) * 100
```
The position of a recombination event is traced in `SP$recHist` which has a nested list structure (individual, chromosome, female/male haplotype). The recombination rate between adjacent SNPs is determined as the proportion of recombinant progeny and the positions are derived as cumulative sum over recombination rates.

### 2: Check with deterministic approach

```{r check-deterministic}
load(file.path(path.res, paste0('hsphase_output_chr', chr, '.Rdata')))
hsphase.cM <- hap$gen
```

The results are compared to the deterministic approach of [Ferdosi et al. (2014)](https://doi.org/10.1186/1297-9686-46-11) provided in the R package [*hsphase*]( https://CRAN.R-project.org/package=hsphase). Functions of that package are called in `makehap` and `makehappm`.

### 3: Comparative plot of physical versus genetic-map positions

```{r plot-3}
par(mar=c(5.1, 4.1, 4.1, 9.1), xpd = TRUE)
plot(pos[[chr]]$pos.Mb, pos[[chr]]$pos.cM, xlab = 'physical position (Mbp)', ylab = 'genetic position (cM)', 
     ylim = range(c(sim.cM, hsphase.cM, pos[[chr]]$pos.cM), na.rm = T), pch = 20)
points(pos[[chr]]$pos.Mb, sim.cM, pch = 20, col = 4)
points(pos[[chr]]$pos.Mb, hsphase.cM, pch = 20, col = 8)
legend('topleft', inset=c(1.01,0), legend = c('simulated', 'likelihood-based', 'deterministic'), pch = 20, col = c(4, 1, 8), bty = 'n')
```
The likelihood-based approach requires a sufficiently large sample size if SNP density is high. A verification of the bias of genetic-map positions can be found in the Supplemental Material of [Qanbari & Wittenburg (2020, Genet Sel Evol)](https://doi.org/10.1186/s12711-020-00593-z).

```{r cleanup}
unlink(path.dat, recursive = TRUE)
unlink(path.res, recursive = TRUE)
Sys.time()
```
