Compute LD (Linkage Disequilibrium) Correlation Matrix from Genotypes

Computes a pairwise Pearson correlation matrix from a genotype matrix. Supports three variance conventions:

"sample": Standard sample variance with N-1 denominator (default). Uses mean imputation for missing genotypes, then Rfast::cora (if available) or base cor().
"population": Population variance with N denominator, matching GCTA-style tools (e.g. DENTIST, GCTA --make-grm). Per-SNP means are computed from non-missing values; missing entries are set to zero after centering so they do not contribute to cross-products. Cross-products are normalized by the total sample count N, not by pairwise non-missing counts.
"gcta": GCTA per-pair missing data correction. Like "population" but applies a correction term for each SNP pair based on the number of jointly non-missing samples. Matches the exact formula from the DENTIST C++ binary's calcLDFromBfile_gcta. Use this when missingness varies substantially across SNPs and accuracy of individual LD entries matters.

Usage

compute_LD(
  X,
  method = c("sample", "population", "gcta"),
  backend = c("internal", "snprelate", "snpstats"),
  trim_samples = FALSE,
  shrinkage = 0
)

Arguments

X

Numeric genotype matrix (samples x SNPs). May contain NA for missing genotypes.

method

Character, one of "sample" (default, N-1 denominator), "population" (N denominator, GCTA-style), or "gcta" (per-pair missing data correction). Partial matching is supported.

backend

Character, one of "internal" (default), "snprelate", or "snpstats". Controls which library computes the correlation matrix when method = "sample":

"internal": Uses Rfast::cora if available, otherwise base cor().
"snprelate": Requires a temporary GDS file; uses SNPRelate::snpgdsLDMat(method = "corr").
"snpstats": Converts to SnpMatrix; uses snpStats::ld(, stat = "R").

The "snprelate" and "snpstats" backends are only supported with method = "sample"; combining them with other methods will raise an error.

trim_samples

Logical. If TRUE and method is "population" or "gcta", drops trailing samples so that nrow(X) is a multiple of 4, matching PLINK .bed file chunk processing. Ignored when method = "sample". Default is FALSE.
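The trimming rule can be illustrated with a minimal sketch. The helper name trim_to_multiple_of_4 is hypothetical and is not part of the package; it only shows what "dropping trailing samples so that nrow(X) is a multiple of 4" amounts to, and the package's own implementation may differ in detail.

```r
# Hypothetical helper (not the package's code): keep the leading rows so that
# the number of remaining samples is divisible by 4.
trim_to_multiple_of_4 <- function(X) {
  n_keep <- nrow(X) - (nrow(X) %% 4)
  X[seq_len(n_keep), , drop = FALSE]
}

X_demo <- matrix(0, nrow = 50, ncol = 3)
nrow(trim_to_multiple_of_4(X_demo))  # 48: the last two samples are dropped
```

Note that under this rule a matrix with fewer than 4 samples would be trimmed to zero rows.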

shrinkage

Numeric in [0, 1]. Shrinks the LD matrix toward the identity: R_s = (1 - shrinkage) * R + shrinkage * I. Useful for regularizing LD for summary-statistics-based methods such as lassosum (Mak et al. 2017). Default is 0 (no shrinkage).
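As a rough illustration of the shrinkage formula, the sketch below applies R_s = (1 - shrinkage) * R + shrinkage * I to a small matrix. The function name shrink_ld_sketch and its input checks are assumptions made for this example, not the package's internals.

```r
# Hypothetical helper (not the package's code): shrink a correlation matrix
# toward the identity matrix.
shrink_ld_sketch <- function(R, shrinkage = 0) {
  stopifnot(is.matrix(R), nrow(R) == ncol(R),
            shrinkage >= 0, shrinkage <= 1)
  (1 - shrinkage) * R + shrinkage * diag(nrow(R))
}

R_demo <- matrix(c(1, 0.8, 0.8, 1), nrow = 2,
                 dimnames = list(c("rs1", "rs2"), c("rs1", "rs2")))
R_shrunk <- shrink_ld_sketch(R_demo, shrinkage = 0.1)
# Off-diagonal entries are scaled by (1 - shrinkage): 0.8 becomes 0.72.
# The diagonal stays at 1.
```

With shrinkage = 0 the matrix is returned unchanged, and with shrinkage = 1 it collapses to the identity.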

Value

A symmetric correlation matrix with row and column names taken from colnames(X).

Details

Missing data handling. With method = "sample", missing values are mean-imputed per SNP before computing the full Pearson correlation matrix. With method = "population", per-SNP means are computed from non-missing values, the matrix is centered, and NAs are then set to 0 so that missing entries contribute nothing to the cross-product. The denominator is always the total sample count N (after optional trimming), matching the original GCTA formula:

$$\text{Var}(X_i) = E[X_i^2] - E[X_i]^2$$

$$\text{Cor}(X_i, X_j) = \frac{\text{Cov}(X_i, X_j)}{\sqrt{\text{Var}(X_i)\,\text{Var}(X_j)}}$$
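The two missing-data strategies can be sketched in a few lines of base R. Both helpers below (ld_sample_sketch and cov_population_sketch) are hypothetical names written for illustration and are not the package's internals. The second helper covers only the cross-product step described above (center on non-missing means, zero-fill, divide by N). The variance formula is checked on complete data only, since how the package evaluates it under missingness is an implementation detail this sketch does not assume.

```r
# Hypothetical helper: "sample" convention. Mean-impute each SNP, then take
# the ordinary Pearson correlation with base cor().
ld_sample_sketch <- function(X) {
  X <- as.matrix(X)
  storage.mode(X) <- "double"
  snp_means <- colMeans(X, na.rm = TRUE)
  na_idx <- which(is.na(X), arr.ind = TRUE)
  if (nrow(na_idx) > 0) X[na_idx] <- snp_means[na_idx[, "col"]]
  stats::cor(X)
}

# Hypothetical helper: cross-product step of the "population" convention.
# Center on non-missing means, set missing entries to 0, divide by total N.
cov_population_sketch <- function(X) {
  X <- as.matrix(X)
  storage.mode(X) <- "double"
  n <- nrow(X)
  Xc <- sweep(X, 2, colMeans(X, na.rm = TRUE), FUN = "-")
  Xc[is.na(Xc)] <- 0
  crossprod(Xc) / n
}

set.seed(1)
X_full <- matrix(sample(0:2, 200, replace = TRUE), nrow = 20,
                 dimnames = list(NULL, paste0("rs", 1:10)))

# Complete data: Var = E[X^2] - E[X]^2 with an N denominator, and the
# resulting correlations agree with cor() because N and N-1 cancel.
C_full <- cov_population_sketch(X_full)
var_formula <- colMeans(X_full^2) - colMeans(X_full)^2
R_pop <- C_full / sqrt(outer(var_formula, var_formula))
all.equal(R_pop, stats::cor(X_full))

# One missing genotype: the missing entry adds nothing to the cross-product,
# but the divisor is still the total sample count.
X_miss <- X_full
X_miss[3, 2] <- NA
R_sample <- ld_sample_sketch(X_miss)
C_miss <- cov_population_sketch(X_miss)
```

On complete data the choice between the N and N-1 denominators does not change the correlations, so the conventions can only differ when genotypes are missing.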

Zero-variance SNPs. Any monomorphic SNP will have zero variance, producing NaN correlations. These are set to 0 in the returned matrix; the diagonal is forced to 1.
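The clean-up step for monomorphic SNPs can be sketched as follows. The helper fix_zero_variance_sketch is a hypothetical name used for illustration; it uses base cor() to produce the undefined entries, which R reports as NA, and then applies the rule described above.

```r
# Hypothetical helper (not the package's code): replace undefined correlations
# with 0 and force the diagonal to 1.
fix_zero_variance_sketch <- function(R) {
  R[is.na(R)] <- 0   # is.na() is TRUE for both NA and NaN
  diag(R) <- 1
  R
}

# rs2 is monomorphic, so every correlation involving it is undefined.
X_demo <- cbind(rs1 = c(0, 1, 2, 1),
                rs2 = c(1, 1, 1, 1),
                rs3 = c(2, 1, 0, 0))
R_raw <- suppressWarnings(stats::cor(X_demo))
R_fixed <- fix_zero_variance_sketch(R_raw)
```

Entries between polymorphic SNPs (here rs1 and rs3) are left untouched; only the row and column of the monomorphic SNP change.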

Examples

if (FALSE) { # \dontrun{
set.seed(1)  # make the simulated genotypes reproducible
X <- matrix(sample(0:2, 500, replace = TRUE), nrow = 50)  # 50 samples x 10 SNPs
colnames(X) <- paste0("rs", 1:10)

# Standard sample correlation (default)
R1 <- compute_LD(X)

# GCTA-style population variance
R2 <- compute_LD(X, method = "population")

# GCTA-style with per-pair missing data correction
R3 <- compute_LD(X, method = "gcta")
} # }