Compute LD (Linkage Disequilibrium) Correlation Matrix from Genotypes
Source:R/misc.R
compute_LD.RdComputes a pairwise Pearson correlation matrix from a genotype matrix. Supports three variance conventions:
"sample"Standard sample variance with N-1 denominator (default). Uses mean imputation for missing genotypes, then
Rfast::cora(if available) or basecor()."population"Population variance with N denominator, matching GCTA-style tools (e.g. DENTIST, GCTA –make-grm). Per-SNP means are computed from non-missing values; missing entries are set to zero after centering so they do not contribute to cross-products. Cross-products are normalized by the total sample count N, not by pairwise non-missing counts.
"gcta"GCTA per-pair missing data correction. Like
"population"but applies a correction term for each SNP pair based on the number of jointly non-missing samples. Matches the exact formula from the DENTIST C++ binary'scalcLDFromBfile_gcta. Use this when missingness varies substantially across SNPs and accuracy of individual LD entries matters.
Arguments
- X
Numeric genotype matrix (samples x SNPs). May contain
NAfor missing genotypes.- method
Character, one of
"sample"(default, N-1 denominator),"population"(N denominator, GCTA-style), or"gcta"(per-pair missing data correction). Partial matching is supported.- backend
Character, one of
"internal"(default),"snprelate", or"snpstats". Controls which library computes the correlation matrix whenmethod = "sample":"internal"Uses
Rfast::coraif available, otherwise basecor()."snprelate"Requires a temporary GDS file; uses
SNPRelate::snpgdsLDMat(method = "corr")."snpstats"Converts to
SnpMatrix; usessnpStats::ld(, stat = "R").
The
"snprelate"and"snpstats"backends are only supported withmethod = "sample"; combining them with other methods will raise an error.- trim_samples
Logical. If
TRUEandmethodis"population"or"gcta", drops trailing samples so thatnrow(X)is a multiple of 4, matching PLINK .bed file chunk processing. Ignored whenmethod = "sample". Default isFALSE.- shrinkage
Numeric in (0, 1]. Shrink the LD matrix toward the identity:
R_s = (1 - shrinkage) * R + shrinkage * I. Useful for regularizing LD for summary-statistics-based methods such as lassosum (Mak et al 2017). Default is 0 (no shrinkage).
Details
Missing data handling.
With method = "sample", missing values are mean-imputed per SNP
before computing the full Pearson correlation matrix.
With method = "population", per-SNP means are computed from
non-missing values, the matrix is centered, then NAs are set to 0
so that missing pairs contribute nothing to the cross-product.
The denominator is always the total sample count N
(after optional trimming), matching the original GCTA formula:
$$\text{Var}(X_i) = E[X_i^2] - E[X_i]^2$$
$$\text{Cor}(X_i, X_j) = \frac{\text{Cov}(X_i, X_j)}{\sqrt{\text{Var}(X_i)\,\text{Var}(X_j)}}$$
Zero-variance SNPs.
Any monomorphic SNP will have zero variance, producing NaN
correlations. These are set to 0 in the returned matrix; the diagonal
is forced to 1.
Examples
if (FALSE) { # \dontrun{
X <- matrix(sample(0:2, 500, replace = TRUE), nrow = 50)
colnames(X) <- paste0("rs", 1:10)
# Standard sample correlation (default)
R1 <- compute_LD(X)
# GCTA-style population variance
R2 <- compute_LD(X, method = "population")
# GCTA-style with per-pair missing data correction
R3 <- compute_LD(X, method = "gcta")
} # }