Skip to contents

Computes a pairwise Pearson correlation matrix from a genotype matrix. Supports two variance conventions:

"sample"

Standard sample variance with N-1 denominator (default). Uses mean imputation for missing genotypes, then Rfast::cora (if available) or base cor().

"population"

Population variance with N denominator, matching GCTA-style tools (e.g. DENTIST, GCTA –make-grm). Per-SNP means are computed from non-missing values; missing entries are set to zero after centering so they do not contribute to cross-products. Cross-products are normalized by the total sample count N, not by pairwise non-missing counts.

Usage

compute_LD(X, method = c("sample", "population"), trim_samples = FALSE)

Arguments

X

Numeric genotype matrix (samples x SNPs). May contain NA for missing genotypes.

method

Character, either "sample" (default, N-1 denominator) or "population" (N denominator, GCTA-style). Partial matching is supported.

trim_samples

Logical. If TRUE and method = "population", drops trailing samples so that nrow(X) is a multiple of 4, matching PLINK .bed file chunk processing. Ignored when method = "sample". Default is FALSE.

Value

A symmetric correlation matrix with row and column names taken from colnames(X).

Details

Missing data handling. With method = "sample", missing values are mean-imputed per SNP before computing the full Pearson correlation matrix. With method = "population", per-SNP means are computed from non-missing values, the matrix is centered, then NAs are set to 0 so that missing pairs contribute nothing to the cross-product. The denominator is always the total sample count N (after optional trimming), matching the original GCTA formula: $$\text{Var}(X_i) = E[X_i^2] - E[X_i]^2$$ $$\text{Cor}(X_i, X_j) = \frac{\text{Cov}(X_i, X_j)}{\sqrt{\text{Var}(X_i)\,\text{Var}(X_j)}}$$

Zero-variance SNPs. Any monomorphic SNP will have zero variance, producing NaN correlations. These are set to 0 in the returned matrix; the diagonal is forced to 1.

Examples

if (FALSE) { # \dontrun{
X <- matrix(sample(0:2, 500, replace = TRUE), nrow = 50)
colnames(X) <- paste0("rs", 1:10)

# Standard sample correlation (default)
R1 <- compute_LD(X)

# GCTA-style population variance
R2 <- compute_LD(X, method = "population")
} # }