Prunes columns in linkage disequilibrium (LD) using one of two backends. The default "hclust" backend computes the full correlation matrix, runs single-linkage hierarchical clustering on the distance 1 - |cor|, and keeps one representative column per cluster. The "snprelate" backend delegates to SNPRelate::snpgdsLDpruning, which performs a greedy sliding-window prune on a temporary GDS file.
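
The "hclust" approach described above can be sketched in base R. This is an illustration under stated assumptions, not the package's actual code: the helper name prune_hclust_sketch is hypothetical, the tree is assumed to be cut at height 1 - cor_thres, and the first column of each cluster is assumed to be the representative. The real function may choose survivors differently.

```r
# Hypothetical helper illustrating the "hclust" idea; not the package's code.
prune_hclust_sketch <- function(X, cor_thres = 0.8) {
  stopifnot(is.matrix(X), ncol(X) >= 2)
  # Distance between columns: 1 - |cor|, so strongly correlated columns are close.
  d <- as.dist(1 - abs(cor(X)))
  # Single linkage: columns end up in one cluster when a chain of
  # sufficiently correlated pairs connects them.
  tree <- hclust(d, method = "single")
  # Cutting at height 1 - cor_thres groups columns linked at |cor| >= cor_thres.
  cl <- cutree(tree, h = 1 - cor_thres)
  # Assumed tie-break: keep the first column seen in each cluster.
  keep <- which(!duplicated(cl))
  list(X.new = X[, keep, drop = FALSE], filter.id = keep)
}

set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)
X[, 2] <- X[, 1] + rnorm(100, sd = 0.01)   # near-duplicate of col 1
sketch <- prune_hclust_sketch(X, cor_thres = 0.9)
sketch$filter.id
#> [1] 1 3 4 5
```

Because single linkage chains pairs together, two columns in the same cluster need not exceed the threshold with each other directly; they only need a path of correlated neighbours between them.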

Usage

ld_prune_by_correlation(
  X,
  cor_thres = 0.8,
  backend = c("hclust", "snprelate"),
  verbose = FALSE
)

Arguments

X

Numeric matrix. Columns are the variables to prune (typically SNP genotype dosages); rows are observations.

cor_thres

Numeric in (0, 1). Absolute correlation threshold. Columns whose pairwise |cor| exceeds this are grouped; one survivor is kept per group. Default 0.8.

backend

Character, one of "hclust" (default) or "snprelate". Controls the pruning algorithm:

"hclust"

Uses the internal hierarchical-clustering approach with Rfast::cora (if available) or base cor().

"snprelate"

Requires SNPRelate and gdsfmt. Creates a temporary GDS file and runs SNPRelate::snpgdsLDpruning(method = "corr").
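
As a rough illustration of how the backend argument and the optional dependencies could be handled, the sketch below uses match.arg() and requireNamespace(). Both helper names (resolve_backend_sketch, cor_matrix_sketch) are hypothetical, and the package may implement these checks differently.

```r
# Hypothetical helpers; the package's internal checks may differ.
resolve_backend_sketch <- function(backend = c("hclust", "snprelate")) {
  # match.arg() returns the first choice ("hclust") when the argument is left
  # at its default, and raises an error for any value outside the choices.
  backend <- match.arg(backend)
  if (backend == "snprelate") {
    needed <- c("SNPRelate", "gdsfmt")
    ok <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
    if (!all(ok)) {
      stop("backend = \"snprelate\" requires: ",
           paste(needed[!ok], collapse = ", "))
    }
  }
  backend
}

# Correlation matrix via Rfast::cora when Rfast is installed, else base cor().
cor_matrix_sketch <- function(X) {
  if (requireNamespace("Rfast", quietly = TRUE)) Rfast::cora(X) else cor(X)
}

resolve_backend_sketch()
#> [1] "hclust"
```

Checking for the optional packages up front means a missing dependency is reported before any correlation work or temporary file creation begins.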

verbose

Logical. If TRUE, print progress messages. Default FALSE.

Value

A list with:

X.new

Matrix containing the retained columns of X.

filter.id

Integer vector of the column indices of X that were retained (in original order).

Examples

set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)
X[, 2] <- X[, 1] + rnorm(100, sd = 0.01)   # near-duplicate of col 1
res <- ld_prune_by_correlation(X, cor_thres = 0.9)
ncol(res$X.new)                            # one of columns 1 and 2 is dropped
#> [1] 4