DENTIST: Summary Statistics QC via LD-Based Imputation
Haochen Sun with Claude Opus 4.6
2026-03-01
Source:vignettes/dentist.Rmd
dentist.RmdOverview
DENTIST (Detecting Errors iN analyses of summary staTISTics) performs quality control of GWAS summary statistics by iteratively imputing z-scores using linkage disequilibrium (LD) from a reference panel. Variants whose observed z-scores deviate significantly from their LD-imputed values are flagged as outliers.
The pecotmr package provides a pure R implementation of
DENTIST that can be used in two ways:
-
File-based input via
dentist_from_files()— takes PLINK binary files and a summary statistics file, matching the interface of the original DENTIST C++ binary. -
Direct input via
dentist_single_window()— takes z-scores and an LD matrix (or genotype matrix) directly, suitable for integration into analysis pipelines.
We also demonstrate how the alternative SLALOM + RAISS pipeline achieves comparable QC and imputation on the same data, giving users two complementary approaches for summary statistics quality control.
Toy Data
We use a toy dataset on chromosome 22 distributed with the package
under inst/extdata/. The full dataset has 17,421 variants
and 165 reference panel individuals; we subset to a small region for
fast demonstration.
# Locate package data
# system.file() finds files within installed packages;
# during development, use the inst/extdata path directly
extdata_dir <- system.file("extdata", package = "pecotmr")
if (extdata_dir == "" || !file.exists(file.path(extdata_dir, "toy_ref.bed"))) {
# Fallback for development: look relative to the package source root
extdata_dir <- file.path(getwd(), "..", "inst", "extdata")
if (!file.exists(file.path(extdata_dir, "toy_ref.bed"))) {
extdata_dir <- file.path(getwd(), "inst", "extdata")
}
}
bfile <- file.path(extdata_dir, "toy_ref")
sumstat_gz <- file.path(extdata_dir, "toy_summary.txt.gz")
# Load summary statistics (DENTIST format: SNP A1 A2 freq beta se p N)
sumstat <- as.data.frame(vroom(sumstat_gz, show_col_types = FALSE))
cat(sprintf("Full dataset: %d variants\n", nrow(sumstat)))## Full dataset: 17421 variants
head(sumstat, 3)## SNP A1 A2 freq beta se p N
## 1 rs11089128 G A 0.0804 -0.008047 0.014364 0.5753294 10000
## 2 rs7288972 C T 0.1741 -0.025328 0.028768 0.3786304 10000
## 3 rs11167319 G T 0.1339 -0.001455 0.023300 0.9502074 10000
# Load reference panel variant info
bim <- as.data.frame(vroom(paste0(bfile, ".bim"), col_names = c("chrom", "variant_id", "gd", "pos", "A1", "A2"),
show_col_types = FALSE))
# Select ~500 variants from a contiguous region for fast execution
region_start <- bim$pos[1]
region_end <- region_start + 2e6
region_snps <- bim$variant_id[bim$pos >= region_start & bim$pos <= region_end]
cat(sprintf("Selected %d variants in %.1f Mb region (pos %s - %s)\n",
length(region_snps),
(region_end - region_start) / 1e6,
format(region_start, big.mark = ","),
format(region_end, big.mark = ",")))## Selected 608 variants in 2.0 Mb region (pos 14,560,203 - 16,560,203)
Case 1: File-Based DENTIST
The dentist_from_files() function mirrors the interface
of the DENTIST C++ binary. It takes a summary statistics file path and a
PLINK bfile prefix, then internally handles allele alignment, LD
computation, and windowed imputation.
# Subset summary statistics to the selected region (in R, no plink needed)
sumstat_sub <- sumstat[sumstat$SNP %in% region_snps, ]
sub_sumstat <- tempfile(fileext = ".txt")
readr::write_tsv(sumstat_sub, sub_sumstat)
cat(sprintf("Subset: %d variants in summary stats\n", nrow(sumstat_sub)))## Subset: 608 variants in summary stats
# Run DENTIST via file-based interface
# dentist_from_files() handles matching between summary stats and bfile internally
# nSample = NULL auto-detects reference panel size from .fam file
result_files <- dentist_from_files(
gwas_summary = sub_sumstat,
bfile = bfile,
nSample = NULL,
window_size = 2000000,
nIter = 8,
pValueThreshold = 5.0369e-08,
propSVD = 0.4,
duprThreshold = 0.99,
verbose = TRUE
)## Reading summary statistics...
## 608 variants read
## Reading reference panel bim...
## Rows: 17421 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (3): X2, X5, X6
## dbl (3): X1, X3, X4
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## 17421 variants in reference panel
##
## 608 SNPs in common
##
## Aligning alleles via allele_qc()...
##
## 608 variants after allele QC (from 608 common)
##
## Loading genotype data...
##
## Using reference panel sample size: N = 165
##
## Computing LD for 608 SNPs from 165 samples (291 missing genotypes)...
##
## Running R DENTIST implementation...
##
## 98 duplicated variants out of a total of 608 were found at r threshold of 0.995
##
## Done: 608 variants tested, 119 outliers detected (19.6%)
dentist_result <- result_files$result
cat(sprintf("\nDENTIST result: %d variants, %d outliers (%.1f%%)\n",
nrow(dentist_result),
sum(dentist_result$outlier),
100 * mean(dentist_result$outlier)))##
## DENTIST result: 608 variants, 119 outliers (19.6%)
## SNP original_z imputed_z rsq outlier outlier_stat
## 1 rs11089128 -0.56021999 -0.26562145 0.8154480 FALSE 0.4702648
## 2 rs7288972 -0.88042269 -0.45830626 0.5456049 FALSE 0.3921307
## 3 rs11167319 -0.06244635 -0.37445027 0.4994233 FALSE 0.1944686
## 4 rs8138488 0.23059201 0.01738338 0.6515071 FALSE 0.1304415
## 5 rs2186521 0.37786517 -0.23479057 0.4656640 FALSE 0.7024551
## 6 rs2027649 -0.19730337 -0.47009297 0.7900872 FALSE 0.3545005
The dentist_from_files() return value is a list with
three components:
-
result— data frame with outlier detection results (one row per input variant) -
sum_stat— aligned summary statistics after allele QC -
LD_mat— the LD correlation matrix used for imputation
Case 2: Single-Window DENTIST vs SLALOM + RAISS
For pipeline integration, dentist_single_window()
accepts z-scores and an LD matrix directly. We compare this with the
SLALOM + RAISS pipeline, which uses a different approach to achieve the
same goal: identifying problematic variants and imputing corrected
z-scores.
DENTIST performs iterative SVD-truncated imputation:
it imputes all z-scores from LD, flags outliers based on the discrepancy
statistic (z_obs - z_imp)² / (1 - r²), removes flagged
variants, and repeats.
SLALOM computes the DENTIST-S statistic relative to the lead variant to flag outliers among variants in high LD. RAISS then imputes z-scores for a given set of variants using LD-based regression from the remaining variants.
Prepare aligned data
# Use the aligned data from dentist_from_files for consistency
# This ensures allele QC has already been applied
aligned_sumstat <- result_files$sum_stat
LD_mat <- result_files$LD_mat
n_samples <- nrow(vroom(paste0(bfile, ".fam"), col_names = FALSE, show_col_types = FALSE))
z_scores <- aligned_sumstat$z
n_snps <- length(z_scores)
cat(sprintf("Aligned data: %d variants, %d reference samples\n", n_snps, n_samples))## Aligned data: 608 variants, 165 reference samples
# Build a position-sorted variant info data frame for RAISS
bim_full <- as.data.frame(vroom(paste0(bfile, ".bim"), col_names = c("chrom", "variant_id", "gd", "pos", "A1", "A2"),
show_col_types = FALSE))
bim_aligned <- bim_full[match(aligned_sumstat$SNP, bim_full$variant_id), ]
# Sort everything by position (required by RAISS)
pos_order <- order(bim_aligned$pos)
ref_panel_df <- data.frame(
chrom = bim_aligned$chrom[pos_order],
pos = bim_aligned$pos[pos_order],
variant_id = bim_aligned$variant_id[pos_order],
A1 = bim_aligned$A1[pos_order],
A2 = bim_aligned$A2[pos_order],
stringsAsFactors = FALSE
)
z_sorted <- z_scores[pos_order]
LD_sorted <- LD_mat[pos_order, pos_order]Run DENTIST (single-window)
dentist_sw <- dentist_single_window(
zScore = z_sorted,
R = LD_sorted,
nSample = n_samples,
pValueThreshold = 5.0369e-08,
propSVD = 0.4,
nIter = 8,
duprThreshold = 0.99
)## 98 duplicated variants out of a total of 608 were found at r threshold of 0.995
cat(sprintf("DENTIST single-window: %d variants, %d outliers (%.1f%%)\n",
nrow(dentist_sw),
sum(dentist_sw$outlier),
100 * mean(dentist_sw$outlier)))## DENTIST single-window: 608 variants, 119 outliers (19.6%)
Run SLALOM (QC)
SLALOM uses Approximate Bayes Factors for fine-mapping and the DENTIST-S statistic for outlier detection. It identifies variants whose z-scores are inconsistent with the lead variant’s LD pattern. Note that SLALOM focuses on variants in high LD with the lead variant (r² > threshold), so it typically flags fewer outliers than DENTIST’s genome-wide iterative approach.
slalom_result <- slalom(
zScore = z_sorted,
R = LD_sorted,
nlog10p_dentist_s_threshold = 4.0,
r2_threshold = 0.6
)
# Replace NA outlier flags with FALSE (NAs arise from division by zero
# when a variant is in perfect LD with the lead variant)
slalom_outliers <- slalom_result$data$outliers
slalom_outliers[is.na(slalom_outliers)] <- FALSE
cat(sprintf("SLALOM: %d outliers out of %d variants (%.1f%%)\n",
sum(slalom_outliers),
length(slalom_outliers),
100 * mean(slalom_outliers)))## SLALOM: 1 outliers out of 608 variants (0.2%)
## SLALOM lead variant PIP: 0.0047
## 95% credible set size: 577
Run RAISS on SLALOM-flagged outliers
RAISS imputes z-scores for “unknown” variants using LD with “known” variants. First, we use SLALOM’s outlier flags to define which variants to re-impute — this is the natural SLALOM+RAISS pipeline.
# Helper: run RAISS given a set of outlier flags
run_raiss_on_outliers <- function(outlier_flags, z_vec, ref_df, LD_mat) {
known_idx <- which(!outlier_flags)
known_df <- data.frame(
chrom = ref_df$chrom[known_idx],
pos = ref_df$pos[known_idx],
variant_id = ref_df$variant_id[known_idx],
A1 = ref_df$A1[known_idx],
A2 = ref_df$A2[known_idx],
z = z_vec[known_idx],
stringsAsFactors = FALSE
)
result <- raiss(
ref_panel = ref_df,
known_zscores = known_df,
LD_matrix = LD_mat,
lamb = 0.01, rcond = 0.01,
R2_threshold = 0.6, minimum_ld = 5, verbose = FALSE
)
# Build combined z-score vector
combined_z <- z_vec
if (!is.null(result) && !is.null(result$result_nofilter)) {
raiss_df <- result$result_nofilter
for (i in which(outlier_flags)) {
vid <- ref_df$variant_id[i]
ri <- which(raiss_df$variant_id == vid)
if (length(ri) == 1 && !is.na(raiss_df$z[ri])) {
combined_z[i] <- raiss_df$z[ri]
}
}
}
return(list(z = combined_z, raiss_result = result))
}
cat(sprintf("SLALOM+RAISS: using %d SLALOM outliers as variants to impute\n",
sum(slalom_outliers)))## SLALOM+RAISS: using 1 SLALOM outliers as variants to impute
slalom_raiss <- run_raiss_on_outliers(slalom_outliers, z_sorted, ref_panel_df, LD_sorted)Run RAISS on DENTIST-flagged outliers
Next, we use DENTIST’s outlier flags with RAISS imputation. This lets us directly compare how DENTIST and RAISS impute the same set of variants.
dentist_outlier_flags <- dentist_sw$outlier
cat(sprintf("DENTIST+RAISS: using %d DENTIST outliers as variants to impute\n",
sum(dentist_outlier_flags)))## DENTIST+RAISS: using 119 DENTIST outliers as variants to impute
dentist_raiss <- run_raiss_on_outliers(dentist_outlier_flags, z_sorted, ref_panel_df, LD_sorted)Compare All Three Approaches
We now have three sets of z-scores:
- DENTIST: iterative SVD-truncated imputation (imputes all variants)
- SLALOM+RAISS: SLALOM flags outliers, RAISS imputes those
- DENTIST+RAISS: DENTIST flags outliers, RAISS imputes those
# Build "QC'd z-scores" for each pipeline:
# Each keeps original z for non-outliers, replaces outlier z with imputed values
# DENTIST QC'd: original z for non-outliers, DENTIST imputed z for outliers
dentist_qc_z <- z_sorted
dentist_qc_z[dentist_outlier_flags] <- dentist_sw$imputed_z[dentist_outlier_flags]
# SLALOM+RAISS QC'd: original z for non-outliers, RAISS imputed z for SLALOM outliers
slalom_raiss_z <- slalom_raiss$z
# DENTIST-flags+RAISS QC'd: original z for non-outliers, RAISS imputed z for DENTIST outliers
dentist_raiss_z <- dentist_raiss$z
cat("=== Three-Way Comparison ===\n")## === Three-Way Comparison ===
## Total variants: 608
# Outlier counts
cat("Outlier detection:\n")## Outlier detection:
cat(sprintf(" DENTIST outliers: %d (%.1f%%)\n",
sum(dentist_outlier_flags), 100 * mean(dentist_outlier_flags)))## DENTIST outliers: 119 (19.6%)
## SLALOM outliers: 1 (0.2%)
## Overlap: 0
# Correlation between QC'd z-score vectors
cat("QC'd z-score correlations (all variants):\n")## QC'd z-score correlations (all variants):
## DENTIST vs SLALOM+RAISS: 0.4917
## DENTIST vs DENTIST+RAISS: 0.9622
## SLALOM+RAISS vs DENTIST+RAISS: 0.5530
# Focus on DENTIST-flagged outliers (where imputation matters most)
if (sum(dentist_outlier_flags) > 1) {
d_out <- dentist_outlier_flags
cat("On DENTIST-flagged outlier variants:\n")
cat(sprintf(" DENTIST imputed vs RAISS imputed: r = %.4f\n",
cor(dentist_sw$imputed_z[d_out], dentist_raiss_z[d_out])))
cat(sprintf(" Mean |diff|: %.3f, Max |diff|: %.3f\n",
mean(abs(dentist_sw$imputed_z[d_out] - dentist_raiss_z[d_out])),
max(abs(dentist_sw$imputed_z[d_out] - dentist_raiss_z[d_out]))))
}## On DENTIST-flagged outlier variants:
## DENTIST imputed vs RAISS imputed: r = 0.7002
## Mean |diff|: 0.380, Max |diff|: 1.377
par(mfrow = c(1, 3))
# Panel 1: DENTIST QC'd z vs original
plot(z_sorted, dentist_qc_z, pch = 16, cex = 0.4, col = rgb(0, 0, 0, 0.3),
xlab = "Original z-score", ylab = "DENTIST QC'd z-score",
main = "DENTIST: original vs QC'd")
abline(0, 1, col = "red", lwd = 2)
if (sum(dentist_outlier_flags) > 0) {
points(z_sorted[dentist_outlier_flags], dentist_qc_z[dentist_outlier_flags],
pch = 16, cex = 0.6, col = "red")
legend("topleft", legend = c("Kept original", "Imputed (outlier)"),
col = c("black", "red"), pch = 16, bty = "n", cex = 0.9)
}
# Panel 2: DENTIST QC'd vs SLALOM+RAISS QC'd
plot(dentist_qc_z, slalom_raiss_z, pch = 16, cex = 0.4, col = rgb(0, 0, 0, 0.3),
xlab = "DENTIST QC'd z", ylab = "SLALOM+RAISS QC'd z",
main = "DENTIST vs SLALOM+RAISS (QC'd)")
abline(0, 1, col = "red", lwd = 2)
# Highlight DENTIST outliers (where DENTIST replaced z but SLALOM did not)
dentist_only <- dentist_outlier_flags & !slalom_outliers
if (sum(dentist_only) > 0) {
points(dentist_qc_z[dentist_only], slalom_raiss_z[dentist_only],
pch = 16, cex = 0.5, col = "orange")
}
if (sum(slalom_outliers) > 0) {
points(dentist_qc_z[slalom_outliers], slalom_raiss_z[slalom_outliers],
pch = 16, cex = 0.8, col = "purple")
}
legend("topleft",
legend = c("Agree", "DENTIST-only outlier", "SLALOM outlier"),
col = c("black", "orange", "purple"), pch = 16, bty = "n", cex = 0.8)
# Panel 3: DENTIST imputed vs RAISS imputed (on DENTIST outliers)
if (sum(dentist_outlier_flags) > 1) {
d_out <- dentist_outlier_flags
plot(dentist_sw$imputed_z[d_out], dentist_raiss_z[d_out],
pch = 16, cex = 0.6, col = "steelblue",
xlab = "DENTIST imputed z", ylab = "RAISS imputed z",
main = sprintf("Imputation comparison (n=%d outliers)",
sum(d_out)))
abline(0, 1, col = "red", lwd = 2)
legend("topleft",
legend = sprintf("r = %.3f",
cor(dentist_sw$imputed_z[d_out], dentist_raiss_z[d_out])),
bty = "n", cex = 1.2)
} else {
plot.new()
text(0.5, 0.5, "Too few outliers\nfor comparison", cex = 1.2)
}
Summary
The pecotmr package provides two complementary
approaches for GWAS summary statistics QC:
DENTIST-based (dentist_from_files /
dentist_single_window):
- Iterative SVD-truncated imputation across the full LD structure
- Flags outliers via chi-squared test on
(z_obs - z_imp)² / (1 - r²) - Produces imputed z-scores for all variants simultaneously
SLALOM + RAISS-based (slalom +
raiss):
- SLALOM identifies outliers via the DENTIST-S statistic relative to the lead variant
- RAISS imputes outlier z-scores using LD regression from non-outlier variants
- Provides per-variant imputation quality (R²) and LD scores for filtering
Both methods require an LD reference panel and perform allele QC as a preprocessing step. The choice between them depends on the analysis context: DENTIST is a self-contained QC tool, while SLALOM+RAISS integrates naturally with fine-mapping workflows where posterior inclusion probabilities and credible sets are also needed.
## R version 4.4.3 (2025-02-28)
## Platform: x86_64-conda-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS/LAPACK: /tmp/pixi/.pixi/envs/r44/lib/libopenblasp-r0.3.30.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] vroom_1.6.7 pecotmr_0.3.21
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.3 bigsnpr_1.12.21
## [4] dplyr_1.2.0 farver_2.1.2 viridis_0.6.5
## [7] fastmap_1.2.0 reshape_0.8.10 bigassertr_0.1.7
## [10] flock_0.7 digest_0.6.39 lifecycle_1.0.5
## [13] survival_3.8-6 magrittr_2.0.4 compiler_4.4.3
## [16] rlang_1.1.7 sass_0.4.10 rngtools_1.5.2
## [19] tools_4.4.3 yaml_2.3.12 data.table_1.17.8
## [22] knitr_1.51 doRNG_1.8.6.3 htmlwidgets_1.6.4
## [25] bit_4.6.0 plyr_1.8.9 RColorBrewer_1.1-3
## [28] rmio_0.4.0 coloc_5.2.3 withr_3.0.2
## [31] bigsparser_0.7.3 purrr_1.2.1 bigstatsr_1.6.2
## [34] BiocGenerics_0.52.0 desc_1.4.3 grid_4.4.3
## [37] stats4_4.4.3 susieR_0.15.51 future_1.69.0
## [40] ggplot2_3.5.2 MASS_7.3-65 globals_0.19.0
## [43] scales_1.4.0 iterators_1.0.14 cli_3.6.5
## [46] rmarkdown_2.30 crayon_1.5.3 ragg_1.5.0
## [49] generics_0.1.4 RcppParallel_5.1.11-1 otel_0.2.0
## [52] future.apply_1.20.2 tzdb_0.5.0 cachem_1.1.0
## [55] stringr_1.6.0 splines_4.4.3 zlibbioc_1.52.0
## [58] parallel_4.4.3 matrixStats_1.5.0 vctrs_0.7.1
## [61] Matrix_1.7-4 jsonlite_2.0.0 IRanges_2.40.0
## [64] hms_1.1.4 S4Vectors_0.44.0 mixsqp_0.3-54
## [67] bit64_4.6.0-1 irlba_2.3.7 listenv_0.10.0
## [70] systemfonts_1.3.1 foreach_1.5.2 tidyr_1.3.2
## [73] jquerylib_0.1.4 bigparallelr_0.3.2 snpStats_1.56.0
## [76] glue_1.8.0 parallelly_1.46.1 pkgdown_2.2.0
## [79] codetools_0.2-20 cowplot_1.2.0 stringi_1.8.7
## [82] gtable_0.3.6 doFuture_1.2.1 tibble_3.3.1
## [85] pillar_1.11.1 furrr_0.3.1 htmltools_0.5.9
## [88] R6_2.6.1 zigg_0.0.2 textshaping_1.0.4
## [91] doParallel_1.0.17 evaluate_1.0.5 lattice_0.22-9
## [94] readr_2.2.0 Rfast_2.1.5.2 bslib_0.10.0
## [97] Rcpp_1.1.1 gridExtra_2.3 xfun_0.56
## [100] fs_1.6.6 pkgconfig_2.0.3