Skip to contents

Overview

DENTIST (Detecting Errors iN analyses of summary staTISTics) performs quality control of GWAS summary statistics by iteratively imputing z-scores using linkage disequilibrium (LD) from a reference panel. Variants whose observed z-scores deviate significantly from their LD-imputed values are flagged as outliers.

The pecotmr package provides a pure R implementation of DENTIST that can be used in two ways:

  1. File-based input via dentist_from_files() — takes PLINK binary files and a summary statistics file, matching the interface of the original DENTIST C++ binary.
  2. Direct input via dentist_single_window() — takes z-scores and an LD matrix (or genotype matrix) directly, suitable for integration into analysis pipelines.

We also demonstrate how the alternative SLALOM + RAISS pipeline achieves comparable QC and imputation on the same data, giving users two complementary approaches for summary statistics quality control.

Toy Data

We use a toy dataset on chromosome 22 distributed with the package under inst/extdata/. The full dataset has 17,421 variants and 165 reference panel individuals; we subset to a small region for fast demonstration.

# Locate package data
# system.file() finds files within installed packages;
# during development, use the inst/extdata path directly
extdata_dir <- system.file("extdata", package = "pecotmr")
if (extdata_dir == "" || !file.exists(file.path(extdata_dir, "toy_ref.bed"))) {
  # Fallback for development: look relative to the package source root
  extdata_dir <- file.path(getwd(), "..", "inst", "extdata")
  if (!file.exists(file.path(extdata_dir, "toy_ref.bed"))) {
    extdata_dir <- file.path(getwd(), "inst", "extdata")
  }
}
bfile <- file.path(extdata_dir, "toy_ref")
sumstat_gz <- file.path(extdata_dir, "toy_summary.txt.gz")

# Load summary statistics (DENTIST format: SNP A1 A2 freq beta se p N)
sumstat <- as.data.frame(vroom(sumstat_gz, show_col_types = FALSE))
cat(sprintf("Full dataset: %d variants\n", nrow(sumstat)))
## Full dataset: 17421 variants
head(sumstat, 3)
##          SNP A1 A2   freq      beta       se         p     N
## 1 rs11089128  G  A 0.0804 -0.008047 0.014364 0.5753294 10000
## 2  rs7288972  C  T 0.1741 -0.025328 0.028768 0.3786304 10000
## 3 rs11167319  G  T 0.1339 -0.001455 0.023300 0.9502074 10000
# Load reference panel variant info
bim <- as.data.frame(vroom(paste0(bfile, ".bim"), col_names = c("chrom", "variant_id", "gd", "pos", "A1", "A2"),
                           show_col_types = FALSE))

# Select ~500 variants from a contiguous region for fast execution
region_start <- bim$pos[1]
region_end   <- region_start + 2e6
region_snps  <- bim$variant_id[bim$pos >= region_start & bim$pos <= region_end]
cat(sprintf("Selected %d variants in %.1f Mb region (pos %s - %s)\n",
            length(region_snps),
            (region_end - region_start) / 1e6,
            format(region_start, big.mark = ","),
            format(region_end, big.mark = ",")))
## Selected 608 variants in 2.0 Mb region (pos 14,560,203 - 16,560,203)

Case 1: File-Based DENTIST

The dentist_from_files() function mirrors the interface of the DENTIST C++ binary. It takes a summary statistics file path and a PLINK bfile prefix, then internally handles allele alignment, LD computation, and windowed imputation.

# Subset summary statistics to the selected region (in R, no plink needed)
sumstat_sub <- sumstat[sumstat$SNP %in% region_snps, ]
sub_sumstat <- tempfile(fileext = ".txt")
readr::write_tsv(sumstat_sub, sub_sumstat)
cat(sprintf("Subset: %d variants in summary stats\n", nrow(sumstat_sub)))
## Subset: 608 variants in summary stats
# Run DENTIST via file-based interface
# dentist_from_files() handles matching between summary stats and bfile internally
# nSample = NULL auto-detects reference panel size from .fam file
result_files <- dentist_from_files(
  gwas_summary = sub_sumstat,
  bfile        = bfile,
  nSample      = NULL,
  window_size  = 2000000,
  nIter        = 8,
  pValueThreshold = 5.0369e-08,
  propSVD      = 0.4,
  duprThreshold = 0.99,
  verbose      = TRUE
)
## Reading summary statistics...
##   608 variants read
## Reading reference panel bim...
## Rows: 17421 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (3): X2, X5, X6
## dbl (3): X1, X3, X4
## 
##  Use `spec()` to retrieve the full column specification for this data.
##  Specify the column types or set `show_col_types = FALSE` to quiet this message.
##   17421 variants in reference panel
## 
##   608 SNPs in common
## 
## Aligning alleles via allele_qc()...
## 
##   608 variants after allele QC (from 608 common)
## 
## Loading genotype data...
## 
## Using reference panel sample size: N = 165
## 
## Computing LD for 608 SNPs from 165 samples (291 missing genotypes)...
## 
## Running R DENTIST implementation...
## 
## 98 duplicated variants out of a total of 608 were found at r threshold of 0.995
## 
## Done: 608 variants tested, 119 outliers detected (19.6%)
dentist_result <- result_files$result
cat(sprintf("\nDENTIST result: %d variants, %d outliers (%.1f%%)\n",
            nrow(dentist_result),
            sum(dentist_result$outlier),
            100 * mean(dentist_result$outlier)))
## 
## DENTIST result: 608 variants, 119 outliers (19.6%)
head(dentist_result[, c("SNP", "original_z", "imputed_z", "rsq", "outlier", "outlier_stat")])
##          SNP  original_z   imputed_z       rsq outlier outlier_stat
## 1 rs11089128 -0.56021999 -0.26562145 0.8154480   FALSE    0.4702648
## 2  rs7288972 -0.88042269 -0.45830626 0.5456049   FALSE    0.3921307
## 3 rs11167319 -0.06244635 -0.37445027 0.4994233   FALSE    0.1944686
## 4  rs8138488  0.23059201  0.01738338 0.6515071   FALSE    0.1304415
## 5  rs2186521  0.37786517 -0.23479057 0.4656640   FALSE    0.7024551
## 6  rs2027649 -0.19730337 -0.47009297 0.7900872   FALSE    0.3545005

The dentist_from_files() return value is a list with three components:

  • result — data frame with outlier detection results (one row per input variant)
  • sum_stat — aligned summary statistics after allele QC
  • LD_mat — the LD correlation matrix used for imputation

Case 2: Single-Window DENTIST vs SLALOM + RAISS

For pipeline integration, dentist_single_window() accepts z-scores and an LD matrix directly. We compare this with the SLALOM + RAISS pipeline, which uses a different approach to achieve the same goal: identifying problematic variants and imputing corrected z-scores.

DENTIST performs iterative SVD-truncated imputation: it imputes all z-scores from LD, flags outliers based on the discrepancy statistic (z_obs - z_imp)² / (1 - r²), removes flagged variants, and repeats.

SLALOM computes the DENTIST-S statistic relative to the lead variant to flag outliers among variants in high LD. RAISS then imputes z-scores for a given set of variants using LD-based regression from the remaining variants.

Prepare aligned data

# Use the aligned data from dentist_from_files for consistency
# This ensures allele QC has already been applied
aligned_sumstat <- result_files$sum_stat
LD_mat <- result_files$LD_mat
n_samples <- nrow(vroom(paste0(bfile, ".fam"), col_names = FALSE, show_col_types = FALSE))

z_scores <- aligned_sumstat$z
n_snps <- length(z_scores)
cat(sprintf("Aligned data: %d variants, %d reference samples\n", n_snps, n_samples))
## Aligned data: 608 variants, 165 reference samples
# Build a position-sorted variant info data frame for RAISS
bim_full <- as.data.frame(vroom(paste0(bfile, ".bim"), col_names = c("chrom", "variant_id", "gd", "pos", "A1", "A2"),
                              show_col_types = FALSE))
bim_aligned <- bim_full[match(aligned_sumstat$SNP, bim_full$variant_id), ]

# Sort everything by position (required by RAISS)
pos_order <- order(bim_aligned$pos)
ref_panel_df <- data.frame(
  chrom      = bim_aligned$chrom[pos_order],
  pos        = bim_aligned$pos[pos_order],
  variant_id = bim_aligned$variant_id[pos_order],
  A1         = bim_aligned$A1[pos_order],
  A2         = bim_aligned$A2[pos_order],
  stringsAsFactors = FALSE
)
z_sorted <- z_scores[pos_order]
LD_sorted <- LD_mat[pos_order, pos_order]

Run DENTIST (single-window)

dentist_sw <- dentist_single_window(
  zScore   = z_sorted,
  R        = LD_sorted,
  nSample  = n_samples,
  pValueThreshold = 5.0369e-08,
  propSVD  = 0.4,
  nIter    = 8,
  duprThreshold = 0.99
)
## 98 duplicated variants out of a total of 608 were found at r threshold of 0.995
cat(sprintf("DENTIST single-window: %d variants, %d outliers (%.1f%%)\n",
            nrow(dentist_sw),
            sum(dentist_sw$outlier),
            100 * mean(dentist_sw$outlier)))
## DENTIST single-window: 608 variants, 119 outliers (19.6%)

Run SLALOM (QC)

SLALOM uses Approximate Bayes Factors for fine-mapping and the DENTIST-S statistic for outlier detection. It identifies variants whose z-scores are inconsistent with the lead variant’s LD pattern. Note that SLALOM focuses on variants in high LD with the lead variant (r² > threshold), so it typically flags fewer outliers than DENTIST’s genome-wide iterative approach.

slalom_result <- slalom(
  zScore = z_sorted,
  R      = LD_sorted,
  nlog10p_dentist_s_threshold = 4.0,
  r2_threshold = 0.6
)

# Replace NA outlier flags with FALSE (NAs arise from division by zero
# when a variant is in perfect LD with the lead variant)
slalom_outliers <- slalom_result$data$outliers
slalom_outliers[is.na(slalom_outliers)] <- FALSE
cat(sprintf("SLALOM: %d outliers out of %d variants (%.1f%%)\n",
            sum(slalom_outliers),
            length(slalom_outliers),
            100 * mean(slalom_outliers)))
## SLALOM: 1 outliers out of 608 variants (0.2%)
cat(sprintf("SLALOM lead variant PIP: %.4f\n", slalom_result$summary$max_pip))
## SLALOM lead variant PIP: 0.0047
cat(sprintf("95%% credible set size: %d\n", length(slalom_result$summary$cs_95)))
## 95% credible set size: 577

Run RAISS on SLALOM-flagged outliers

RAISS imputes z-scores for “unknown” variants using LD with “known” variants. First, we use SLALOM’s outlier flags to define which variants to re-impute — this is the natural SLALOM+RAISS pipeline.

# Helper: run RAISS given a set of outlier flags
run_raiss_on_outliers <- function(outlier_flags, z_vec, ref_df, LD_mat) {
  known_idx <- which(!outlier_flags)
  known_df <- data.frame(
    chrom      = ref_df$chrom[known_idx],
    pos        = ref_df$pos[known_idx],
    variant_id = ref_df$variant_id[known_idx],
    A1         = ref_df$A1[known_idx],
    A2         = ref_df$A2[known_idx],
    z          = z_vec[known_idx],
    stringsAsFactors = FALSE
  )
  result <- raiss(
    ref_panel     = ref_df,
    known_zscores = known_df,
    LD_matrix     = LD_mat,
    lamb = 0.01, rcond = 0.01,
    R2_threshold = 0.6, minimum_ld = 5, verbose = FALSE
  )
  # Build combined z-score vector
  combined_z <- z_vec
  if (!is.null(result) && !is.null(result$result_nofilter)) {
    raiss_df <- result$result_nofilter
    for (i in which(outlier_flags)) {
      vid <- ref_df$variant_id[i]
      ri <- which(raiss_df$variant_id == vid)
      if (length(ri) == 1 && !is.na(raiss_df$z[ri])) {
        combined_z[i] <- raiss_df$z[ri]
      }
    }
  }
  return(list(z = combined_z, raiss_result = result))
}

cat(sprintf("SLALOM+RAISS: using %d SLALOM outliers as variants to impute\n",
            sum(slalom_outliers)))
## SLALOM+RAISS: using 1 SLALOM outliers as variants to impute
slalom_raiss <- run_raiss_on_outliers(slalom_outliers, z_sorted, ref_panel_df, LD_sorted)

Run RAISS on DENTIST-flagged outliers

Next, we use DENTIST’s outlier flags with RAISS imputation. This lets us directly compare how DENTIST and RAISS impute the same set of variants.

dentist_outlier_flags <- dentist_sw$outlier
cat(sprintf("DENTIST+RAISS: using %d DENTIST outliers as variants to impute\n",
            sum(dentist_outlier_flags)))
## DENTIST+RAISS: using 119 DENTIST outliers as variants to impute
dentist_raiss <- run_raiss_on_outliers(dentist_outlier_flags, z_sorted, ref_panel_df, LD_sorted)

Compare All Three Approaches

We now have three sets of z-scores:

  1. DENTIST: iterative SVD-truncated imputation (imputes all variants)
  2. SLALOM+RAISS: SLALOM flags outliers, RAISS imputes those
  3. DENTIST+RAISS: DENTIST flags outliers, RAISS imputes those
# Build "QC'd z-scores" for each pipeline:
# Each keeps original z for non-outliers, replaces outlier z with imputed values

# DENTIST QC'd: original z for non-outliers, DENTIST imputed z for outliers
dentist_qc_z <- z_sorted
dentist_qc_z[dentist_outlier_flags] <- dentist_sw$imputed_z[dentist_outlier_flags]

# SLALOM+RAISS QC'd: original z for non-outliers, RAISS imputed z for SLALOM outliers
slalom_raiss_z <- slalom_raiss$z

# DENTIST-flags+RAISS QC'd: original z for non-outliers, RAISS imputed z for DENTIST outliers
dentist_raiss_z <- dentist_raiss$z

cat("=== Three-Way Comparison ===\n")
## === Three-Way Comparison ===
cat(sprintf("Total variants: %d\n\n", n_snps))
## Total variants: 608
# Outlier counts
cat("Outlier detection:\n")
## Outlier detection:
cat(sprintf("  DENTIST outliers: %d (%.1f%%)\n",
            sum(dentist_outlier_flags), 100 * mean(dentist_outlier_flags)))
##   DENTIST outliers: 119 (19.6%)
cat(sprintf("  SLALOM outliers:  %d (%.1f%%)\n",
            sum(slalom_outliers), 100 * mean(slalom_outliers)))
##   SLALOM outliers:  1 (0.2%)
overlap <- sum(dentist_outlier_flags & slalom_outliers)
cat(sprintf("  Overlap:          %d\n\n", overlap))
##   Overlap:          0
# Correlation between QC'd z-score vectors
cat("QC'd z-score correlations (all variants):\n")
## QC'd z-score correlations (all variants):
cat(sprintf("  DENTIST vs SLALOM+RAISS:       %.4f\n", cor(dentist_qc_z, slalom_raiss_z)))
##   DENTIST vs SLALOM+RAISS:       0.4917
cat(sprintf("  DENTIST vs DENTIST+RAISS:      %.4f\n", cor(dentist_qc_z, dentist_raiss_z)))
##   DENTIST vs DENTIST+RAISS:      0.9622
cat(sprintf("  SLALOM+RAISS vs DENTIST+RAISS: %.4f\n\n",
            cor(slalom_raiss_z, dentist_raiss_z)))
##   SLALOM+RAISS vs DENTIST+RAISS: 0.5530
# Focus on DENTIST-flagged outliers (where imputation matters most)
if (sum(dentist_outlier_flags) > 1) {
  d_out <- dentist_outlier_flags
  cat("On DENTIST-flagged outlier variants:\n")
  cat(sprintf("  DENTIST imputed vs RAISS imputed: r = %.4f\n",
              cor(dentist_sw$imputed_z[d_out], dentist_raiss_z[d_out])))
  cat(sprintf("  Mean |diff|: %.3f, Max |diff|: %.3f\n",
              mean(abs(dentist_sw$imputed_z[d_out] - dentist_raiss_z[d_out])),
              max(abs(dentist_sw$imputed_z[d_out] - dentist_raiss_z[d_out]))))
}
## On DENTIST-flagged outlier variants:
##   DENTIST imputed vs RAISS imputed: r = 0.7002
##   Mean |diff|: 0.380, Max |diff|: 1.377
par(mfrow = c(1, 3))

# Panel 1: DENTIST QC'd z vs original
plot(z_sorted, dentist_qc_z, pch = 16, cex = 0.4, col = rgb(0, 0, 0, 0.3),
     xlab = "Original z-score", ylab = "DENTIST QC'd z-score",
     main = "DENTIST: original vs QC'd")
abline(0, 1, col = "red", lwd = 2)
if (sum(dentist_outlier_flags) > 0) {
  points(z_sorted[dentist_outlier_flags], dentist_qc_z[dentist_outlier_flags],
         pch = 16, cex = 0.6, col = "red")
  legend("topleft", legend = c("Kept original", "Imputed (outlier)"),
         col = c("black", "red"), pch = 16, bty = "n", cex = 0.9)
}

# Panel 2: DENTIST QC'd vs SLALOM+RAISS QC'd
plot(dentist_qc_z, slalom_raiss_z, pch = 16, cex = 0.4, col = rgb(0, 0, 0, 0.3),
     xlab = "DENTIST QC'd z", ylab = "SLALOM+RAISS QC'd z",
     main = "DENTIST vs SLALOM+RAISS (QC'd)")
abline(0, 1, col = "red", lwd = 2)
# Highlight DENTIST outliers (where DENTIST replaced z but SLALOM did not)
dentist_only <- dentist_outlier_flags & !slalom_outliers
if (sum(dentist_only) > 0) {
  points(dentist_qc_z[dentist_only], slalom_raiss_z[dentist_only],
         pch = 16, cex = 0.5, col = "orange")
}
if (sum(slalom_outliers) > 0) {
  points(dentist_qc_z[slalom_outliers], slalom_raiss_z[slalom_outliers],
         pch = 16, cex = 0.8, col = "purple")
}
legend("topleft",
       legend = c("Agree", "DENTIST-only outlier", "SLALOM outlier"),
       col = c("black", "orange", "purple"), pch = 16, bty = "n", cex = 0.8)

# Panel 3: DENTIST imputed vs RAISS imputed (on DENTIST outliers)
if (sum(dentist_outlier_flags) > 1) {
  d_out <- dentist_outlier_flags
  plot(dentist_sw$imputed_z[d_out], dentist_raiss_z[d_out],
       pch = 16, cex = 0.6, col = "steelblue",
       xlab = "DENTIST imputed z", ylab = "RAISS imputed z",
       main = sprintf("Imputation comparison (n=%d outliers)",
                       sum(d_out)))
  abline(0, 1, col = "red", lwd = 2)
  legend("topleft",
         legend = sprintf("r = %.3f",
                          cor(dentist_sw$imputed_z[d_out], dentist_raiss_z[d_out])),
         bty = "n", cex = 1.2)
} else {
  plot.new()
  text(0.5, 0.5, "Too few outliers\nfor comparison", cex = 1.2)
}

Summary

The pecotmr package provides two complementary approaches for GWAS summary statistics QC:

DENTIST-based (dentist_from_files / dentist_single_window):

  • Iterative SVD-truncated imputation across the full LD structure
  • Flags outliers via chi-squared test on (z_obs - z_imp)² / (1 - r²)
  • Produces imputed z-scores for all variants simultaneously

SLALOM + RAISS-based (slalom + raiss):

  • SLALOM identifies outliers via the DENTIST-S statistic relative to the lead variant
  • RAISS imputes outlier z-scores using LD regression from non-outlier variants
  • Provides per-variant imputation quality (R²) and LD scores for filtering

Both methods require an LD reference panel and perform allele QC as a preprocessing step. The choice between them depends on the analysis context: DENTIST is a self-contained QC tool, while SLALOM+RAISS integrates naturally with fine-mapping workflows where posterior inclusion probabilities and credible sets are also needed.

## R version 4.4.3 (2025-02-28)
## Platform: x86_64-conda-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /tmp/pixi/.pixi/envs/r44/lib/libopenblasp-r0.3.30.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] vroom_1.6.7    pecotmr_0.3.21
## 
## loaded via a namespace (and not attached):
##   [1] tidyselect_1.2.1      viridisLite_0.4.3     bigsnpr_1.12.21      
##   [4] dplyr_1.2.0           farver_2.1.2          viridis_0.6.5        
##   [7] fastmap_1.2.0         reshape_0.8.10        bigassertr_0.1.7     
##  [10] flock_0.7             digest_0.6.39         lifecycle_1.0.5      
##  [13] survival_3.8-6        magrittr_2.0.4        compiler_4.4.3       
##  [16] rlang_1.1.7           sass_0.4.10           rngtools_1.5.2       
##  [19] tools_4.4.3           yaml_2.3.12           data.table_1.17.8    
##  [22] knitr_1.51            doRNG_1.8.6.3         htmlwidgets_1.6.4    
##  [25] bit_4.6.0             plyr_1.8.9            RColorBrewer_1.1-3   
##  [28] rmio_0.4.0            coloc_5.2.3           withr_3.0.2          
##  [31] bigsparser_0.7.3      purrr_1.2.1           bigstatsr_1.6.2      
##  [34] BiocGenerics_0.52.0   desc_1.4.3            grid_4.4.3           
##  [37] stats4_4.4.3          susieR_0.15.51        future_1.69.0        
##  [40] ggplot2_3.5.2         MASS_7.3-65           globals_0.19.0       
##  [43] scales_1.4.0          iterators_1.0.14      cli_3.6.5            
##  [46] rmarkdown_2.30        crayon_1.5.3          ragg_1.5.0           
##  [49] generics_0.1.4        RcppParallel_5.1.11-1 otel_0.2.0           
##  [52] future.apply_1.20.2   tzdb_0.5.0            cachem_1.1.0         
##  [55] stringr_1.6.0         splines_4.4.3         zlibbioc_1.52.0      
##  [58] parallel_4.4.3        matrixStats_1.5.0     vctrs_0.7.1          
##  [61] Matrix_1.7-4          jsonlite_2.0.0        IRanges_2.40.0       
##  [64] hms_1.1.4             S4Vectors_0.44.0      mixsqp_0.3-54        
##  [67] bit64_4.6.0-1         irlba_2.3.7           listenv_0.10.0       
##  [70] systemfonts_1.3.1     foreach_1.5.2         tidyr_1.3.2          
##  [73] jquerylib_0.1.4       bigparallelr_0.3.2    snpStats_1.56.0      
##  [76] glue_1.8.0            parallelly_1.46.1     pkgdown_2.2.0        
##  [79] codetools_0.2-20      cowplot_1.2.0         stringi_1.8.7        
##  [82] gtable_0.3.6          doFuture_1.2.1        tibble_3.3.1         
##  [85] pillar_1.11.1         furrr_0.3.1           htmltools_0.5.9      
##  [88] R6_2.6.1              zigg_0.0.2            textshaping_1.0.4    
##  [91] doParallel_1.0.17     evaluate_1.0.5        lattice_0.22-9       
##  [94] readr_2.2.0           Rfast_2.1.5.2         bslib_0.10.0         
##  [97] Rcpp_1.1.1            gridExtra_2.3         xfun_0.56            
## [100] fs_1.6.6              pkgconfig_2.0.3