Quality Control for GWAS and QTL Summary Statistics

Overview

Regression-with-summary-statistics (RSS) methods, including both fine-mapping and regularized regression, require matching the variants in a summary-statistics table to an LD reference panel. Two problems are frequently encountered in practice:

Allele alignment. The summary statistics and the LD panel may swap the reference and alternate alleles, encode strand-ambiguous variants inconsistently, or contain indels and duplicates that need filtering.
LD-summary-statistic mismatch. Even when alleles align, individual z-scores can be inconsistent with the LD pattern around them because of imputation errors, ancestry differences between the GWAS and LD panel, or per-variant sample-size differences.
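
The allele-alignment checks described above can be illustrated with a minimal base-R sketch. This is not the pecotmr implementation: the toy data frames, the column names A1, A2 and z, and the choice to drop indels are assumptions made for illustration, and strand flips (complemented alleles) are deliberately not handled.

```r
# Toy summary statistics and LD-panel alleles (hypothetical data).
sumstats <- data.frame(
  variant = c("v1", "v2", "v3", "v4"),
  A1 = c("A", "G", "A", "C"),    # effect allele
  A2 = c("G", "A", "T", "AT"),   # other allele
  z  = c(2.1, -3.4, 1.0, 0.5),
  stringsAsFactors = FALSE)
panel <- data.frame(
  variant = c("v1", "v2", "v3", "v4"),
  A1 = c("A", "A", "A", "C"),
  A2 = c("G", "G", "T", "AT"),
  stringsAsFactors = FALSE)

# Flag indels (any allele longer than one base) and strand-ambiguous SNPs.
is_indel     <- nchar(sumstats$A1) != 1 | nchar(sumstats$A2) != 1
is_ambiguous <- paste0(sumstats$A1, sumstats$A2) %in% c("AT", "TA", "CG", "GC")

# Compare allele order against the panel.
same    <- sumstats$A1 == panel$A1 & sumstats$A2 == panel$A2
swapped <- sumstats$A1 == panel$A2 & sumstats$A2 == panel$A1

# Swapped alleles: adopt the panel's order and flip the sign of z.
aligned <- sumstats
aligned$z[swapped]  <- -aligned$z[swapped]
aligned$A1[swapped] <- panel$A1[swapped]
aligned$A2[swapped] <- panel$A2[swapped]
aligned$keep <- (same | swapped) & !is_ambiguous & !is_indel

aligned[aligned$keep, c("variant", "A1", "A2", "z")]
```

In this toy input v2 has its alleles swapped relative to the panel, so its z-score changes sign; v3 (A/T) is strand-ambiguous and v4 is an indel, so both are dropped.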

The summaryStatsQc() function provides a convenient interface for running the QC tasks that address these issues. It takes a GwasSumStats or QtlSumStats object (see the Building QtlSumStats and GwasSumStats objects vignette) and returns a new collection of the same class with updated entries and a populated qcInfo slot. Downstream pipelines require qcInfo to be populated, so QC is mandatory before fine-mapping, TWAS-weight learning, colocalization, and similar analyses.

library(pecotmr)

Running QC

summaryStatsQc() operates on a GwasSumStats or QtlSumStats object. See the Building QtlSumStats and GwasSumStats objects vignette for details on how to create these objects. In this example the input is a GwasSumStats object called gws, but the interface is the same for QtlSumStats.

summaryStatsQc() provides many options for the user to tweak:

gws_qcd <- summaryStatsQc(
  gws,
  useDbsnpRefCheck      = FALSE,   # opt-in to MungeSumstats dbSNP checks
  removeIndels          = FALSE,
  removeStrandAmbiguous = TRUE,
  mafCutoff             = 0,       # 0 = off; > 0 requires MAF column
  infoCutoff            = 0,       # 0 = off; > 0 requires INFO column
  nCutoff               = 5,
  skipRegion            = NULL,    # e.g. MHC "chr6:25000000-35000000"
  zMismatchQc           = "slalom", # "none", "slalom", "dentist"
  alleleFlipKriging     = FALSE,    # opt-in pre-SLALOM LD-consistency check
  impute                = FALSE,    # RAISS imputation against ldSketch
  matchMinProp          = 0)

gws_qcd is a new GwasSumStats object with cleaned entries and a populated qcInfo slot. The per-entry audit record is accessible via getQcInfo():

qc <- getQcInfo(gws_qcd)
str(qc$entryAudit[[1]])    # per-entry counts: variantsIn, mungeSumstatsDropped,
                           # keepVariantsDropped, skipRegionDropped, matchedAgainstSketch,
                           # krigingOutliersDropped, ldMismatchOutliersDropped, ...
qc$options                 # the curated knobs you passed (frozen for provenance)

A non-zero mafCutoff requires every entry to carry a MAF mcol; a non-zero infoCutoff requires INFO; a non-zero nCutoff requires N. A missing column combined with a non-zero cutoff produces an error. If you don’t want a filter to run, leave its cutoff at 0, which turns the filter off.
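
The column requirement above can be checked before calling the QC function. The helper below is hypothetical (it is not part of pecotmr) and only encodes the rule stated in the text: a cutoff greater than 0 needs its matching column.

```r
# Hypothetical pre-flight check: which non-zero cutoffs lack their column?
missing_qc_columns <- function(cols, mafCutoff = 0, infoCutoff = 0, nCutoff = 0) {
  required <- c(MAF = mafCutoff > 0, INFO = infoCutoff > 0, N = nCutoff > 0)
  setdiff(names(required)[required], cols)
}

# Toy column names: N is present, MAF is not, INFO is not requested.
missing_cols <- missing_qc_columns(c("SNP", "Z", "N"), mafCutoff = 0.01, nCutoff = 5)
missing_cols
```

An empty result means every requested filter has the column it needs.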

Allele-flip QC: the SuSiE kriging prefilter

alleleFlipKriging = TRUE enables an optional prefilter that runs before the SLALOM / DENTIST z-mismatch step. Where SLALOM and DENTIST look for z-scores that are broadly inconsistent with the LD pattern, the kriging prefilter specifically targets allele-flip / LD-mismatch outliers — variants whose observed z-score disagrees with the value the LD structure predicts for them from their neighbours.

It uses susieR’s kriging RSS diagnostic:

susieR::estimate_s_rss() estimates the overall LD-mismatch scale s between the harmonized z-scores and the reference LD matrix R.
susieR::kriging_rss(z, R, n, s) computes, for each variant, the leave-one-out conditional distribution of z_i given all the other z-scores under the LD model — i.e. the z-score the LD panel predicts for that variant.
The standardized residual between the observed and predicted z-score (z_std_diff) is approximately N(0, 1) when the z-scores and LD panel agree. A large residual flags an allele-flip / LD-mismatch outlier; variants whose two-sided p-value falls below the threshold (default 5e-8) are dropped. The number removed is recorded in the qcInfo audit as krigingOutliersDropped.
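
The leave-one-out calculation described above can be sketched in base R. This is a simplified illustration rather than the susieR or pecotmr code: the function name kriging_sketch is hypothetical, the mismatch scale s is supplied by hand instead of being estimated, and the LD matrix and z-scores are simulated.

```r
# Leave-one-out conditional z-scores under z ~ N(0, (1 - s) * R + s * I).
# A base-R sketch; s is fixed here rather than estimated from the data.
kriging_sketch <- function(z, R, s = 0) {
  p <- length(z)
  precision <- solve((1 - s) * R + s * diag(p))
  condvar <- 1 / diag(precision)
  # E[z_i | z_-i] = z_i - (precision %*% z)_i / precision_ii
  condmean <- z - as.numeric(precision %*% z) * condvar
  z_std_diff <- (z - condmean) / sqrt(condvar)
  data.frame(z = z, condmean = condmean, condvar = condvar,
             z_std_diff = z_std_diff,
             p_value = 2 * pnorm(-abs(z_std_diff)))
}

set.seed(1)
p <- 50
R <- 0.9^abs(outer(seq_len(p), seq_len(p), "-"))         # AR(1)-style toy LD matrix
z <- as.numeric(10 * R[, 25] + t(chol(R)) %*% rnorm(p))  # signal centred on variant 25
z[26] <- -z[26]                                          # simulate an allele flip

kr <- kriging_sketch(z, R)
which.max(abs(kr$z_std_diff))   # index of the most inconsistent variant
kr$p_value[26] < 5e-8           # compared against the default threshold from the text
```

In this toy region the immediate neighbours of the flipped variant can also show inflated residuals, because their predictions are computed from the flipped z-score.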

Because it needs a per-region LD matrix and the kriging diagnostic, this is an RSS-only check: it draws R from the collection’s ldSketch and requires a susieR that exports kriging_rss() / estimate_s_rss() (otherwise it errors — which is why it is off by default).

The same diagnostic is exposed standalone as krigingOutlierQc() when you want to inspect the per-variant predicted z, residual, statistic, and p-value directly:

kr <- krigingOutlierQc(zScore = z, R = ld, n = 10000)
kr$outlier              # logical vector: TRUE = flagged as an outlier
head(kr$diagnostics)    # variant_id, z, predicted, residual, statistic,
                        # p_value, outlier

Kriging and SLALOM/DENTIST are complementary and can be combined: the kriging prefilter removes allele-flip outliers up front, then zMismatchQc handles the remaining LD-mismatch variants.

LD mismatch correction with SLALOM and DENTIST

summaryStatsQc(zMismatchQc = ...) dispatches to one of two LD-mismatch detectors. They differ in how they decide whether a variant’s z-score is inconsistent with the LD pattern around it.

SLALOM (Kanai et al. 2022, the recommended default once you turn zMismatchQc on):

For each variant, compares its z-score against the lead variant’s z-score scaled by their LD: a large difference z_i − r_{i,lead} · z_lead flags the variant as inconsistent.
Restricted to variants in high LD with the lead (r2Threshold).
Stable: results do not depend on window-size parameters.
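
The lead-variant comparison can be sketched in base R as follows. This is an illustration under stated assumptions, not the slalom() implementation: the function name slalom_sketch is hypothetical, the lead is taken to be the variant with the largest |z|, the difference is scaled by 1 − r² and compared to a chi-square distribution with 1 degree of freedom, and both thresholds are illustrative.

```r
# Lead-variant consistency check (base-R sketch).
# Assumptions: lead = largest |z|; thresholds are illustrative only.
slalom_sketch <- function(z, R, r2_threshold = 0.6, p_threshold = 1e-4) {
  lead <- which.max(abs(z))
  r <- R[, lead]
  stat <- (z - r * z[lead])^2 / (1 - r^2)
  stat[lead] <- NA                  # undefined for the lead itself (0 / 0)
  p_value <- pchisq(stat, df = 1, lower.tail = FALSE)
  high_ld <- r^2 > r2_threshold     # only variants in high LD with the lead
  data.frame(r2 = r^2, stat = stat, p_value = p_value,
             outlier = !is.na(p_value) & high_ld & p_value < p_threshold)
}

p <- 20
R <- 0.95^abs(outer(seq_len(p), seq_len(p), "-"))  # toy LD matrix
z <- 8 * R[, 10]                                   # z-scores consistent with lead 10
z[11] <- -z[11]                                    # one inconsistent variant

res <- slalom_sketch(z, R)
which(res$outlier)
```

Because the statistic only involves the lead variant and pairwise LD with it, no window parameter enters the calculation.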

DENTIST (Chen et al. 2021):

Iteratively imputes each z-score from all other z-scores via SVD-truncated LD regression, then flags large (z_obs − z_imp)^2 / (1 − r²) values as outliers.
Operates on windows of the chromosome; the window size noticeably affects how many outliers it reports.

Both methods are available directly as slalom() / dentist() / dentistSingleWindow() if you want to run them outside the summaryStatsQc() pipeline (typically to tune parameters or to inspect the per-variant intermediate output).

DENTIST combines imputation and LD mismatch correction

DENTIST’s outlier rate is sensitive to its window size. On a typical ∼17k-variant test region we observe:

min_dim (window size)	Approx. outlier rate
2,000	∼8%
6,000	∼1.5%
20,000	∼0.6%

For this reason SLALOM is the recommended default when you turn zMismatchQc on. Use DENTIST when you specifically want its iterative imputation output, or when you are reproducing a DENTIST-based pipeline.

Combining LD mismatch correction and imputation with SLALOM and RAISS

SLALOM only flags outliers; it does not impute them. To re-impute the flagged variants from the surviving ones, pair SLALOM with RAISS (an LD-regression imputer). summaryStatsQc(zMismatchQc = "slalom", impute = TRUE) does this for you. The standalone version is exposed as raiss(); the full set of RAISS knobs lives under the imputeOpts argument to summaryStatsQc():

gws_qcd <- summaryStatsQc(
  gws,
  zMismatchQc = "slalom",
  impute      = TRUE,
  imputeOpts  = list(rcond = 0.01, r2Threshold = 0.6,
                     minimumLd = 5, lamb = 0.01))
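
The idea behind LD-regression imputation can be sketched in base R. This is a simplified illustration, not the raiss() implementation: the function name impute_sketch is hypothetical, only the ridge term lamb is mirrored, and the rcond, r2Threshold and minimumLd options are omitted.

```r
# Ridge-regularized LD-regression imputation (base-R sketch).
impute_sketch <- function(z_typed, R, typed, untyped, lamb = 0.01) {
  R_tt <- R[typed, typed, drop = FALSE] + lamb * diag(length(typed))
  R_ut <- R[untyped, typed, drop = FALSE]
  W <- R_ut %*% solve(R_tt)               # imputation weights
  data.frame(index = untyped,
             z_imputed = as.numeric(W %*% z_typed),
             r2 = rowSums(W * R_ut))      # imputation quality, diag(W %*% t(R_ut))
}

p <- 20
R <- 0.95^abs(outer(seq_len(p), seq_len(p), "-"))  # toy LD matrix
z_true <- 8 * R[, 10]
untyped <- 11L                     # pretend variant 11 was flagged and removed
typed <- setdiff(seq_len(p), untyped)

imp <- impute_sketch(z_true[typed], R, typed, untyped)
imp
```

The r2 column gives a per-variant measure of how well the surviving variants predict the imputed one; a quality threshold such as r2Threshold would be applied to a value of this kind.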

Next steps

QtlSumStats and GwasSumStats objects are direct inputs to fineMappingPipeline(), twasWeightsPipeline() (QtlSumStats only), and colocBoostPipeline(). See the Fine-mapping with pecotmr, Learning TWAS weights with pecotmr, and Multi-trait colocalization with ColocBoost vignettes.

sessionInfo()

## R version 4.5.3 (2026-03-11)
## Platform: x86_64-conda-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /home/runner/work/pecotmr/pecotmr/.pixi/envs/default/lib/libopenblasp-r0.3.33.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] pecotmr_0.6.7
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1            dplyr_1.2.1                
##  [3] farver_2.1.2                Biostrings_2.78.0          
##  [5] S7_0.2.2                    bitops_1.0-9               
##  [7] fastmap_1.2.0               reshape_0.8.10             
##  [9] mathjaxr_2.0-0              digest_0.6.39              
## [11] lifecycle_1.0.5             magrittr_2.0.5             
## [13] compiler_4.5.3              rlang_1.3.0                
## [15] sass_0.4.10                 tools_4.5.3                
## [17] yaml_2.3.12                 knitr_1.51                 
## [19] S4Arrays_1.10.1             htmlwidgets_1.6.4          
## [21] bit_4.6.0                   DelayedArray_0.36.0        
## [23] plyr_1.8.9                  RColorBrewer_1.1-3         
## [25] abind_1.4-8                 BiocParallel_1.44.0        
## [27] purrr_1.2.2                 numDeriv_2016.8-1.1        
## [29] BiocGenerics_0.56.0         desc_1.4.3                 
## [31] grid_4.5.3                  stats4_4.5.3               
## [33] susieR_0.16.4               ggplot2_4.0.3              
## [35] scales_1.4.0                MASS_7.3-66                
## [37] SummarizedExperiment_1.40.0 cli_3.6.6                  
## [39] rmarkdown_2.31              metafor_5.0-1              
## [41] crayon_1.5.3                ragg_1.5.2                 
## [43] generics_0.1.4              otel_0.2.0                 
## [45] RcppParallel_5.1.11-2       tzdb_0.5.0                 
## [47] cachem_1.1.0                stringr_1.6.0              
## [49] metadat_1.6-0               parallel_4.5.3             
## [51] XVector_0.50.0              matrixStats_1.5.0          
## [53] vctrs_0.7.3                 Matrix_1.7-5               
## [55] jsonlite_2.0.0              IRanges_2.44.0             
## [57] hms_1.1.4                   S4Vectors_0.48.0           
## [59] bit64_4.8.2                 mixsqp_0.3-54              
## [61] irlba_2.3.7                 systemfonts_1.3.2          
## [63] tidyr_1.3.2                 jquerylib_0.1.4            
## [65] glue_1.8.1                  pkgdown_2.2.1              
## [67] codetools_0.2-20            stringi_1.8.7              
## [69] gtable_0.3.6                GenomicRanges_1.62.1       
## [71] quadprog_1.5-8              tibble_3.3.1               
## [73] pillar_1.11.1               htmltools_0.5.9            
## [75] Seqinfo_1.0.0               R6_2.6.1                   
## [77] zigg_0.0.2                  textshaping_1.0.5          
## [79] vroom_1.7.1                 evaluate_1.0.5             
## [81] lattice_0.22-9              Biobase_2.70.0             
## [83] readr_2.2.0                 Rsamtools_2.26.0           
## [85] tictoc_1.2.1                Rfast_2.1.5.2              
## [87] bslib_0.11.0                Rcpp_1.1.2                 
## [89] SparseArray_1.10.8          nlme_3.1-170               
## [91] xfun_0.60                   fs_2.1.0                   
## [93] MatrixGenerics_1.22.0       pkgconfig_2.0.3

pecotmr authors