DENTIST (Detecting Errors iN analyses of summary staTISTics) is a quality control tool for GWAS summary data. It uses linkage disequilibrium (LD) information from a reference panel to identify and correct problematic variants by comparing observed GWAS statistics to predicted values. It can detect errors in genotyping/imputation, allelic errors, and heterogeneity between GWAS and LD reference samples.
Usage
dentist(
sum_stat,
R = NULL,
X = NULL,
nSample = NULL,
window_size = 2e+06,
window_mode = c("distance", "count"),
pValueThreshold = 5.0369e-08,
propSVD = 0.4,
gcControl = FALSE,
nIter = 10,
gPvalueThreshold = 0.05,
duprThreshold = 0.99,
ncpus = 1,
correct_chen_et_al_bug = TRUE,
min_dim = 2000
)
Arguments
- sum_stat
A data frame containing summary statistics, including 'pos' or 'position' and 'z' or 'zscore' columns.
- R
Square LD correlation matrix. Provide either R or X.
- X
Genotype matrix (samples x SNPs). If provided, LD is computed via compute_LD(X) and nSample defaults to nrow(X).
- nSample
The number of samples in the LD reference panel (NOT the GWAS sample size). This controls the SVD truncation rank K = min(idx_size, nSample) * propSVD. Required when R is provided; inferred from X when X is provided.
- window_size
The size of the window for dividing the genomic region in distance mode (base pairs). Default is 2000000 (2 Mb). Only used when window_mode = "distance".
- window_mode
Character string specifying the windowing strategy: "distance" (default) creates windows by physical distance using segment_by_dist (C++ --wind-dist), and "count" creates windows by variant count using segment_by_count (C++ --wind).
- pValueThreshold
The p-value threshold for significance. Default is 5.0369e-08.
- propSVD
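The proportion of singular components retained in the truncated singular value decomposition (SVD); the truncation rank is K = min(idx_size, nSample) * propSVD. Default is 0.4.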
- gcControl
Logical indicating whether genomic control should be applied. Default is FALSE.
- nIter
The number of iterations for the Dentist algorithm. Default is 10.
- gPvalueThreshold
The genomic p-value threshold for significance. Default is 0.05.
- duprThreshold
The absolute correlation r value threshold to be considered duplicate. Default is 0.99.
- ncpus
The number of CPU cores to use for parallel processing. Default is 1.
- correct_chen_et_al_bug
Logical indicating whether to correct the Chen et al. bug. Default is TRUE.
- min_dim
In distance mode: minimum number of SNPs per block (default 2000). In count mode: the number of variants per window (i.e., the window size).
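A minimal sketch of preparing the inputs described above, using simulated genotypes in place of a real LD reference panel. The simulated data, the seed, and the variable names are illustrative assumptions; idx_size is taken here to be the number of variants in a window, which is also an assumption. LD is computed with base R cor() rather than compute_LD(), and the call is guarded so the block runs even when dentist() is not attached.

```r
set.seed(1)

# Simulated reference panel: 200 samples x 50 SNPs coded 0/1/2 (illustrative only)
n_ref <- 200
n_snp <- 50
X <- matrix(rbinom(n_ref * n_snp, size = 2, prob = 0.3),
            nrow = n_ref, ncol = n_snp)

# Summary statistics with the column names the documentation asks for
sum_stat <- data.frame(
  pos = sort(sample.int(1e6, n_snp)),
  z   = rnorm(n_snp)
)

# LD correlation matrix from base R, as an alternative to passing X directly
R <- cor(X)

# SVD truncation rank as described under nSample, assuming idx_size is the
# number of variants in the window
propSVD  <- 0.4
idx_size <- ncol(R)
K <- min(idx_size, n_ref) * propSVD

# Guarded call: only runs when dentist() is available in the session
if (exists("dentist", mode = "function")) {
  res <- dentist(sum_stat, R = R, nSample = n_ref, window_mode = "distance")
}
```

A real analysis would involve far more variants than this toy panel; the point is only to show which objects go into which arguments and that nSample refers to the reference panel, not the GWAS.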
Value
A data frame containing the imputed result and detected outliers.
The returned data frame includes the following columns:
- original_z
The original z-score values from the input sum_stat.
- imputed_z
The imputed z-score values computed by the Dentist algorithm.
- rsq
The coefficient of determination (R-squared) between original and imputed z-scores.
- iter_to_correct
The number of iterations required to correct the z-scores, if applicable.
- index_within_window
The index of the observation within the window.
- index_global
The global index of the observation.
- outlier_stat
The computed statistical value based on the original and imputed z-scores and R-squared.
- outlier
A logical indicator specifying whether the observation is identified as an outlier based on the statistical test.
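As a sketch of how the returned data frame might be used downstream, the block below builds a small hypothetical result with some of the documented columns and splits it on the outlier flag. All values are made up for illustration and do not come from a real run.

```r
# Hypothetical result with a subset of the documented columns
# (values are made up for illustration)
res <- data.frame(
  original_z   = c(1.2, -0.4, 6.1, 0.3),
  imputed_z    = c(1.0, -0.5, 1.5, 0.2),
  rsq          = c(0.90, 0.85, 0.95, 0.80),
  outlier_stat = c(0.4, 0.1, 423.2, 0.05),
  outlier      = c(FALSE, FALSE, TRUE, FALSE)
)

# Variants flagged as outliers
flagged <- res[res$outlier, ]

# Summary statistics with flagged variants removed
clean <- res[!res$outlier, ]
```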
Details
Windowing supports two modes matching the original DENTIST C++ binary:
- "distance" (default): Uses the segmentingByDist algorithm (C++ --wind-dist), implemented in segment_by_dist. Windows span a fixed physical distance (window_size bp).
- "count": Uses the segmentedQCed algorithm (C++ --wind), implemented in segment_by_count. Windows contain a fixed number of variants (min_dim). Useful when regions have sparse variants where distance-based windows would create windows with too few variants.
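The difference between the two modes can be illustrated with a deliberately simplified sketch. The helpers toy_windows_by_distance and toy_windows_by_count are hypothetical and are not the segmentingByDist or segmentedQCed algorithms; they only show why fixed-distance windows can end up nearly empty in sparse regions while fixed-count windows do not.

```r
# Toy positions: a dense cluster followed by a sparse stretch
pos <- c(100, 200, 300, 400, 500, 600, 5000, 9000, 13000)

# Distance mode (simplified): assign each variant to a fixed-width bin
toy_windows_by_distance <- function(pos, window_size) {
  split(seq_along(pos), floor((pos - min(pos)) / window_size))
}

# Count mode (simplified): consecutive chunks of a fixed number of variants
toy_windows_by_count <- function(pos, n_per_window) {
  split(seq_along(pos), ceiling(seq_along(pos) / n_per_window))
}

by_dist  <- toy_windows_by_distance(pos, window_size = 4000)
by_count <- toy_windows_by_count(pos, n_per_window = 3)

lengths(by_dist)   # 6 1 1 1: sparse region yields single-variant windows
lengths(by_count)  # 3 3 3: every window holds the same number of variants
```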
The correct_chen_et_al_bug parameter affects the iterative filtering in two ways:
- Comparison between iteration index t and nIter (explained in source code)
- The !grouping_tmp operator bug (explained in source code)