DENTIST (Detecting Errors iN analyses of summary staTISTics) is a quality control tool for GWAS summary data. It uses linkage disequilibrium (LD) information from a reference panel to identify and correct problematic variants by comparing observed GWAS statistics to predicted values. It can detect errors in genotyping/imputation, allelic errors, and heterogeneity between GWAS and LD reference samples.
Usage
dentist(
sum_stat,
R = NULL,
X = NULL,
nSample = NULL,
window_size = 2e+06,
pValueThreshold = 5.0369e-08,
propSVD = 0.4,
gcControl = FALSE,
nIter = 10,
gPvalueThreshold = 0.05,
duprThreshold = 0.99,
ncpus = 1,
correct_chen_et_al_bug = TRUE,
min_dim = 2000
)
Arguments
- sum_stat
A data frame containing summary statistics, including 'pos' or 'position' and 'z' or 'zscore' columns.
- R
Square LD correlation matrix. Provide either R or X.
- X
Genotype matrix (samples x SNPs). If provided, LD is computed via compute_LD(X) and nSample defaults to nrow(X).
- nSample
The number of samples in the LD reference panel (NOT the GWAS sample size). This controls the SVD truncation rank K = min(idx_size, nSample) * propSVD. Required when R is provided; inferred from X when X is provided.
- window_size
The size of the window for dividing the genomic region. Default is 2000000.
- pValueThreshold
The p-value threshold for significance. Default is 5.0369e-8.
- propSVD
The proportion of singular components to retain in the truncated singular value decomposition (SVD); together with nSample it sets the truncation rank K. Default is 0.4.
- gcControl
Logical indicating whether genomic control should be applied. Default is FALSE.
- nIter
The number of iterations for the Dentist algorithm. Default is 10.
- gPvalueThreshold
The genomic p-value threshold for significance. Default is 0.05.
- duprThreshold
The absolute correlation r value threshold to be considered duplicate. Default is 0.99.
- ncpus
The number of CPU cores to use for parallel processing. Default is 1.
- correct_chen_et_al_bug
Logical indicating whether to correct two bugs in the iterative filtering of the original Chen et al. implementation (see Details). Default is TRUE.
- min_dim
Minimum number of SNPs per window. Default is 2000.
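As a rough illustration of how these arguments fit together, the sketch below simulates a small reference genotype matrix, derives R and nSample from it, and evaluates the truncation rank formula given under nSample. The simulated data, the variable names, the reading of idx_size as the number of SNPs in a window, and the use of floor() to obtain an integer rank are all assumptions made for illustration. The dentist() call is left as a comment because it needs the package to be loaded and has not been run on this toy input.

```r
set.seed(1)

# Hypothetical LD reference panel: 200 samples x 50 SNPs coded 0/1/2.
n_ref <- 200
n_snp <- 50
X <- matrix(rbinom(n_ref * n_snp, size = 2, prob = 0.3),
            nrow = n_ref, ncol = n_snp)

# Summary statistics with the required 'pos' and 'z' columns
# (positions and z-scores are simulated, not real GWAS output).
sum_stat <- data.frame(
  pos = sort(sample.int(2e6, n_snp)),
  z   = rnorm(n_snp)
)

# When passing R instead of X, nSample must be supplied explicitly.
# cor() stands in here for the package's compute_LD().
R <- cor(X)
nSample <- nrow(X)

# Truncation rank K = min(idx_size, nSample) * propSVD.
# Assumption: idx_size is the number of SNPs in the window,
# and the product is rounded down to an integer.
propSVD <- 0.4
idx_size <- ncol(R)
K <- floor(min(idx_size, nSample) * propSVD)
print(K)

# With the package loaded, a call using only documented arguments
# would look like:
# res <- dentist(sum_stat, R = R, nSample = nSample)
```

Because K depends on the reference panel size rather than the GWAS sample size, passing the GWAS N as nSample would change the rank whenever the panel has fewer samples than there are SNPs in the window.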
Value
A data frame containing the imputed result and detected outliers.
The returned data frame includes the following columns:
- original_z
The original z-score values from the input sum_stat.
- imputed_z
The imputed z-score values computed by the Dentist algorithm.
- rsq
The coefficient of determination (R-squared) between original and imputed z-scores.
- iter_to_correct
The number of iterations required to correct the z-scores, if applicable.
- index_within_window
The index of the observation within the window.
- index_global
The global index of the observation.
- outlier_stat
The computed statistical value based on the original and imputed z-scores and R-squared.
- outlier
A logical indicator specifying whether the observation is identified as an outlier based on the statistical test.
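A typical next step is to separate flagged variants from the rest. The sketch below uses a hand-made data frame with the documented column names; every value in it is invented for illustration and is not output from dentist().

```r
# Mock result with the documented columns (all values are arbitrary).
res <- data.frame(
  original_z          = c(1.2, -0.4, 6.5, 0.3),
  imputed_z           = c(1.0, -0.5, 0.8, 0.2),
  rsq                 = c(0.90, 0.85, 0.95, 0.80),
  iter_to_correct     = c(NA, NA, 1L, NA),
  index_within_window = 1:4,
  index_global        = 101:104,
  outlier_stat        = c(0.4, 0.1, 650, 0.05),
  outlier             = c(FALSE, FALSE, TRUE, FALSE)
)

# Rows flagged as outliers by the logical 'outlier' column.
flagged <- res[res$outlier, ]
n_flagged <- nrow(flagged)

# Global indices of variants to keep for downstream analysis.
keep_idx <- res$index_global[!res$outlier]

print(flagged[, c("index_global", "original_z", "imputed_z")])
```

Filtering on index_global rather than index_within_window keeps the selection unambiguous when the region was split into more than one window.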
Details
Windowing uses the original DENTIST C++ binary's segmentingByDist algorithm (implemented in segment_by_dist). The correct_chen_et_al_bug parameter affects the iterative filtering in two ways:
- Comparison between iteration index t and nIter (explained in source code)
- The !grouping_tmp operator bug (explained in source code)