DENTIST (Detecting Errors iN analyses of summary staTISTics) is a quality control tool for GWAS summary data. It uses linkage disequilibrium (LD) information from a reference panel to identify and correct problematic variants by comparing observed GWAS statistics to predicted values. It can detect errors in genotyping/imputation, allelic errors, and heterogeneity between GWAS and LD reference samples.
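The idea of comparing an observed statistic with an LD-based prediction can be illustrated with a toy calculation. The sketch below is not the package's implementation (which, per the arguments documented on this page, works on windows with a truncated SVD and several iterations). It assumes the textbook conditional expectation of one z-score given the others under a multivariate normal model, and all values and variable names are made up for illustration.

```r
# Toy illustration only; values are invented and this is not dentist() itself.
# LD correlation matrix for 3 SNPs.
R <- matrix(c(1.0, 0.8, 0.6,
              0.8, 1.0, 0.7,
              0.6, 0.7, 1.0), nrow = 3, byrow = TRUE)

# Observed z-scores; SNP 3 has the opposite sign to its LD neighbours.
z <- c(4.0, 3.5, -2.0)

target <- 3        # SNP being checked
ref    <- c(1, 2)  # SNPs used to predict it

# Weights R_tt^{-1} R_ti from the conditional normal distribution
w <- solve(R[ref, ref], R[ref, target])

z_pred <- sum(w * z[ref])          # predicted z-score for the target SNP
rsq    <- sum(w * R[ref, target])  # share of variance explained by the reference SNPs

# Discrepancy between observed and predicted, scaled by the unexplained variance;
# assumed here to be compared against a chi-square distribution with 1 df.
stat <- (z[target] - z_pred)^2 / (1 - rsq)
p    <- pchisq(stat, df = 1, lower.tail = FALSE)
```

A very small p here indicates that the observed z-score is inconsistent with what the LD structure predicts, which is the kind of variant the tool is meant to flag.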

Usage

dentist(
  sum_stat,
  R = NULL,
  X = NULL,
  nSample = NULL,
  window_size = 2e+06,
  pValueThreshold = 5.0369e-08,
  propSVD = 0.4,
  gcControl = FALSE,
  nIter = 10,
  gPvalueThreshold = 0.05,
  duprThreshold = 0.99,
  ncpus = 1,
  correct_chen_et_al_bug = TRUE,
  min_dim = 2000
)

Arguments

sum_stat

A data frame containing summary statistics, including 'pos' or 'position' and 'z' or 'zscore' columns.

R

Square LD correlation matrix. Provide either R or X.

X

Genotype matrix (samples x SNPs). If provided, LD is computed via compute_LD(X) and nSample defaults to nrow(X).

nSample

The number of samples in the LD reference panel (NOT the GWAS sample size). This controls the SVD truncation rank K = min(idx_size, nSample) * propSVD. Required when R is provided; inferred from X when X is provided.

window_size

The size of the window for dividing the genomic region. Default is 2000000.

pValueThreshold

The p-value threshold for significance. Default is 5.0369e-08.

propSVD

The proportion of singular value decomposition (SVD) components to retain. Together with nSample, it sets the truncation rank K (see nSample). Default is 0.4.

gcControl

Logical indicating whether genomic control should be applied. Default is FALSE.

nIter

The number of iterations for the Dentist algorithm. Default is 10.

gPvalueThreshold

The genomic p-value threshold for significance. Default is 0.05.

duprThreshold

The absolute correlation r value threshold to be considered duplicate. Default is 0.99.

ncpus

The number of CPU cores to use for parallel processing. Default is 1.

correct_chen_et_al_bug

Logical indicating whether to correct two bugs in the iterative filtering of the original Chen et al. implementation (see Details). Default is TRUE.

min_dim

Minimum number of SNPs per window. Default is 2000.
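The inputs described above can be sketched with simulated data. This is a minimal illustration under stated assumptions: the genotype matrix and z-scores are random, base R's cor() stands in for the package's compute_LD(), idx_size is taken to be the number of SNPs in a window, and the non-strict comparison used for duprThreshold is a guess at how the threshold is applied.

```r
set.seed(1)

# Hypothetical LD reference panel: 200 samples x 50 SNPs, genotypes coded 0/1/2
n_ref <- 200
n_snp <- 50
X <- matrix(rbinom(n_ref * n_snp, size = 2, prob = 0.3),
            nrow = n_ref, ncol = n_snp)

# Pearson correlation between SNP columns; a stand-in for compute_LD(X)
LD_mat <- cor(X)

# Summary statistics with the required 'pos' and 'z' columns
sum_stat <- data.frame(
  pos = sort(sample.int(2e6, n_snp)),
  z   = rnorm(n_snp)
)

# Truncation rank as documented for nSample:
# K = min(idx_size, nSample) * propSVD
idx_size <- n_snp  # assumption: number of SNPs in the window
propSVD  <- 0.4
K <- min(idx_size, n_ref) * propSVD

# SNP pairs whose absolute correlation reaches duprThreshold
# (assumption: the comparison is non-strict)
duprThreshold <- 0.99
dup_pairs <- which(abs(LD_mat) >= duprThreshold & upper.tri(LD_mat),
                   arr.ind = TRUE)
```

With objects like these, a call would take the form dentist(sum_stat, R = LD_mat, nSample = n_ref), or dentist(sum_stat, X = X) to let nSample default to nrow(X).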

Value

A data frame containing the imputed result and detected outliers.

The returned data frame includes the following columns:

original_z

The original z-score values from the input sum_stat.

imputed_z

The imputed z-score values computed by the Dentist algorithm.

rsq

The coefficient of determination (R-squared) between original and imputed z-scores.

iter_to_correct

The number of iterations required to correct the z-scores, if applicable.

index_within_window

The index of the observation within the window.

index_global

The global index of the observation.

outlier_stat

The computed statistical value based on the original and imputed z-scores and R-squared.

outlier

A logical indicator specifying whether the observation is identified as an outlier based on the statistical test.
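As a sketch of how the returned columns might be used downstream, the example below builds a mock result by hand (the values are invented and only a subset of the documented columns is included) and separates flagged variants from the rest.

```r
# Mock result with some of the documented columns; values are made up
res <- data.frame(
  original_z = c(1.2, -0.4, 6.1, 2.3),
  imputed_z  = c(1.0, -0.5, 1.5, 2.1),
  rsq        = c(0.90, 0.85, 0.80, 0.95),
  outlier    = c(FALSE, FALSE, TRUE, FALSE)
)

# Variants that passed quality control
clean <- res[!res$outlier, ]

# Flagged variants, ordered by the gap between observed and imputed z-scores
flagged <- res[res$outlier, ]
flagged <- flagged[order(-abs(flagged$original_z - flagged$imputed_z)), ]
```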

Details

Windowing uses the original DENTIST C++ binary's segmentingByDist algorithm (implemented in segment_by_dist). The correct_chen_et_al_bug parameter affects the iterative filtering in two ways:

  1. Comparison between iteration index t and nIter (explained in source code)

  2. The !grouping_tmp operator bug (explained in source code)

Examples

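# Example usage of dentist
# The following objects must already exist in the calling environment:
#   sum_stat: data frame with 'pos' (or 'position') and 'z' (or 'zscore') columns
#   LD_mat:   square LD correlation matrix; nSample: LD reference panel size
dentist(sum_stat, R = LD_mat, nSample = nSample)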