Takes the same file-based inputs as the DENTIST C++ binary (summary stats file + PLINK bfile) and runs the R dentist implementation. This allows direct comparison between the R implementation and the C++ binary on identical data.

Usage

dentist_from_files(
  gwas_summary,
  bfile,
  nSample = NULL,
  window_size = 2e+06,
  pValueThreshold = 5.0369e-08,
  propSVD = 0.4,
  gcControl = FALSE,
  nIter = 10,
  gPvalueThreshold = 0.05,
  duprThreshold = 0.99,
  ncpus = 1,
  correct_chen_et_al_bug = TRUE,
  min_dim = 2000,
  use_gcta_LD = FALSE,
  verbose = TRUE
)

Arguments

gwas_summary

Path to the GWAS summary statistics file (DENTIST format: tab-separated, 8 columns with header: SNP A1 A2 freq beta se p N). May be gzipped.
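As an illustration of the expected layout, the following base R sketch writes a small gzipped file with the eight columns named above and reads it back. The SNP IDs and all values are invented, and the file is read here with read.table rather than with the package's own reader.

```r
# Toy summary statistics with the eight DENTIST columns (values are invented).
sumstat <- data.frame(
  SNP  = c("rs1", "rs2", "rs3"),
  A1   = c("A", "C", "G"),
  A2   = c("G", "T", "A"),
  freq = c(0.12, 0.45, 0.30),
  beta = c(0.021, -0.013, 0.004),
  se   = c(0.010, 0.009, 0.011),
  p    = c(0.036, 0.149, 0.716),
  N    = c(50000L, 50000L, 50000L),
  stringsAsFactors = FALSE
)

# Write tab-separated with a header line, gzip-compressed.
path <- tempfile(fileext = ".txt.gz")
con <- gzfile(path, "w")
write.table(sumstat, con, sep = "\t", quote = FALSE, row.names = FALSE)
close(con)

# Read it back; read.table decompresses gzipped files transparently.
# Explicit colClasses keep allele columns as character (a column of "T"
# values would otherwise be parsed as logical TRUE).
check <- read.table(
  path, header = TRUE, sep = "\t",
  colClasses = c(rep("character", 3), rep("numeric", 4), "integer")
)
```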

bfile

PLINK binary file prefix (expects .bed/.bim/.fam files).

nSample

Number of samples in the LD reference panel. If NULL (recommended), uses the reference panel size from the genotype matrix. This controls the SVD truncation rank K = min(idx_size, nSample) * propSVD. Note: this is the reference panel size, NOT the GWAS sample size. Default is NULL.
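The rank formula can be checked with a few lines of R. This is a minimal sketch: svd_rank is a hypothetical helper, idx_size is taken to mean the number of SNPs in the window, and rounding down with floor() is an assumption, since the rounding rule is not stated here.

```r
# Hypothetical helper reproducing K = min(idx_size, nSample) * propSVD.
# floor() is an assumption about how a non-integer K would be rounded.
svd_rank <- function(idx_size, nSample, propSVD = 0.4) {
  floor(min(idx_size, nSample) * propSVD)
}

k_small_panel <- svd_rank(idx_size = 2000, nSample = 500)    # panel size is the limit
k_large_panel <- svd_rank(idx_size = 2000, nSample = 10000)  # SNP count is the limit
```

With a small reference panel the rank is capped by nSample, which is why passing the GWAS sample size here would change the truncation.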

window_size

Window size in base pairs. Default is 2000000.

pValueThreshold

P-value threshold for outlier detection. Default is 5.0369e-8.

propSVD

SVD truncation proportion. Default is 0.4.

gcControl

Logical; apply genomic control. Default is FALSE.

nIter

Number of QC iterations. Default is 10.

gPvalueThreshold

GWAS p-value threshold for grouping. Default is 0.05.

duprThreshold

LD r-squared threshold for duplicate detection. Default is 0.99.

ncpus

Number of CPU threads. Default is 1.

correct_chen_et_al_bug

Logical; correct known bugs in the original DENTIST implementation. Default is TRUE.

min_dim

Minimum number of SNPs per window. Default is 2000.

use_gcta_LD

Logical; use GCTA-style LD computation and raw B-allele counts to match the DENTIST binary's exact floating-point behavior. Requires snpStats. Default is FALSE; set to TRUE when comparing against the binary.

verbose

Logical; print progress messages. Default is TRUE.

Value

A list with components:

result

Data frame from dentist with outlier detection results.

sum_stat

Aligned summary statistics data frame (with SNP names and positions).

LD_mat

The LD correlation matrix used.

Details

This function reuses existing package utilities for file I/O, allele QC, and LD computation: read_bim reads PLINK bim files, allele_qc matches and flips alleles, load_genotype_region loads genotypes, and compute_LD (from misc.R) computes the LD matrix with mean imputation, using Rfast::cora when available.
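To make the LD step concrete, here is a minimal base R sketch of mean imputation followed by a correlation matrix. It is not the package's compute_LD: mean_impute is a hypothetical helper, the genotypes are invented, and base cor() stands in for Rfast::cora.

```r
# Toy genotype matrix: 6 samples x 3 SNPs, coded 0/1/2 with missing values.
X <- matrix(c(0, 1, 2, NA, 1, 0,
              1, 1, 2, 0, NA, 0,
              2, 0, 1, 1, 0, 2),
            nrow = 6,
            dimnames = list(NULL, c("rs1", "rs2", "rs3")))

# Replace each missing genotype with the mean of its SNP (column).
mean_impute <- function(X) {
  mu <- colMeans(X, na.rm = TRUE)
  miss <- which(is.na(X), arr.ind = TRUE)
  X[miss] <- mu[miss[, "col"]]
  X
}

X_imp <- mean_impute(X)

# Pearson correlation between SNP columns gives the LD correlation matrix.
LD <- cor(X_imp)
```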

This function performs the full pipeline:

  1. Reads the summary statistics file via read_dentist_sumstat.

  2. Reads the PLINK bim file via read_bim and matches SNPs by ID to obtain chromosome and position information.

  3. Aligns alleles using allele_qc, which handles strand flips, allele swaps, and z-score sign flipping.

  4. Loads genotype data via load_genotype_region and computes the LD correlation matrix via compute_LD (mean imputation + Rfast::cora).

  5. Calls dentist with the aligned data.
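The sign flip in step 3 can be illustrated with a toy function. This is a simplified sketch, not the package's allele_qc: align_z is a hypothetical helper that handles only allele swaps, ignores strand flips, and marks irreconcilable SNPs as NA.

```r
# Toy illustration: flip the z-score sign when A1/A2 are swapped relative
# to the reference panel. Strand flips are deliberately not handled.
align_z <- function(z, a1, a2, ref_a1, ref_a2) {
  same    <- a1 == ref_a1 & a2 == ref_a2
  swapped <- a1 == ref_a2 & a2 == ref_a1
  z_aligned <- ifelse(swapped, -z, z)
  z_aligned[!(same | swapped)] <- NA_real_  # alleles cannot be reconciled
  z_aligned
}

z <- c(2.1, -1.4, 0.4)
aligned <- align_z(
  z,
  a1     = c("A", "C", "G"), a2     = c("G", "T", "A"),
  ref_a1 = c("A", "T", "G"), ref_a2 = c("G", "C", "T")
)
# First SNP matches, second is swapped (sign flipped), third is dropped (NA).
```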

The result includes the aligned summary statistics and LD matrix so they can be reused for further analysis or debugging.
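A call for comparison against the binary might look like the following sketch. The file paths are hypothetical placeholders, and the guard simply skips the call when the function is not loaded or the inputs are missing.

```r
# Hypothetical input paths; replace with real files.
sumstat_file <- "gwas.dentist.txt.gz"
bfile_prefix <- "ref_panel/chr22"

res <- NULL
inputs_ok <- file.exists(sumstat_file) &&
  all(file.exists(paste0(bfile_prefix, c(".bed", ".bim", ".fam"))))

# Only run when the function is available and all inputs are present.
if (exists("dentist_from_files") && inputs_ok) {
  res <- dentist_from_files(
    gwas_summary = sumstat_file,
    bfile        = bfile_prefix,
    use_gcta_LD  = TRUE,   # documented setting when comparing against the binary
    verbose      = FALSE
  )
  str(res$result)  # outlier detection results
  dim(res$LD_mat)  # LD matrix, reusable for further analysis
}
```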