Run R DENTIST Implementation on DENTIST-format Input Files
Source: R/dentist_qc.R
dentist_from_files.Rd

Takes the same file-based inputs as the DENTIST C++ binary (summary stats file +
PLINK bfile) and runs the R dentist implementation. This allows
direct comparison between the R implementation and the C++ binary on identical data.
Usage
dentist_from_files(
gwas_summary,
bfile,
nSample = NULL,
window_size = 2e+06,
pValueThreshold = 5.0369e-08,
propSVD = 0.4,
gcControl = FALSE,
nIter = 10,
gPvalueThreshold = 0.05,
duprThreshold = 0.99,
ncpus = 1,
correct_chen_et_al_bug = TRUE,
min_dim = 2000,
use_gcta_LD = FALSE,
verbose = TRUE
)
Arguments
- gwas_summary
Path to the GWAS summary statistics file (DENTIST format: tab-separated, 8 columns with header: SNP A1 A2 freq beta se p N). May be gzipped.
- bfile
PLINK binary file prefix (expects .bed/.bim/.fam files).
- nSample
Number of samples in the LD reference panel. If NULL (recommended), uses the reference panel size from the genotype matrix. This controls the SVD truncation rank K = min(idx_size, nSample) * propSVD. Note: this is the reference panel size, NOT the GWAS sample size. Default is NULL.
- window_size
Window size in base pairs. Default is 2000000.
- pValueThreshold
P-value threshold for outlier detection. Default is 5.0369e-8.
- propSVD
SVD truncation proportion. Default is 0.4.
- gcControl
Logical; apply genomic control. Default is FALSE.
- nIter
Number of QC iterations. Default is 10.
- gPvalueThreshold
GWAS p-value threshold for grouping. Default is 0.05.
- duprThreshold
LD r-squared threshold for duplicate detection. Default is 0.99.
- ncpus
Number of CPU threads. Default is 1.
- correct_chen_et_al_bug
Logical; correct known bugs in original DENTIST. Default is TRUE.
- min_dim
Minimum number of SNPs per window. Default is 2000.
- use_gcta_LD
Logical; use GCTA-style LD computation and raw B-allele counts to match the DENTIST binary's exact floating-point behavior. Requires snpStats. Default is FALSE; set to TRUE when comparing against the binary.
- verbose
Logical; print progress messages. Default is TRUE.
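As an illustration of two of the arguments above, the following sketch writes a toy summary statistics file in the 8-column layout expected by gwas_summary and then works through the truncation rank formula from the nSample entry. All SNP names and numbers are made up, the variable names (sumstat_path, idx_size) are only for this example, and how the package rounds K to an integer is not stated here, so the value is left unrounded.

```r
# Toy GWAS summary statistics in the DENTIST layout:
# SNP A1 A2 freq beta se p N (values are invented for illustration).
sumstat <- data.frame(
  SNP  = c("rs1", "rs2", "rs3"),
  A1   = c("A", "C", "G"),
  A2   = c("G", "T", "A"),
  freq = c(0.12, 0.45, 0.30),
  beta = c(0.021, -0.015, 0.004),
  se   = c(0.010, 0.008, 0.009),
  p    = c(0.036, 0.061, 0.657),
  N    = c(50000, 50000, 50000),
  stringsAsFactors = FALSE
)

# Write a tab-separated file with a header row.
sumstat_path <- tempfile(fileext = ".txt")
write.table(sumstat, sumstat_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back to confirm the column layout.
check <- read.delim(sumstat_path, stringsAsFactors = FALSE)
print(names(check))

# Truncation rank K = min(idx_size, nSample) * propSVD.
# idx_size (number of SNPs in a window) and nSample (reference panel size)
# are assumed values chosen for this example.
idx_size <- 2000
nSample  <- 500
propSVD  <- 0.4
K <- min(idx_size, nSample) * propSVD  # 500 * 0.4
print(K)
```

With a reference panel smaller than the window, the panel size rather than the SNP count bounds K, which is why nSample should describe the LD reference panel and not the GWAS.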
Value
A list with components:
- result
Data frame from dentist with outlier detection results.
- sum_stat
Aligned summary statistics data frame (with SNP names and positions).
- LD_mat
The LD correlation matrix used.
Details
This function reuses existing package utilities for file I/O, allele QC, and LD
computation: read_bim for PLINK bim files, allele_qc
for allele matching/flipping, load_genotype_region for genotype loading,
and compute_LD (from misc.R) for LD matrix computation with mean imputation
and Rfast::cora when available.
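The idea of an LD correlation matrix computed after mean imputation can be sketched in base R as below. This is not the package's compute_LD; it is a minimal stand-in that uses stats::cor in place of Rfast::cora, and the helper name mean_impute and the toy genotype matrix are assumptions made for the example.

```r
set.seed(1)

# Toy genotype matrix: 50 samples x 4 SNPs, coded as 0/1/2 allele counts.
X <- matrix(rbinom(50 * 4, size = 2, prob = 0.3), nrow = 50,
            dimnames = list(NULL, paste0("rs", 1:4)))

# Introduce a few missing genotype calls.
X[c(3, 17), 2] <- NA
X[8, 4] <- NA

# Replace each missing genotype with the mean of the observed
# genotypes for the same SNP (column).
mean_impute <- function(geno) {
  for (j in seq_len(ncol(geno))) {
    miss <- is.na(geno[, j])
    if (any(miss)) geno[miss, j] <- mean(geno[, j], na.rm = TRUE)
  }
  geno
}

X_imp <- mean_impute(X)

# Pearson correlation between SNP columns gives the LD (r) matrix.
LD <- cor(X_imp)
print(round(LD, 3))
```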
This function performs the full pipeline:
1. Reads the summary statistics file via read_dentist_sumstat.
2. Reads the PLINK bim file via read_bim and matches SNPs by ID to obtain chromosome and position information.
3. Aligns alleles using allele_qc, which handles strand flips, allele swaps, and z-score sign flipping.
4. Loads genotype data via load_genotype_region and computes the LD correlation matrix via compute_LD (mean imputation + Rfast::cora).
5. Calls dentist with the aligned data.
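The allele alignment step in the pipeline above can be illustrated with a simplified sketch. This is not the package's allele_qc and does not reproduce its interface; the functions complement and align_alleles are hypothetical, and dropping strand-ambiguous (A/T, C/G) SNPs is an assumption made here to keep the logic unambiguous.

```r
# Complement each allele (strand flip).
complement <- function(a) chartr("ACGT", "TGCA", a)

# Compare summary-statistic alleles against reference-panel alleles and
# flip the z-score sign whenever the effect allele is swapped.
align_alleles <- function(sum_a1, sum_a2, ref_a1, ref_a2, z) {
  c1 <- complement(sum_a1)
  c2 <- complement(sum_a2)

  same      <- sum_a1 == ref_a1 & sum_a2 == ref_a2  # already aligned
  swapped   <- sum_a1 == ref_a2 & sum_a2 == ref_a1  # A1/A2 exchanged
  strand    <- c1 == ref_a1 & c2 == ref_a2          # opposite strand
  strand_sw <- c1 == ref_a2 & c2 == ref_a1          # opposite strand + exchanged

  # Strand-ambiguous SNPs cannot be resolved from alleles alone;
  # this sketch simply drops them (an assumption, see lead-in).
  palindromic <- sum_a1 == complement(sum_a2)

  keep <- (same | swapped | strand | strand_sw) & !palindromic
  flip <- keep & (swapped | strand_sw)

  data.frame(keep = keep, z = ifelse(flip, -z, z))
}

aligned <- align_alleles(
  sum_a1 = c("A", "G", "T", "A", "A"),
  sum_a2 = c("G", "A", "C", "T", "C"),
  ref_a1 = c("A", "A", "A", "A", "G"),
  ref_a2 = c("G", "G", "G", "T", "T"),
  z      = c(1.5, 2.0, -0.7, 1.0, 0.3)
)
print(aligned)
```

In this toy input the second and fifth SNPs have their z-scores negated (allele swap, and strand flip plus swap), the third is kept unchanged after a strand flip, and the fourth is dropped as strand-ambiguous.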
The result includes the aligned summary statistics and LD matrix so they can be reused for further analysis or debugging.