Genotype VCF File Quality Control

Genotype VCF File Quality Control#

This implements some recommendations from UK Biobank on sequence data quality control.

Description#

A major challenge in biomedical research is the quality control (QC) of sequencing data. False positive variant calls can hinder the ability to detect disease associated variants or introduce spurious associations, therefore the need for a rigorous QC. Our pipeline focuses on QC after the variant calling stage and requires project Variant Calling Format (pVCF) as input files. We have defined default thresholds for genotype and variant-level hard filtering based on recommendations from the UK Biobank team and a thorough review of the literature [cf. Carson et al. BMC Bioinformatics (2014),cf. Lek et al. Nature (2016),cf. Szustakowski et al. Nature Genetics (2021)]. Bcftools is used in our QC steps. We first handle multi-allelic sites by splitting them into bi-allelic records. We include an optional workflow to keep only bi-allelic sites in the data. Variants are then annotated based on dbSNP data. Genotypes are kept if they have a Genotype Depth (DP) >= 10 and a Genotype Quality (GQ) >= 20. Variants are included if at least one sample has an allelic balance (AB) >= 0.15 for Single Nucleotide Variants (SNVs) and AB>=0.2 for indels, variant missigness is below 20% and Hardy-Weinberg Equilibrium p-value is > 1e-08. Allele balance is calculated for heterozygotes as the number of bases supporting the least-represented allele over the total number of base observations. Output summary statistics, such as transistion/transversion ratios (TS/TV ratio) are calculated to determine the effectiveness of QC.

Default Parameters: VCF QC Filters#

Genotype depth filters: For WES data, UK Biobank recommends SNPs DP>10 and Indels DP>10 for indels. However we think for WGS we can be less stringent, or simply rely on GQ. Users can set it to 1 eg, --DP 1 --DP-indel 1
Genotype quality GQ>20.
At least one sample per site passed the allele balance threshold >= 0.15 for SNPs and >=0.20 for indels (heterozygous variants).
- Allele balance is calculated for heterozygotes as the number of bases supporting the least-represented allele over the total number of base observations.
This module also allows for filtering by HWE and missingness although by default we don’t filter on them.

Filtering are done with bcftools. Here is a useful cheatsheet from github user @elowy01.

A note on TS/TV summary from VCF genotype data#

bcftools stats command provides useful summary statistics including TS/TV ratio, which is routinely used as a quality measure of variant calls. With dbSNP based annotation of novel and known variants, bcftools can compute TS/TV for novel and known variants at variant level, and at sample level. It should be noted that variant level TS/TV does not take sample genotype into consideration – it simply counts the TS and TV event for observed SNPs in the data. Other tools, such as snpsift, implements variant level TS/TV by counting TS and TV events in sample genotypes and compute the ratio after summing up TS and TV across all samples. See here some discussions on this issue. We provide these TS/TV calculations before and after QC but users should be aware of the difference when interpreting the results.

Default Parameters: Resource Usage#

Memory: Usually a whole genome VCF.gz file has the size of 200+GB, after testing, a minimum of 60GB of mem is requried.
Walltimes: For every hour qc_1 or qc_2 can process ~14G of data. The default is set to be 24h, corresponding to ~300GB of input. Please set the –walltime parameter according to the size of your input files.

Input Files#

File	Produced by	Used as
`input/genotype/protocol_example.genotype.chr*.vcf.gz` (+ `.tbi`)	upstream genotype calling	per-chromosome target VCFs to QC
`input/genotype/protocol_example.genotype.vcf_list.txt`	manifest (2 columns: chrom number, VCF path)	list of VCFs for parallel QC
`input/reference_data/00-All.add_chr.variants.gz` (+ `.tbi`)	dbSNP, `chr`-renamed	dbSNP variants for known/novel annotation
`input/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta` (+ `.fai`)	reference genome	reference for left-normalization / REF check

Note: the toy protocol_example.genotype.* VCFs carry only the GT FORMAT field (no DP/GQ/AB), so QC is run with --skip_vcf_header_filtering True (see Step 3).

Output Files#

File	Produced by	Description
`output/vcf_qc/protocol_example.genotype.chr*.leftnorm.vcf.gz`	`qc`	left-normalized, variant-ID-assigned, dbSNP-annotated VCF
`output/vcf_qc/protocol_example.genotype.chr*.leftnorm.{known,novel}_variant.snipsift_tstv`	`qc`	Ts/Tv summary statistics (known vs novel variants)
`output/vcf_qc/protocol_example.genotype.chr*.leftnorm.{known,novel}_variant_sumstats`	`qc`	per-file variant summary statistics

The QC-ed genotype VCFs can be converted to PLINK format with vcf_to_plink in genotype_formatting.ipynb.

Minimal Working Example#

The steps below run on the toy protocol_example genotype data. Steps 1 and 2 are optional helpers; Step 3 is the main QC workflow.

Steps#

Step 1. Rename Chromosomes (optional)#

Add a chr prefix to chromosome names so they match the reference fasta. Timing: ~40 min on full data.

The toy genotypes are already chr-prefixed, so to demonstrate rename_chrs we ship a small derived input protocol_example.genotype.rawchr.chr22.vcf.gz whose chromosome is named 22 (no chr). Running rename_chrs adds the prefix back, writing protocol_example.genotype.rawchr.add_chr.vcf.gz next to the input.

sos run pipeline/VCF_QC.ipynb rename_chrs \
    --genoFile input/genotype/protocol_example.genotype.rawchr.chr22.vcf.gz \
    --cwd output/vcf_qc

Extracts CHROM/POS/ID/REF/ALT from a VCF into the compact .variants.gz dbSNP table used by the QC step. Here we run it on a toy genotype VCF to produce protocol_example.genotype.chr22.variants.gz.

Annotates against the dbSNP database in VCF format.

sos run pipeline/VCF_QC.ipynb dbsnp_annotate \
    --genoFile input/genotype/protocol_example.genotype.chr22.vcf.gz \
    --cwd output/vcf_qc

Step 2. Merge separated bed files into one#

Converting VCF to PLINK keeping only the ROSMAP samples.

sos run pipeline/genotype_formatting.ipynb vcf_to_plink \
    --genoFile `ls input/genotype/protocol_example.genotype.chr*.vcf.gz` \
    --cwd output/plink/

sos run pipeline/genotype_formatting.ipynb merge_plink \
    --genoFile `ls output/plink/protocol_example.genotype.chr*.bed` \
    --name protocol_example.genotype.merged \
    --cwd output/plink/

Step 3. Quality Control#

The main QC workflow: split multi-allelic sites, left-normalize indels, assign variant IDs, and annotate known/novel variants against dbSNP.

Run on a single chromosome (chr22) below, or in parallel over all chromosomes.

sos run pipeline/VCF_QC.ipynb qc \
    --genoFile input/genotype/protocol_example.genotype.chr22.vcf.gz \
    --dbsnp-variants input/reference_data/00-All.add_chr.variants.gz \
    --reference-genome input/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --cwd output/vcf_qc \
    --skip_vcf_header_filtering True

To run in parallel for all genotype data listed in protocol_example.genotype.vcf_list.txt:

Note on --skip_vcf_header_filtering True: the toy protocol_example.genotype.* VCFs contain only the GT (genotype) FORMAT field. The default depth/quality filtering relies on the DP FORMAT tag, which is absent here, so --skip_vcf_header_filtering True runs left-normalization, variant-ID assignment and dbSNP annotation while skipping the DP-based filtering — the appropriate path for genotype-only VCFs. To see the default path (which does apply the DP/GQ/AD filters), use the protocol_example.genotype.withfmt.chr22.vcf.gz file in Step 3b below, which carries those tags.

sos run pipeline/VCF_QC.ipynb qc \
    --genoFile input/genotype/protocol_example.genotype.vcf_list.txt \
    --dbsnp-variants input/reference_data/00-All.add_chr.variants.gz \
    --reference-genome input/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --cwd output/vcf_qc \
    --skip_vcf_header_filtering True \
    -j 2

Step 3b — Quality Control on data with DP/GQ/AD tags (default path)#

The genotype VCFs above are genotype-only (GT), so QC is run with --skip_vcf_header_filtering True to bypass the depth/quality filters. To demonstrate the default QC path (which filters on read depth DP, genotype quality GQ, and allele balance from AD), we ship protocol_example.genotype.withfmt.chr22.vcf.gz. Its DP/GQ/AD/AB values are synthetic placeholders (seed=22), not real measurements, and exist only to exercise the default filters.

sos run pipeline/VCF_QC.ipynb qc \
    --genoFile input/genotype/protocol_example.genotype.withfmt.chr22.vcf.gz \
    --dbsnp-variants input/reference_data/00-All.add_chr.variants.gz \
    --reference-genome input/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --cwd output/vcf_qc_withfmt \
    -j 2

Producing the following results (toy data, per chromosome):

In the toy protocol_example data almost all variants are novel (absent from the dbSNP 00-All set), so the known_variant Ts/Tv files are typically empty and the informative ratios come from the novel_variant files.
Example total Ts/Tv for novel variants: chr1 = 1.807, chr2 = 2.012, chr22 = 2.490.

The total Ts/Tv is read from the last (Total) column of the snpsift Ts/Tv table.

grep Ts/Tv output/vcf_qc/protocol_example.genotype.chr1.leftnorm.novel_variant.snipsift_tstv | rev | cut -d',' -f1 | rev

For known variants: in this toy dataset the known_variant Ts/Tv files are usually empty because the example variants are not present in dbSNP.

grep Ts/Tv output/vcf_qc/protocol_example.genotype.chr1.leftnorm.known_variant.snipsift_tstv | rev | cut -d',' -f1 | rev

To inspect novel-variant Ts/Tv on another chromosome (e.g. chr22):

grep Ts/Tv output/vcf_qc/protocol_example.genotype.chr22.leftnorm.novel_variant.snipsift_tstv | rev | cut -d',' -f1 | rev

Command Interface#

sos run VCF_QC.ipynb -h

Global parameters#

[global]
parameter: modular_script_dir = path('code/script')  # override with --modular-script-dir
# input can either be 1 vcf genoFile, or a list of vcf genoFile.
parameter: genoFile = paths
# The path to the file that contains the list of samples to remove (format FID, IID)
parameter: remove_samples = path('.')
# The path to the file that contains the list of samples to keep (format FID, IID)
parameter: keep_samples = path('.')
# Workdir
parameter: cwd = path("output")
# Number of threads
parameter: numThreads = 1
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Walltime 
parameter: walltime = '24h'
# Usually a whole genome VCF.gz file has the size of 200+GB, after testing, a minimum of 60GB of mem is requried.
parameter: mem = '60G'
# Software container option
parameter: container = ""
import re
parameter: entrypoint=""
# Whether to run GT-only VCF QC for inputs that do not carry per-sample DP/GQ/AD FORMAT fields.
parameter: gt_only_vcf_qc = False
# Deprecated alias for GT-only VCF QC.
parameter: skip_vcf_header_filtering = False
gt_only_vcf_qc = bool(gt_only_vcf_qc or skip_vcf_header_filtering)
vcf_qc_mode_suffix = '.gt_only' if gt_only_vcf_qc else ''
# use this function to edit memory string for PLINK input
from sos.utils import expand_size
cwd = path(f"{cwd:a}")

import os
def get_genotype_file(geno_file_paths):
    #
    def valid_geno_file(x):
        suffixes = path(x).suffixes
        if suffixes[-1] == '.bed':
            return True
        elif suffixes[-1] == '.vcf':
            return True
        elif len(suffixes)>1 and ''.join(suffixes[-2:]) == ".vcf.gz" :
            return True
        return False
    #
    def complete_geno_path(x, geno_file):
        if not valid_geno_file(x):
            raise ValueError(f"Genotype file {x} should be VCF (end with .vcf.gz) or PLINK bed file (end with .bed)")
        if not os.path.isfile(x):
            # relative path
            if not os.path.isfile(f'{geno_file:ad}/' + x):
                raise ValueError(f"Cannot find genotype file {x}")
            else:
                x = f'{geno_file:ad}/' + x
        return x
    # 
    def format_chrom(chrom):
        if chrom.startswith('chr'):
            chrom = chrom[3:]
        return chrom
    # Inputs are either VCF or bed, or a vector of them 
    if len(geno_file_paths) > 1:
        if all([valid_geno_file(x) for x in geno_file_paths]):
            return paths(geno_file_paths)
        else: 
            raise ValueError(f"Invalid input {geno_file_paths}")
    # Input is one genotype file or text list of genotype files
    geno_file = geno_file_paths[0]
    if valid_geno_file(geno_file):
        return paths(geno_file)
    else: 
        units = [x.strip().split() for x in open(geno_file).readlines() if x.strip() and not x.strip().startswith('#')]
        if all([len(x) == 1 for x in units]):
            return paths([complete_geno_path(x[0], geno_file) for x in units])
        elif all([len(x) == 2 for x in units]):
            genos = dict([(format_chrom(x[0]), path(complete_geno_path(x[1], geno_file))) for x in units])
        else:
            raise ValueError(f"{geno_file} should contain one column of file names, or two columns of chrom number and corresponding file name")
        return genos
    
genoFile = get_genotype_file(genoFile)

Annotate known and novel variants#

You can download the known variant reference from this link.

For a detailed explanation of the procedure and its rationale, please refer to this post.

[rename_chrs: provides = '{genoFile:nn}.add_chr.vcf.gz']
# This file can be downloaded from https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b150_GRCh38p7/VCF/00-All.vcf.gz.
input: genoFile
output: f'{_input:nn}.add_chr.vcf.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container = container, expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout', entrypoint=entrypoint
    for i in {1..22} X Y MT; do echo "$i chr$i"; done > ${_output:nn}.chr_name_conv.txt
    bcftools annotate --rename-chrs ${_output:nn}.chr_name_conv.txt ${_input} -Oz -o ${_output}
    tabix -p vcf ${_output}
    rm -f ${_output:nn}.chr_name_conv.txt

[dbsnp_annotate]
input: genoFile
output: f"{cwd}/{_input:bnn}.variants.gz"
task: trunk_workers = 1, trunk_size=job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container = container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', entrypoint=entrypoint
    # Extract specific fields from the VCF file using bcftools
    bcftools query -f'%CHROM\t%POS\t%ID\t%REF\t%ALT\n' ${_input} | \
    awk 'BEGIN { 
            OFS="\t" 
         } 
         {
            # Calculate end position based on the length of REF or ALT
            if (length($4) > length($5)) {
                end_pos = $2 + (length($4) - 1)
            } else {
                end_pos = $2 + (length($5) - 1)
            }
            print $1, $2, end_pos, $3
         }' | \
    # Compress the output using bgzip
    bgzip -c > ${_output}

Genotype QC#

This step handles multi-allelic sites and annotate variants to known and novel. We add an RS ID to variants in dbSNP. Variants without rsID are considered novel variants. For every hour it can produce ~14Gb of data, please set the –walltime parameter according to the size of your input files.

# Handel multi-allelic sites, left normalization of indels and add variant ID
[qc_1 (variant preprocessing)]
# Path to dbSNP variants generated previously
parameter: dbsnp_variants = path
# Path to fasta file for HG reference genome, eg GRCh38_full_analysis_set_plus_decoy_hla.fa
parameter: reference_genome = path
parameter: bi_allelic = False
parameter: snp_only = False
input: genoFile, group_by = 1
output: f'{cwd}/{_input:bnn}.{"leftnorm" if not bi_allelic else "biallelic"}{".snp" if snp_only else ""}{vcf_qc_mode_suffix}.vcf.gz'
task: trunk_workers = 1, trunk_size=job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    bash ${modular_script_dir}/data_preprocessing/genotype/VCF_QC.sh qc \
        --cwd "${cwd}" \
        --genoFile "${_input}" \
        --output "${_output}" \
        --dbsnp-variants "${dbsnp_variants}" \
        --reference-genome "${reference_genome}" \
        --bi-allelic "${bi_allelic}" \
        --snp-only "${snp_only}" \
        --gt-only-vcf-qc "${gt_only_vcf_qc}" \
        --numThreads ${numThreads}

This step filter variants based on FILTER PASS, DP and QC, fraction of missing genotypes (all samples), and on HWE, for snps and indels. It will also remove monomorphic sites – using bcftools view -c1.

# genotype QC
[qc_2 (variant level QC)]
# Maximum missingess per-variant, default to 0.2
parameter: geno_filter = 0.2
# Sample level QC - read depth (DP) to filter out SNPs below this value
# Default to 10, with WES data in mind 
# But for WGS, setting it to 2 may be fine considering the WGS may have low DP but the GQ filter should be good enough
parameter: DP_snp = 10
# Sample level QC - genotype quality (GQ) of specific sample. This measure tells you how confident we are that the genotype we assigned to a particular sample is correct
parameter: GQ = 20
# Sample level QC - read depth (DP) to filter out indels below this value
parameter: DP_indel = 10
# Allele balance for snps
parameter: AB_snp = 0.15
# Allele balance for indels
parameter: AB_indel = 0.2
# HWE filter, default to 0.0 which means no HWE filter is applied
parameter: hwe_filter = 0.0
output: f"{_input:nn}.bcftools_qc.vcf.gz"
task: trunk_workers = 1, trunk_size=job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    bash ${modular_script_dir}/data_preprocessing/genotype/VCF_QC.sh qc_2 \
        --cwd "${cwd}" \
        --genoFile "${_input}" \
        --output "${_output}" \
        --gt-only-vcf-qc "${gt_only_vcf_qc}" \
        --geno-filter ${geno_filter} \
        --DP-snp ${DP_snp} \
        --GQ ${GQ} \
        --DP-indel ${DP_indel} \
        --AB-snp ${AB_snp} \
        --AB-indel ${AB_indel} \
        --hwe-filter ${hwe_filter} \
        --numThreads ${numThreads}

[qc_3 (genotype data summary statistics)]
input: output_from('qc_1'), output_from('qc_2'), group_by = 1
output: f"{cwd}/{_input:bnn}.novel_variant_sumstats", 
        f"{cwd}/{_input:bnn}.known_variant_sumstats", 
        f"{cwd}/{_input:bnn}.novel_variant.snipsift_tstv",
        f"{cwd}/{_input:bnn}.known_variant.snipsift_tstv"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
    bash ${modular_script_dir}/data_preprocessing/genotype/VCF_QC.sh qc_3 \
        --cwd "${cwd}" \
        --genoFile "${_input}" \
        --novel-sumstats "${_output[0]}" \
        --known-sumstats "${_output[1]}" \
        --novel-tstv "${_output[2]}" \
        --known-tstv "${_output[3]}" \
        --numThreads ${numThreads}

Anticipated Results#

The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.