Genotype Preprocessing#

This notebook is a minimal working example (MWE) of the genotype preprocessing section of the xQTL protocol. It chains the genotype worker pipelines on the bundled toy dataset (60 samples, protocol_example.*) so the whole section runs end-to-end without editing.

Description#

The genotype preprocessing section takes raw genotype data (VCF) through quality control, format conversion, and produces clean PLINK files plus genetic principal components for use as covariates downstream. It chains four worker pipelines:

  • VCF_QC.ipynb — quality control on VCF files

  • genotype_formatting.ipynb — VCF→PLINK conversion, merging, and partition by chromosome

  • GWAS_QC.ipynb — QC on PLINK files, sample matching, kinship, and preparing unrelated individuals

  • PCA.ipynb — principal component analysis on the unrelated samples

Timing: ~3–5 min for the toy dataset on a single node.

Input Files#

File

Description

input/genotype/protocol_example.genotype.chr*.vcf.gz

Per-chromosome toy genotype VCFs (60 samples)

input/reference_data/00-All.add_chr.variants.gz

dbSNP variant reference (for annotation)

input/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta

GRCh38 reference genome

input/rnaseq/protocol_example.rnaseq.bed.gz

Molecular phenotype, used for sample matching

Output Files#

File

Description

output/vcf_qc/*.leftnorm.*.vcf.gz

QC-passed, left-normalized VCF

output/genotype_formatting/plink/protocol_example.genotype.merged.bed

Merged genome-wide PLINK dataset

output/gwas_qc/plink/protocol_example.genotype.merged.plink_qc.bed

QC-passed PLINK dataset

output/genotype_formatting/genotype_by_chrom/*.bed

Per-chromosome PLINK files

output/gwas_qc/kinship/*.unrelated.bed

Unrelated-individuals subset

output/pca_uf/*.pca.rds / *.pca.scree.png

Genetic principal components and scree plot

Steps#

Run the commands below in order from the toy example root. Each one invokes the corresponding worker pipeline through the pipeline/ symlinks. Outputs from earlier commands feed the later ones.

Quality control on VCF files#

Apply variant-level QC to a genotype VCF (here chromosome 22), annotating against dbSNP and the GRCh38 reference. --skip_vcf_header_filtering True is used for the toy VCFs.

sos run pipeline/VCF_QC.ipynb qc \
    --genoFile input/genotype/protocol_example.genotype.chr22.vcf.gz \
    --dbsnp-variants input/reference_data/00-All.add_chr.variants.gz \
    --reference-genome input/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --cwd output/vcf_qc \
    --skip_vcf_header_filtering True

Partition genotype data by chromosome#

Split the QC-passed PLINK dataset back into per-chromosome files, which several downstream xQTL steps expect.

sos run pipeline/genotype_formatting.ipynb genotype_by_chrom \
    --genoFile output/gwas_qc/plink/protocol_example.genotype.merged.plink_qc.bed \
    --cwd output/genotype_by_chrom \
    --chrom `cut -f 1 output/gwas_qc/plink/protocol_example.genotype.merged.plink_qc.bim | uniq | sed "s/chr//g"` \
    -j 4

Match samples with the molecular phenotype#

Identify the samples that have both genotype and molecular phenotype data, producing the overlap sample list.

sos run pipeline/GWAS_QC.ipynb genotype_phenotype_sample_overlap \
    --cwd output/gwas_qc/genotype \
    --genoFile output/gwas_qc/plink/protocol_example.genotype.merged.plink_qc.fam \
    --phenoFile input/rnaseq/protocol_example.rnaseq.bed.gz

Estimate kinship#

Run KING to estimate relatedness among the matched samples and flag related pairs.

sos run pipeline/GWAS_QC.ipynb king \
    --cwd output/gwas_qc/kinship \
    --genoFile output/gwas_qc/plink/protocol_example.genotype.merged.plink_qc.bed \
    --name protocol_example.king \
    --keep-samples output/gwas_qc/genotype/protocol_example.rnaseq.bed.sample_genotypes.txt

Prepare unrelated individuals for PCA#

Apply a minor-allele-count filter to the unrelated subset produced by the kinship step, yielding the pruned dataset used for PCA.

sos run pipeline/GWAS_QC.ipynb qc \
    --cwd output/gwas_qc/genotype \
    --genoFile output/gwas_qc/kinship/protocol_example.genotype.merged.plink_qc.protocol_example.king.unrelated.bed \
    --mac-filter 5

Principal component analysis on genotype#

Run flashPCA on the unrelated, pruned dataset to compute genetic principal components for use as covariates.

Anticipated Results#

The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.

sos run pipeline/PCA.ipynb flashpca \
    --cwd output/pca_uf \
    --genoFile output/gwas_qc/genotype/protocol_example.genotype.merged.plink_qc.protocol_example.king.unrelated.plink_qc.prune.bed \
    --name protocol_example

Compute a genotype PCA directly on the basic QC-passed merged dataset (MAC-filtered and LD-pruned) and write results to output/genotype/genotype_pca/. This .rds model is used by the covariate formatting step.

sos run pipeline/GWAS_QC.ipynb qc \
    --cwd output/genotype/genotype_pca \
    --genoFile output/gwas_qc/plink/protocol_example.genotype.merged.plink_qc.bed \
    --mac-filter 5
sos run pipeline/PCA.ipynb flashpca \
    --cwd output/genotype/genotype_pca \
    --genoFile output/genotype/genotype_pca/protocol_example.genotype.merged.plink_qc.plink_qc.prune.bed