Genotype Preprocessing#
This notebook is a minimal working example (MWE) of the genotype preprocessing section of the xQTL protocol. It chains the genotype worker pipelines on the bundled toy dataset (60 samples, protocol_example.*) so the whole section runs end-to-end without editing.
Description#
The genotype preprocessing section takes raw genotype data (VCF) through quality control, format conversion, and produces clean PLINK files plus genetic principal components for use as covariates downstream. It chains four worker pipelines:
VCF_QC.ipynb— quality control on VCF filesgenotype_formatting.ipynb— VCF→PLINK conversion, merging, and partition by chromosomeGWAS_QC.ipynb— QC on PLINK files, sample matching, kinship, and preparing unrelated individualsPCA.ipynb— principal component analysis on the unrelated samples
Timing: ~3–5 min for the toy dataset on a single node.
Input Files#
File |
Description |
|---|---|
|
Per-chromosome toy genotype VCFs (60 samples) |
|
dbSNP variant reference (for annotation) |
|
GRCh38 reference genome |
|
Molecular phenotype, used for sample matching |
Output Files#
File |
Description |
|---|---|
|
QC-passed, left-normalized VCF |
|
Merged genome-wide PLINK dataset |
|
QC-passed PLINK dataset |
|
Per-chromosome PLINK files |
|
Unrelated-individuals subset |
|
Genetic principal components and scree plot |
Steps#
Run the commands below in order from the toy example root. Each one invokes the corresponding worker pipeline through the pipeline/ symlinks. Outputs from earlier commands feed the later ones.
Quality control on VCF files#
Apply variant-level QC to a genotype VCF (here chromosome 22), annotating against dbSNP and the GRCh38 reference. --skip_vcf_header_filtering True is used for the toy VCFs.
sos run pipeline/VCF_QC.ipynb qc \
--genoFile input/genotype/protocol_example.genotype.chr22.vcf.gz \
--dbsnp-variants input/reference_data/00-All.add_chr.variants.gz \
--reference-genome input/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
--cwd output/vcf_qc \
--skip_vcf_header_filtering True
Convert VCF to PLINK and merge#
Convert each per-chromosome VCF to PLINK format, then merge the per-chromosome PLINK files into a single genome-wide dataset.
sos run pipeline/genotype_formatting.ipynb vcf_to_plink \
--genoFile `ls input/genotype/protocol_example.genotype.chr*.vcf.gz | grep -vE "rawchr|withfmt|add_chr"` \
--cwd output/genotype_formatting/plink \
-j 4
sos run pipeline/genotype_formatting.ipynb merge_plink \
--genoFile `ls output/genotype_formatting/plink/protocol_example.genotype.chr*.bed` \
--name protocol_example.genotype.merged \
--cwd output/genotype_formatting/plink \
-j 2
Quality control on PLINK files#
Apply genotype-, sample-, and HWE-level filters to the merged PLINK dataset.
sos run pipeline/GWAS_QC.ipynb qc_no_prune \
--cwd output/gwas_qc/plink \
--genoFile output/genotype_formatting/plink/protocol_example.genotype.merged.bed \
--geno-filter 0.1 \
--mind-filter 0.1 \
--hwe-filter 1e-08 \
--mac-filter 0
Partition genotype data by chromosome#
Split the QC-passed PLINK dataset back into per-chromosome files, which several downstream xQTL steps expect.
sos run pipeline/genotype_formatting.ipynb genotype_by_chrom \
--genoFile output/gwas_qc/plink/protocol_example.genotype.merged.plink_qc.bed \
--cwd output/genotype_by_chrom \
--chrom `cut -f 1 output/gwas_qc/plink/protocol_example.genotype.merged.plink_qc.bim | uniq | sed "s/chr//g"` \
-j 4
Match samples with the molecular phenotype#
Identify the samples that have both genotype and molecular phenotype data, producing the overlap sample list.
sos run pipeline/GWAS_QC.ipynb genotype_phenotype_sample_overlap \
--cwd output/gwas_qc/genotype \
--genoFile output/gwas_qc/plink/protocol_example.genotype.merged.plink_qc.fam \
--phenoFile input/rnaseq/protocol_example.rnaseq.bed.gz
Estimate kinship#
Run KING to estimate relatedness among the matched samples and flag related pairs.
sos run pipeline/GWAS_QC.ipynb king \
--cwd output/gwas_qc/kinship \
--genoFile output/gwas_qc/plink/protocol_example.genotype.merged.plink_qc.bed \
--name protocol_example.king \
--keep-samples output/gwas_qc/genotype/protocol_example.rnaseq.bed.sample_genotypes.txt
Principal component analysis on genotype#
Run flashPCA on the unrelated, pruned dataset to compute genetic principal components for use as covariates.
Anticipated Results#
The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.
sos run pipeline/PCA.ipynb flashpca \
--cwd output/pca_uf \
--genoFile output/gwas_qc/genotype/protocol_example.genotype.merged.plink_qc.protocol_example.king.unrelated.plink_qc.prune.bed \
--name protocol_example
Compute a genotype PCA directly on the basic QC-passed merged dataset (MAC-filtered and LD-pruned) and write results to output/genotype/genotype_pca/. This .rds model is used by the covariate formatting step.
sos run pipeline/GWAS_QC.ipynb qc \
--cwd output/genotype/genotype_pca \
--genoFile output/gwas_qc/plink/protocol_example.genotype.merged.plink_qc.bed \
--mac-filter 5
sos run pipeline/PCA.ipynb flashpca \
--cwd output/genotype/genotype_pca \
--genoFile output/genotype/genotype_pca/protocol_example.genotype.merged.plink_qc.plink_qc.prune.bed