Genotype data preprocessing#
This mini-protocol outlines the workflow for processing genotype files, transitioning from VCF format to chromosome-specific PLINK files, running PLINK QC, preparing unrelated individuals for PCA, and conducting PCA.
Miniprotocol Timing#
This represents the total duration for all miniprotocol phases. While module-specific timings are provided separately on their respective pages, they are also included in this overall estimate.
Timing < X minutes
Overview#
This workflow is an application of the genotype related workflows from the xQTL project pipeline.
VCF_QC.ipynb
(step i): QC on VCF filesgenotype_formatting.ipynb
(step ii, iv): convert VCF to PLINK and merge the filesGWAS_QC.ipynb
(step iii, v-vii): QC on PLINK files, Kinship, and prepare unrelated individuals for PCAPCA.ipynb
(viii): Conduct PCAGRM.ipynb
(): Generates genomic relationship matrices (GRM) under the leave-one-chromosome-out (LOCO) theme
Steps#
i. QC for VCF files#
# We only do this for autosomal variants
echo ./ZOD14598_AD_GRM_WGS_2021-04-29_chr1.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr2.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr3.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr4.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr5.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr6.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr7.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr8.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr9.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr10.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr11.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr12.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr13.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr14.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr15.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr16.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr17.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr18.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr19.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr20.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr21.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr22.recalibrated_variants.vcf.gz \
| tr ' ' '\n' > /path/to/ZOD14598_AD_GRM_WGS_2021-04-29_vcf_files.txt
sos run pipeline/VCF_QC.ipynb qc \
--genoFile /path/to/ZOD14598_AD_GRM_WGS_2021-04-29_vcf_files.txt \
--dbsnp-variants /path/to/reference_data/00-All.add_chr.variants.gz \
--reference-genome /path/to/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
--cwd vcf_qc/ \
-J 22 --mem 120G
ii. Merge separated bed files into one#
Converting VCF to PLINK keeping only the ROSMAP samples.
sos run pipeline/genotype_formatting.ipynb vcf_to_plink \
--genoFile `ls data/WGS/vcf/wgs.chr*.random.vcf.gz` \
--cwd output/plink/
sos run pipeline/genotype_formatting.ipynb merge_plink \
--genoFile `ls output/plink/wgs.chr*.random.bed` \
--name wgs.merged \
--cwd output/plink/
iii. QC for PLINK files#
Using PLINK-based workflows we:
Filter out those have more than 10% missing
Set HWE cutoff as 1E-8
No minor allele filter
sos run pipeline/GWAS_QC.ipynb qc_no_prune \
--cwd output/plink \
--genoFile output/plink/wgs.merged.bed \
--geno-filter 0.1 \
--mind-filter 0.1 \
--hwe-filter 1e-08 \
--mac-filter 0
iv. Genotype data partition by chromosome#
This step is necessary for TensorQTL applications.
sos run pipeline/genotype_formatting.ipynb genotype_by_chrom \
--genoFile output/plink/wgs.merged.plink_qc.bed \
--cwd output/genotype_by_chrom \
--chrom `cut -f 1 output/plink/wgs.merged.plink_qc.bim | uniq | sed "s/chr//g"`
v. Sample match with genotype#
sos run pipeline/GWAS_QC.ipynb genotype_phenotype_sample_overlap \
--cwd output/genotype/ \
--genoFile output/plink/wgs.merged.plink_qc.fam \
--phenoFile output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.bed.gz
vi. Kinship#
To accuratly estimate the PCs for the genotype. We split participants based on their kinship coefficients, estimated by KING
sos run pipeline/GWAS_QC.ipynb king \
--cwd output/genotype/kinship \
--genoFile output/plink/wgs.merged.plink_qc.bed \
--name wgs.merged.king \
--keep-samples output/genotype/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.bed.sample_genotypes.txt
viii. PCA on genotype#
sos run pipeline/PCA.ipynb flashpca \
--cwd output/genotype/genotype_pca \
--genoFile output/genotype/wgs.merged.plink_qc.plink_qc.prune.bed
Anticipated Results#
Genotype preprocessing will produce cleaned genotype files, and genetic principal components.