Genotype data preprocessing#

This mini-protocol outlines the workflow for processing genotype files, transitioning from VCF format to chromosome-specific PLINK files, running PLINK QC, preparing unrelated individuals for PCA, and conducting PCA.

Miniprotocol Timing#

This represents the total duration for all miniprotocol phases. While module-specific timings are provided separately on their respective pages, they are also included in this overall estimate.

Timing < X minutes

Overview#

This workflow is an application of the genotype related workflows from the xQTL project pipeline.

  1. VCF_QC.ipynb (step i): QC on VCF files

  2. genotype_formatting.ipynb (step ii, iv): convert VCF to PLINK and merge the files

  3. GWAS_QC.ipynb (step iii, v-vii): QC on PLINK files, Kinship, and prepare unrelated individuals for PCA

  4. PCA.ipynb (viii): Conduct PCA

  5. GRM.ipynb (): Generates genomic relationship matrices (GRM) under the leave-one-chromosome-out (LOCO) theme

Steps#

i. QC for VCF files#

# We only do this for autosomal variants

echo ./ZOD14598_AD_GRM_WGS_2021-04-29_chr1.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr2.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr3.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr4.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr5.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr6.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr7.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr8.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr9.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr10.recalibrated_variants.vcf.gz  ./ZOD14598_AD_GRM_WGS_2021-04-29_chr11.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr12.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr13.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr14.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr15.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr16.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr17.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr18.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr19.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr20.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr21.recalibrated_variants.vcf.gz ./ZOD14598_AD_GRM_WGS_2021-04-29_chr22.recalibrated_variants.vcf.gz \
    | tr ' ' '\n' > /path/to/ZOD14598_AD_GRM_WGS_2021-04-29_vcf_files.txt

sos run pipeline/VCF_QC.ipynb qc \
    --genoFile /path/to/ZOD14598_AD_GRM_WGS_2021-04-29_vcf_files.txt \
    --dbsnp-variants /path/to/reference_data/00-All.add_chr.variants.gz \
    --reference-genome /path/to/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --cwd vcf_qc/ \
    -J 22 --mem 120G 

ii. Merge separated bed files into one#

Converting VCF to PLINK keeping only the ROSMAP samples.

sos run pipeline/genotype_formatting.ipynb vcf_to_plink \
    --genoFile `ls data/WGS/vcf/wgs.chr*.random.vcf.gz` \
    --cwd output/plink/ 
sos run pipeline/genotype_formatting.ipynb merge_plink \
    --genoFile `ls output/plink/wgs.chr*.random.bed` \
    --name wgs.merged \
    --cwd output/plink/ 

iv. Genotype data partition by chromosome#

This step is necessary for TensorQTL applications.

sos run pipeline/genotype_formatting.ipynb genotype_by_chrom \
    --genoFile output/plink/wgs.merged.plink_qc.bed \
    --cwd output/genotype_by_chrom \
    --chrom `cut -f 1 output/plink/wgs.merged.plink_qc.bim | uniq | sed "s/chr//g"`

v. Sample match with genotype#

sos run pipeline/GWAS_QC.ipynb genotype_phenotype_sample_overlap \
        --cwd output/genotype/ \
        --genoFile output/plink/wgs.merged.plink_qc.fam  \
        --phenoFile output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.bed.gz

vi. Kinship#

To accuratly estimate the PCs for the genotype. We split participants based on their kinship coefficients, estimated by KING

sos run pipeline/GWAS_QC.ipynb king \
    --cwd output/genotype/kinship \
    --genoFile output/plink/wgs.merged.plink_qc.bed \
    --name wgs.merged.king \
    --keep-samples output/genotype/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.bed.sample_genotypes.txt

vii. Prepare unrelated individuals data for PCA#

sos run pipeline/GWAS_QC.ipynb qc \
   --cwd output/genotype/ \
   --genoFile output/genotype/kinship/wgs.merged.plink_qc.wgs.merged.king.unrelated.bed \
   --mac-filter 5 

If No related individuals detected from *.kin0 occurs, there is no separate genotype data generated for unrelated individuals. In this case, we need to work from the original genotype data and must use --keep-samples to run qc to extract samples for PCA. For example:

sos run pipeline/GWAS_QC.ipynb qc \
   --cwd output/genotype/ \
   --genoFile output/plink/wgs.merged.plink_qc.bed \
   --mac-filter 5 

viii. PCA on genotype#

sos run pipeline/PCA.ipynb flashpca \
   --cwd output/genotype/genotype_pca \
   --genoFile output/genotype/wgs.merged.plink_qc.plink_qc.prune.bed

Anticipated Results#

Genotype preprocessing will produce cleaned genotype files, and genetic principal components.