Principal Component Analysis

Principal Component Analysis#

The intention of this notebook is to perform the PCA analysis on genotype data and generate plots.

Description#

This notebook performs PCA on genotype data of unrelated individuals and projects the remaining (related) samples back into that PCA space. The main steps are: remove related individuals, LD-prune the variants, run PCA on the unrelated set, and exclude PCA-space outliers.

Relatedness estimation and sample QC are done upstream in GWAS_QC.ipynb.

Input Files#

File	Description
`output/gwas_qc/genotype/protocol_example.genotype.merged.plink_qc.protocol_example.king.unrelated.plink_qc.prune.bed`	LD-pruned PLINK bundle for unrelated individuals (from GWAS_QC). PCA is computed on these samples.
`output/gwas_qc/kinship/protocol_example.genotype.merged.plink_qc.protocol_example.king.related.bed`	PLINK bundle for related individuals (from GWAS_QC), projected back onto the PCA model.
`input/covariate/protocol_example.pca_pheno.txt`	Phenotype/label table with an `IID` column (plus optional `FID`) used to label/colour the projected samples. A PLINK `.fam` file may be used instead.

Genotypes must be in PLINK format (common variants, LD pruned) and split into related and unrelated sets. The phenotype file may also carry population/disease columns for labelling the PCA plots.

Output Files#

File	Description
`output/pca_uf/*.pca.rds`	PCA model + PC scores for the unrelated samples (RDS).
`output/pca_uf/*.pca.txt`	PC scores table for the unrelated samples.
`output/pca_uf/*.pca.scree.png`	Scree plot of variance explained.
`output/pca_uf/.pca.mahalanobis.`, `*.pca.outliers`	Mahalanobis distances per population and the list of PC outliers.
`output/pca_uf/.projected.rds` / `.projected.txt`	Projected PC scores for the related individuals.
`output/pca_uf/*.projected.pc.png`	PCA plot including the projected related samples.

Minimal Working Example#

The data can be found on Synapse.

Note: parameters set for the MWE are meant to let the MWE work to show the workflow procedures. They may be unrealistic and should not be used in practice. The pipeline has reasonable default values for what we suggest to use in practice for most of the parameters.

Step 1. Estimate kinship in the sample (prerequisite)#

Before PCA we need to know which individuals are related, because PCA must be computed on an unrelated subset and the related individuals are projected back afterwards. This step runs the king workflow from GWAS_QC to estimate pairwise kinship and split the samples into unrelated and related sets. It is an upstream prerequisite — its outputs (the king.unrelated and king.related PLINK bundles) feed the PCA steps below.

sos run pipeline/GWAS_QC.ipynb king \
    --cwd output/gwas_qc/kinship \
    --genoFile output/gwas_qc/plink/protocol_example.genotype.merged.plink_qc.bed \
    --name protocol_example.king \
    --keep-samples output/gwas_qc/genotype/protocol_example.rnaseq.bed.sample_genotypes.txt

Step 2. Sample selection and QC of the genotype data (prerequisite)#

QC the genotypes that will go into PCA: filter on MAF, sample/variant missingness and Hardy-Weinberg equilibrium, and LD-prune the variants. You can restrict to one population or to the related / unrelated split. Here we QC and LD-prune the unrelated individuals to build the PCA reference, and separately extract the same variants for the related individuals so they can be projected later. These qc / qc_no_prune calls come from GWAS_QC and are upstream prerequisites for the PCA steps.

Variant level and sample level QC on unrelated individuals, in preparation for PCA analysis#

sos run pipeline/GWAS_QC.ipynb qc \
    --cwd output/gwas_qc/genotype \
    --genoFile output/gwas_qc/kinship/protocol_example.genotype.merged.plink_qc.protocol_example.king.unrelated.bed \
    --mac-filter 5

Extract the previously selected (LD-pruned) variants from the related individuals, applying only a sample-level missingness filter so the related samples carry exactly the same variant set as the unrelated reference.

sos run pipeline/GWAS_QC.ipynb qc_no_prune \
    --cwd output/pca_related \
    --genoFile output/gwas_qc/kinship/protocol_example.genotype.merged.plink_qc.protocol_example.king.related.bed \
    --geno-filter 0 --mind-filter 0.1 --maf-filter 0 \
    --keep-variants output/gwas_qc/genotype/protocol_example.genotype.merged.plink_qc.protocol_example.king.unrelated.plink_qc.prune.in \
    --name for_pca

Step 3. Run PCA on the unrelated samples (flashpca)#

This is the core PCA step. It runs flashpca on the LD-pruned unrelated PLINK bundle to compute the principal components, then automatically computes Mahalanobis distances and flags PC outliers. Outputs: the fitted PCA model (.pca.rds), the PC score table (.pca.txt), a scree plot of variance explained, and the outlier list. Timing: <2 min.

The --name argument tags all output files; point --genoFile at the LD-pruned unrelated bed produced upstream.

sos run pipeline/PCA.ipynb flashpca \
    --cwd output/pca_uf \
    --genoFile output/gwas_qc/genotype/protocol_example.genotype.merged.plink_qc.protocol_example.king.unrelated.plink_qc.prune.bed \
    --name protocol_example

What you should see: the workflow runs flashpca_1 (PCA), then detect_outliers (Mahalanobis distance per population) and plot_pca, finishing with “executed successfully with 5 completed steps”. The PCA model and PC scores are written to the --cwd directory. The scree plot below shows the proportion of variance explained by each PC — use it to decide how many PCs to keep.

%preview output/pca_uf/protocol_example.genotype.merged.plink_qc.protocol_example.king.unrelated.plink_qc.prune.protocol_example.pca.scree.png

Step 4. Project the related individuals onto the PCA model#

The related individuals were held out of the PCA, so we now project them onto the model fitted from the unrelated samples. First reduce the related genotypes to exactly the same pruned variant set used to build the model (the plink2 --extract cell below) — otherwise the projection errors with “Input number of variants should be the same as used in the previous PCA model”. Then run project_samples, passing a phenotype table (must contain an IID column) and, if available, a population/ethnicity column via --label-col / --pop-col (here the race column) so the projected samples can be coloured and the outlier detection can run.

sos run pipeline/PCA.ipynb project_samples \
    --cwd output/pca_uf \
    --genoFile output/pca_related/protocol_example.genotype.merged.plink_qc.protocol_example.king.related.for_pca.plink_qc.extracted.bed \
    --phenoFile input/covariate/protocol_example.pca_pheno.txt \
    --pca-model output/pca_uf/protocol_example.genotype.merged.plink_qc.protocol_example.king.unrelated.plink_qc.prune.protocol_example.pca.rds \
    --label-col race \
    --pop-col race \
    --maha-k 2

%preview output/pca_uf/protocol_example.pca_pheno.pca.projected.pc.png

sos run pipeline/GWAS_QC.ipynb qc_no_prune \
    --cwd output/pca_related \
    --genoFile output/pca_related/protocol_example.genotype.merged.plink_qc.protocol_example.king.related.for_pca.plink_qc.extracted.bed \
    --remove-samples output/pca_uf/protocol_example.pca_pheno.pca.projected.outliers \
    --name no_outlier

Re-run the QC on the related individuals after dropping any PCA-detected outliers (passed via --remove-samples), keeping the same variant set as the unrelated samples. The output PLINK bundle protocol_example.related.for_pca.no_outlier.plink_qc.* is the outlier-free, QC’d genotype set for the related individuals.

Note: for this clean toy dataset no outliers are detected, so the removal list is empty and this step simply re-emits the QC’d related set.

Finally, merge back related and unrelated individuals as finalized genotype in PLINK format,

sos run pipeline/genotype_formatting.ipynb merge_plink \
    --genoFile output/gwas_qc/genotype/protocol_example.genotype.merged.plink_qc.protocol_example.king.unrelated.plink_qc.prune.bed \
               output/pca_related/protocol_example.genotype.merged.plink_qc.protocol_example.king.related.for_pca.plink_qc.extracted.no_outlier.plink_qc.bed \
    --cwd output/genotype_final \
    --name protocol_example.qced

You can stop here if you’re analyzing a homogenous population.

Optional — Analysis by population (admixed / multi-ancestry data)#

The Steps 1–4 above produce a single PCA for the whole cohort, which is appropriate for a homogeneous population. If your cohort contains multiple ancestry groups, run the per-population workflow below instead: split the samples by population, then repeat the same QC → PCA → projection steps separately within each population.

The sub-steps below mirror Steps 1–4 exactly, just looped over each population (for i in race1 race3). Follow them in order:

Sub-step	What it does
P0. Split by population	Write a per-population sample-ID list
P1. QC unrelated	Variant/sample QC within each population (unrelated samples)
P2. Extract variants in related	Keep the P1 pruned variants in the related samples
P3. PCA on unrelated	Run `flashpca` per population
P4. Project related	Project related samples onto each population’s PCA
P5. Finalize	Per-population QC to finalize genotypes

# Build per-population sample keep-lists from the phenotype table.
# One file per ancestry group, named output/pca_uf/protocol_example.ID.<race>.txt
# Each file holds two columns (FID, IID) with no header, matching plink's --keep format.
setwd("/mnt/lustre/lab/gwang/home/rl3328/evo2/xqtl-protocol")
pheno <- read.table("input/covariate/protocol_example.pca_pheno.txt", header = TRUE, stringsAsFactors = FALSE)

if (all(pheno$FID == pheno$IID)) pheno$FID <- "0"

for (i in 1:3) {
    race    <- subset(pheno, race == i)
    race_id <- cbind(race$FID, race$IID)
    write.table(race_id,
                paste0("output/pca_uf/protocol_example.ID.", i, ".txt"),
                quote = FALSE, sep = "\t", col.names = FALSE, row.names = FALSE)
}

P1 — QC on unrelated samples, per population#

Same as Step 2 (QC), but looped over each population. Below we only show Populations 1 and 3 as an example. Next: once each population has a QC’d unrelated set, extract those variants from the related samples (P2).

for i in 3; do
    sos run pipeline/GWAS_QC.ipynb qc \
        --cwd output/pca_uf \
        --genoFile output/gwas_qc/genotype/protocol_example.genotype.merged.plink_qc.protocol_example.king.unrelated.plink_qc.bed \
        --keep-samples output/pca_uf/protocol_example.ID.$i.txt \
        --mac-filter 5 \
        --bad-ld True \
        --name pop_$i
done

P3 — Run PCA (flashpca) on unrelated samples, per population#

Same as Step 3, looped per population. Here the number of PCs is set to 5 (--k 5) because each population’s sample size is small. Next: project the related samples onto each population’s PCA model (P4).

for i in 3; do
    sos run pipeline/PCA.ipynb flashpca \
        --name pop_$i \
        --cwd output/pca_uf \
        --genoFile output/pca_uf/protocol_example.genotype.merged.plink_qc.protocol_example.king.unrelated.plink_qc.pop_$i.plink_qc.prune.bed \
        --phenoFile input/covariate/protocol_example.pca_pheno.txt \
        --label-col race \
        --pop-col race \
        --maha-k 2 \
        --k 5
done

P5 — Finalize genotypes per population#

Same as the Finalize genotype QC by PCA step for a homogeneous population, applied per population and taking the detected outliers into account. See the GWAS_QC.ipynb documentation for the available QC options and recommendations. This is the end of the per-population workflow.

Command Interface#

sos run PCA.ipynb -h

[global]
parameter: modular_script_dir = path('code/script')  # override with --modular-script-dir
# the output directory for generated files
parameter: cwd = path("output")
# A string to identify your analysis run
parameter: name = ""
# Name of the population column in the phenoFile
parameter: pop_col = ""
# Name of the populations (from the population column) you would like to plot and show on the PCA plot
parameter: pops = []
# Name of the color label column in the phenoFile; can be the same as population column. Can also be a separate column eg a "super population" column as a way to enable you to combine selected populations based on another column.
parameter: label_col = ""
# Number of Principal Components to output,must be consistant between flashpca run and project samples run (flashpca partial PCA method).
parameter: k = 20
# Number of Principal Components based on which outliers should be evaluated. Default is 5 but this should be based on examine the scree plot
parameter: maha_k = 5
# Homogeneity of populations. Set to --homogeneous when true and --no-homogeneous when false
parameter: homogeneous = False
# Software container option
parameter: container = ""
import re
parameter: entrypoint= ""
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 10
suffix = '_'.join(pops)
cwd = path(f"{cwd:a}")
if not pop_col:
    homogeneous = True

# Determine if the file is in PLINK1 (BED/BIM/FAM) or PLINK2 (PGEN/PVAR/PSAM) format
def determine_plink_format(file_path):
    """
    Determine the PLINK file format based on file extensions and companion files.
    
    Args:
        file_path (str or Path): Path to the input file
    
    Returns:
        str: 'plink1' or 'plink2'
    """
    # Convert to string if it's a Path object
    file_path = str(file_path)
    
    # Check direct file extensions
    if file_path.endswith('.bed'):
        return 'plink1'
    elif file_path.endswith('.pgen'):
        return 'plink2'
    
    # If the file doesn't have a standard extension, try to infer format
    try:
        # Remove the file extension if present
        base_path = file_path.rsplit('.', 1)[0] if '.' in file_path else file_path
        
        # Check for PLINK1 companion files
        plink1_companion_files = [
            f"{base_path}.bim",
            f"{base_path}.fam"
        ]
        
        # Check for PLINK2 companion files
        plink2_companion_files = [
            f"{base_path}.pvar",
            f"{base_path}.psam"
        ]
        
        # Check PLINK1 format
        if all(os.path.exists(f) for f in plink1_companion_files):
            return 'plink1'
        
        # Check PLINK2 format
        if all(os.path.exists(f) for f in plink2_companion_files):
            return 'plink2'
    
    except Exception as e:
        print(f"Error determining PLINK format: {e}")
    
    # Default to PLINK1 if can't determine
    return 'plink1'

# Get the appropriate PLINK command prefix
def get_plink_command_prefix(file_path):
    format_type = determine_plink_format(file_path)
    if format_type == 'plink1':
        return "--bfile"
    else:  # plink2
        return "--pfile"

def normalize_phenoFile(pheno_path, out_dir):
    """Return a phenoFile path where FID=0 if the file has FID==IID in every data row.
    Writes a normalized copy to out_dir when needed; otherwise returns the original path."""
    import os, csv
    pheno_path = str(pheno_path)
    with open(pheno_path) as fh:
        rows = [r for r in csv.reader(fh, delimiter='\t')]
    if not rows:
        return path(pheno_path)
    # Detect header: first row is a header when its first cell is not a sample value
    has_header = rows[0][0].upper() in ('FID', '#FID', 'IID', 'SAMPLE', 'SAMPLE_ID')
    data_rows = rows[1:] if has_header else rows
    if not data_rows or not all(len(r) >= 2 and r[0] == r[1] for r in data_rows):
        return path(pheno_path)
    # FID==IID everywhere — write a normalized copy to out_dir with FID=0
    out_path = os.path.join(str(out_dir), os.path.basename(pheno_path))
    with open(out_path, 'w', newline='') as fh:
        w = csv.writer(fh, delimiter='\t')
        if has_header:
            w.writerow(rows[0])
        for r in data_rows:
            r[0] = '0'
            w.writerow(r)
    return path(out_path)

# PCA command with PLINK, as a sanity check
[pca_plink]
# PLINK binary file
parameter: genoFile = path
input: genoFile
plink_command = get_plink_command_prefix(_input)
output: f'{cwd}/{genoFile:bn}.pca.eigenvec'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container = container, expand = "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    plink2 ${plink_command} ${_input:n} --out ${_output:n} --pca ${k}

PCA analysis#

# Run PCA analysis using flashpca 
[flashpca_1]
# Plink binary file
parameter: genoFile = path
# The phenotypic file
parameter: phenoFile = path(f'{genoFile}'.replace(".bed", ".fam").replace(".pgen", ".psam"))
# minimum population size to consider in the analysis
parameter: min_pop_size = 2
# How to standardize X before PCA
parameter: stand = "binom2"
## Input genoFile here is for unrelated samples
phenoFile = normalize_phenoFile(phenoFile, cwd)
input: genoFile, phenoFile
output: f'{cwd}/{phenoFile:bn}{("."+name) if name else ""}.{(suffix+".") if suffix != "" else ""}pca.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    Rscript ${modular_script_dir}/data_preprocessing/genotype/PCA.R \
        --step flashpca \
        --cwd "${cwd}" \
        --genoFile "${_input[0]}" \
        --phenoFile "${_input[1]}" \
        --output "${_output}" \
        --stand "${stand}" \
        --min-pop-size ${min_pop_size} \
        --homogeneous "${homogeneous}" \
        --pop-col "${pop_col}" \
        --label-col "${label_col}" \
        --pops "${",".join([str(x) for x in pops])}" \
        --k ${k} \
        --maha-k ${maha_k} \
        --numThreads ${numThreads}

Plot PCA results#

# Plot PCA results. 
# Can be used independently as "plot_pca" or combined with other workflow as eg "flashpca+plot_pca"
[plot_pca]
parameter: outlier_file = path()
parameter: plot_data = path
parameter: min_axis = ""
parameter: max_axis = ""
input: plot_data
output: f'{cwd}/{_input:bn}.pc.png',
        f'{cwd}/{_input:bn}.scree.png',
        f'{cwd}/{_input:bn}.scree.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = 1, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
    Rscript ${modular_script_dir}/data_preprocessing/genotype/PCA.R \
        --step plot_pca \
        --cwd "${cwd}" \
        --plot-data "${_input}" \
        --outlier-file "${outlier_file}" \
        --min-axis "${min_axis}" \
        --max-axis "${max_axis}" \
        --pop-col "${pop_col}" \
        --label-col "${label_col}" \
        --pops "${",".join([str(x) for x in pops])}" \
        --k ${k} \
        --output-pc-plot "${_output[0]}" \
        --output-scree-plot "${_output[1]}" \
        --output-scree-text "${_output[2]}" \
        --numThreads ${numThreads}

Detect outliers#

# Calculate Mahalanobis distance per population and report outliers
[detect_outliers]
# Set the probability to remove outliers eg 0.95 or 0.997
parameter: prob = 0.997
# Mahalanobis distance p-value cutoff
parameter: pval = 0.05
# Robust Mahalanobis to outliers
parameter: robust = True
parameter: pca_result = path
input: pca_result
output: distance=f'{_input:n}.mahalanobis.rds',
        identified_outliers=f'{_input:n}.outliers',
        analysis_summary=f'{_input:n}.analysis_summary.md',
        qqplot_mahalanobis=f'{_input:n}.mahalanobis_qq.png',
        hist_mahalanobis=f'{_input:n}.mahalanobis_hist.png'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = 1, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
    Rscript ${modular_script_dir}/data_preprocessing/genotype/PCA.R \
        --step detect_outliers \
        --cwd "${cwd}" \
        --pca-result "${_input}" \
        --prob ${prob} \
        --pval ${pval} \
        --robust ${"TRUE" if robust else "FALSE"} \
        --pop-col "${pop_col}" \
        --k ${k} \
        --distance-output "${_output['distance']}" \
        --identified-outliers-output "${_output['identified_outliers']}" \
        --analysis-summary-output "${_output['analysis_summary']}" \
        --qqplot-output "${_output['qqplot_mahalanobis']}" \
        --hist-output "${_output['hist_mahalanobis']}" \
        --numThreads ${numThreads}

Add plot and outlier detection to PCA steps#

Anticipated Results#

The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.

[flashpca_2, project_samples_2]
# Set the probability to remove outliers eg 0.95 or 0.997
parameter: prob = 0.997
# Robust Mahalanobis to outliers
parameter: robust = True
output: distance=f'{_input:n}.mahalanobis.rds',
        identified_outliers=f'{_input:n}.outliers',
        analysis_summary=f'{_input:n}.analysis_summary.md',
        qqplot_mahalanobis=f'{_input:n}.mahalanobis_qq.png',
        hist_mahalanobis=f'{_input:n}.mahalanobis_hist.png'
sos_run("detect_outliers", pca_result=_input, prob=prob, robust=robust)

[flashpca_3, project_samples_3]
input: output_from(1), output_from(2)['identified_outliers']
outliers = [x.strip() for x in open(_input[1]).readlines() if x.strip()]
output: f"{cwd}/{_input[0]:bn}.pc.png",
        f"{cwd}/{_input[0]:bn}.scree.png"
sos_run("plot_pca", plot_data = _input[0], outlier_file = _input[1] if len(outliers) else path())