Principal Component Analysis#
The intention of this notebook is to perform the PCA analysis on genotype data and generate plots.
Description#
This notebook performs PCA on genotype data of unrelated individuals and projects the remaining (related) samples back into that PCA space. The main steps are: remove related individuals, LD-prune the variants, run PCA on the unrelated set, and exclude PCA-space outliers.
Relatedness estimation and sample QC are done upstream in GWAS_QC.ipynb.
Input Files#
File |
Description |
|---|---|
|
LD-pruned PLINK bundle for unrelated individuals (from GWAS_QC). PCA is computed on these samples. |
|
PLINK bundle for related individuals (from GWAS_QC), projected back onto the PCA model. |
|
Phenotype/label table with an |
Genotypes must be in PLINK format (common variants, LD pruned) and split into related and unrelated sets. The phenotype file may also carry population/disease columns for labelling the PCA plots.
Output Files#
File |
Description |
|---|---|
|
PCA model + PC scores for the unrelated samples (RDS). |
|
PC scores table for the unrelated samples. |
|
Scree plot of variance explained. |
|
Mahalanobis distances per population and the list of PC outliers. |
|
Projected PC scores for the related individuals. |
|
PCA plot including the projected related samples. |
Minimal Working Example#
The data can be found on Synapse.
Note: parameters set for the MWE are meant to let the MWE work to show the workflow procedures. They may be unrealistic and should not be used in practice. The pipeline has reasonable default values for what we suggest to use in practice for most of the parameters.
Step 1. Estimate kinship in the sample (prerequisite)#
Before PCA we need to know which individuals are related, because PCA must be computed on an unrelated subset and the related individuals are projected back afterwards. This step runs the king workflow from GWAS_QC to estimate pairwise kinship and split the samples into unrelated and related sets. It is an upstream prerequisite — its outputs (the king.unrelated and king.related PLINK bundles) feed the PCA steps below.
sos run pipeline/GWAS_QC.ipynb king \
--cwd output/gwas_qc/kinship \
--genoFile output/gwas_qc/plink/protocol_example.genotype.merged.plink_qc.bed \
--name protocol_example.king \
--keep-samples output/gwas_qc/genotype/protocol_example.rnaseq.bed.sample_genotypes.txt
Step 2. Sample selection and QC of the genotype data (prerequisite)#
QC the genotypes that will go into PCA: filter on MAF, sample/variant missingness and Hardy-Weinberg equilibrium, and LD-prune the variants. You can restrict to one population or to the related / unrelated split. Here we QC and LD-prune the unrelated individuals to build the PCA reference, and separately extract the same variants for the related individuals so they can be projected later. These qc / qc_no_prune calls come from GWAS_QC and are upstream prerequisites for the PCA steps.
Optional — Analysis by population (admixed / multi-ancestry data)#
The Steps 1–4 above produce a single PCA for the whole cohort, which is appropriate for a homogeneous population. If your cohort contains multiple ancestry groups, run the per-population workflow below instead: split the samples by population, then repeat the same QC → PCA → projection steps separately within each population.
The sub-steps below mirror Steps 1–4 exactly, just looped over each population (for i in race1 race3). Follow them in order:
Sub-step |
What it does |
|---|---|
P0. Split by population |
Write a per-population sample-ID list |
P1. QC unrelated |
Variant/sample QC within each population (unrelated samples) |
P2. Extract variants in related |
Keep the P1 pruned variants in the related samples |
P3. PCA on unrelated |
Run |
P4. Project related |
Project related samples onto each population’s PCA |
P5. Finalize |
Per-population QC to finalize genotypes |
# Build per-population sample keep-lists from the phenotype table.
# One file per ancestry group, named output/pca_uf/protocol_example.ID.<race>.txt
# Each file holds two columns (FID, IID) with no header, matching plink's --keep format.
setwd("/mnt/lustre/lab/gwang/home/rl3328/evo2/xqtl-protocol")
pheno <- read.table("input/covariate/protocol_example.pca_pheno.txt", header = TRUE, stringsAsFactors = FALSE)
if (all(pheno$FID == pheno$IID)) pheno$FID <- "0"
for (i in 1:3) {
race <- subset(pheno, race == i)
race_id <- cbind(race$FID, race$IID)
write.table(race_id,
paste0("output/pca_uf/protocol_example.ID.", i, ".txt"),
quote = FALSE, sep = "\t", col.names = FALSE, row.names = FALSE)
}
P5 — Finalize genotypes per population#
Same as the Finalize genotype QC by PCA step for a homogeneous population, applied per population and taking the detected outliers into account. See the GWAS_QC.ipynb documentation for the available QC options and recommendations. This is the end of the per-population workflow.
Command Interface#
sos run PCA.ipynb -h
[global]
parameter: modular_script_dir = path('code/script') # override with --modular-script-dir
# the output directory for generated files
parameter: cwd = path("output")
# A string to identify your analysis run
parameter: name = ""
# Name of the population column in the phenoFile
parameter: pop_col = ""
# Name of the populations (from the population column) you would like to plot and show on the PCA plot
parameter: pops = []
# Name of the color label column in the phenoFile; can be the same as population column. Can also be a separate column eg a "super population" column as a way to enable you to combine selected populations based on another column.
parameter: label_col = ""
# Number of Principal Components to output,must be consistant between flashpca run and project samples run (flashpca partial PCA method).
parameter: k = 20
# Number of Principal Components based on which outliers should be evaluated. Default is 5 but this should be based on examine the scree plot
parameter: maha_k = 5
# Homogeneity of populations. Set to --homogeneous when true and --no-homogeneous when false
parameter: homogeneous = False
# Software container option
parameter: container = ""
import re
parameter: entrypoint= ""
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 10
suffix = '_'.join(pops)
cwd = path(f"{cwd:a}")
if not pop_col:
homogeneous = True
# Determine if the file is in PLINK1 (BED/BIM/FAM) or PLINK2 (PGEN/PVAR/PSAM) format
def determine_plink_format(file_path):
"""
Determine the PLINK file format based on file extensions and companion files.
Args:
file_path (str or Path): Path to the input file
Returns:
str: 'plink1' or 'plink2'
"""
# Convert to string if it's a Path object
file_path = str(file_path)
# Check direct file extensions
if file_path.endswith('.bed'):
return 'plink1'
elif file_path.endswith('.pgen'):
return 'plink2'
# If the file doesn't have a standard extension, try to infer format
try:
# Remove the file extension if present
base_path = file_path.rsplit('.', 1)[0] if '.' in file_path else file_path
# Check for PLINK1 companion files
plink1_companion_files = [
f"{base_path}.bim",
f"{base_path}.fam"
]
# Check for PLINK2 companion files
plink2_companion_files = [
f"{base_path}.pvar",
f"{base_path}.psam"
]
# Check PLINK1 format
if all(os.path.exists(f) for f in plink1_companion_files):
return 'plink1'
# Check PLINK2 format
if all(os.path.exists(f) for f in plink2_companion_files):
return 'plink2'
except Exception as e:
print(f"Error determining PLINK format: {e}")
# Default to PLINK1 if can't determine
return 'plink1'
# Get the appropriate PLINK command prefix
def get_plink_command_prefix(file_path):
format_type = determine_plink_format(file_path)
if format_type == 'plink1':
return "--bfile"
else: # plink2
return "--pfile"
def normalize_phenoFile(pheno_path, out_dir):
"""Return a phenoFile path where FID=0 if the file has FID==IID in every data row.
Writes a normalized copy to out_dir when needed; otherwise returns the original path."""
import os, csv
pheno_path = str(pheno_path)
with open(pheno_path) as fh:
rows = [r for r in csv.reader(fh, delimiter='\t')]
if not rows:
return path(pheno_path)
# Detect header: first row is a header when its first cell is not a sample value
has_header = rows[0][0].upper() in ('FID', '#FID', 'IID', 'SAMPLE', 'SAMPLE_ID')
data_rows = rows[1:] if has_header else rows
if not data_rows or not all(len(r) >= 2 and r[0] == r[1] for r in data_rows):
return path(pheno_path)
# FID==IID everywhere — write a normalized copy to out_dir with FID=0
out_path = os.path.join(str(out_dir), os.path.basename(pheno_path))
with open(out_path, 'w', newline='') as fh:
w = csv.writer(fh, delimiter='\t')
if has_header:
w.writerow(rows[0])
for r in data_rows:
r[0] = '0'
w.writerow(r)
return path(out_path)
# PCA command with PLINK, as a sanity check
[pca_plink]
# PLINK binary file
parameter: genoFile = path
input: genoFile
plink_command = get_plink_command_prefix(_input)
output: f'{cwd}/{genoFile:bn}.pca.eigenvec'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container = container, expand = "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
plink2 ${plink_command} ${_input:n} --out ${_output:n} --pca ${k}
PCA analysis#
# Run PCA analysis using flashpca
[flashpca_1]
# Plink binary file
parameter: genoFile = path
# The phenotypic file
parameter: phenoFile = path(f'{genoFile}'.replace(".bed", ".fam").replace(".pgen", ".psam"))
# minimum population size to consider in the analysis
parameter: min_pop_size = 2
# How to standardize X before PCA
parameter: stand = "binom2"
## Input genoFile here is for unrelated samples
phenoFile = normalize_phenoFile(phenoFile, cwd)
input: genoFile, phenoFile
output: f'{cwd}/{phenoFile:bn}{("."+name) if name else ""}.{(suffix+".") if suffix != "" else ""}pca.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/data_preprocessing/genotype/PCA.R \
--step flashpca \
--cwd "${cwd}" \
--genoFile "${_input[0]}" \
--phenoFile "${_input[1]}" \
--output "${_output}" \
--stand "${stand}" \
--min-pop-size ${min_pop_size} \
--homogeneous "${homogeneous}" \
--pop-col "${pop_col}" \
--label-col "${label_col}" \
--pops "${",".join([str(x) for x in pops])}" \
--k ${k} \
--maha-k ${maha_k} \
--numThreads ${numThreads}
Plot PCA results#
# Plot PCA results.
# Can be used independently as "plot_pca" or combined with other workflow as eg "flashpca+plot_pca"
[plot_pca]
parameter: outlier_file = path()
parameter: plot_data = path
parameter: min_axis = ""
parameter: max_axis = ""
input: plot_data
output: f'{cwd}/{_input:bn}.pc.png',
f'{cwd}/{_input:bn}.scree.png',
f'{cwd}/{_input:bn}.scree.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = 1, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/data_preprocessing/genotype/PCA.R \
--step plot_pca \
--cwd "${cwd}" \
--plot-data "${_input}" \
--outlier-file "${outlier_file}" \
--min-axis "${min_axis}" \
--max-axis "${max_axis}" \
--pop-col "${pop_col}" \
--label-col "${label_col}" \
--pops "${",".join([str(x) for x in pops])}" \
--k ${k} \
--output-pc-plot "${_output[0]}" \
--output-scree-plot "${_output[1]}" \
--output-scree-text "${_output[2]}" \
--numThreads ${numThreads}
Detect outliers#
# Calculate Mahalanobis distance per population and report outliers
[detect_outliers]
# Set the probability to remove outliers eg 0.95 or 0.997
parameter: prob = 0.997
# Mahalanobis distance p-value cutoff
parameter: pval = 0.05
# Robust Mahalanobis to outliers
parameter: robust = True
parameter: pca_result = path
input: pca_result
output: distance=f'{_input:n}.mahalanobis.rds',
identified_outliers=f'{_input:n}.outliers',
analysis_summary=f'{_input:n}.analysis_summary.md',
qqplot_mahalanobis=f'{_input:n}.mahalanobis_qq.png',
hist_mahalanobis=f'{_input:n}.mahalanobis_hist.png'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = 1, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/data_preprocessing/genotype/PCA.R \
--step detect_outliers \
--cwd "${cwd}" \
--pca-result "${_input}" \
--prob ${prob} \
--pval ${pval} \
--robust ${"TRUE" if robust else "FALSE"} \
--pop-col "${pop_col}" \
--k ${k} \
--distance-output "${_output['distance']}" \
--identified-outliers-output "${_output['identified_outliers']}" \
--analysis-summary-output "${_output['analysis_summary']}" \
--qqplot-output "${_output['qqplot_mahalanobis']}" \
--hist-output "${_output['hist_mahalanobis']}" \
--numThreads ${numThreads}
Add plot and outlier detection to PCA steps#
Anticipated Results#
The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.
[flashpca_2, project_samples_2]
# Set the probability to remove outliers eg 0.95 or 0.997
parameter: prob = 0.997
# Robust Mahalanobis to outliers
parameter: robust = True
output: distance=f'{_input:n}.mahalanobis.rds',
identified_outliers=f'{_input:n}.outliers',
analysis_summary=f'{_input:n}.analysis_summary.md',
qqplot_mahalanobis=f'{_input:n}.mahalanobis_qq.png',
hist_mahalanobis=f'{_input:n}.mahalanobis_hist.png'
sos_run("detect_outliers", pca_result=_input, prob=prob, robust=robust)
[flashpca_3, project_samples_3]
input: output_from(1), output_from(2)['identified_outliers']
outliers = [x.strip() for x in open(_input[1]).readlines() if x.strip()]
output: f"{cwd}/{_input[0]:bn}.pc.png",
f"{cwd}/{_input[0]:bn}.scree.png"
sos_run("plot_pca", plot_data = _input[0], outlier_file = _input[1] if len(outliers) else path())