Genotype PLINK File Quality Control#
This workflow implements some preliminary data QC steps for PLINK input files. It supports both PLINK1 binary format (BED/BIM/FAM) and PLINK2 format (PGEN/PVAR/PSAM). VCF format of inputs will be converted to PLINK before performing QC.
Description#
This notebook includes workflow for
Compute kinship matrix in sample and estimate related individuals
Genotype and sample QC: by MAF, missing data and HWE
LD pruning for follow up PCA analysis on genotype, as needed
A potential limitation is that the workflow requires all samples and chromosomes to be merged as one single file, in order to perform both sample and variant level QC. However, in our experience using this pipeline with 200K exomes with 15 million variants, this pipeline works on the single merged PLINK file.
Methods#
Depending on the context of your problem, the workflow can be executed in two ways:
Run
qc
command to perform genotype data QC and LD pruning to generate a subset of variants in preparation for analysis such as PCA.Run
king
first on either the original or a subset of common variants to identify unrelated individuals. Theking
pipeline will split samples to related and unrelated individuals. Then you performqc
on these individuals only and finally extract the same set of QC-ed variants for related individuals.
Default Parameters: QC#
Kinship coefficient for related individuals: 0.0625
MAF and MAC default: 0
Above default includes both common and are variant
Recommand MAF for PCA: 0.01, we should stick to common variants
Recommand MAC for single variant analysis: 5.
Variant level missingness threshold: 0.1
Sample level missingness threshold: 0.1
LD pruning via PLINK for PCA analysis:
window 50
shift 10
r2 0.1
HWE default: 1E-15 which is very lenient
Input#
The genotype files in either PLINK1 binary format (BED/BIM/FAM) or PLINK2 format (PGEN/PVAR/PSAM). For input in VCF format and/or per-chromosome VCF or PLINK format, please use vcf_to_plink
and merge_plink
in genotype formatting pipeline to convert them to PLINK file bundle.
Minimal Working Example#
Minimal working example data-set as well as the singularity container bioinfo.sif
can be downloaded from Synapse.
The chr1_chr6
data-set was merged from chr1
and chr6
data, using merge_plink
command from genotype formatting pipeline.
iii. Perform QC on both rare and common variants#
sos run xqtl-protocol/pipeline/GWAS_QC.ipynb qc_no_prune \
--cwd Genotype \
--genoFile Genotype/ROSMAP_NIA_WGS.leftnorm.bcftools_qc.bed \
--geno-filter 0.1 \
--mind-filter 0.1 \
--hwe-filter 1e-08 \
--mac-filter 0
v. Sample match with genotype#
Timing: <1 min
sos run pipeline/GWAS_QC.ipynb genotype_phenotype_sample_overlap \
--cwd output/sample_meta \
--genoFile input/protocol_example.genotype.chr21_22.fam \
--phenoFile input/protocol_example.protein.csv
INFO: Running genotype_phenotype_sample_overlap: This workflow extracts overlapping samples for genotype data with phenotype data, and output the filtered sample genotype list as well as sample phenotype list
INFO: genotype_phenotype_sample_overlap is completed.
INFO: genotype_phenotype_sample_overlap output: /Users/alexmccreight/xqtl-protocol/output/sample_meta/protocol_example.protein.sample_overlap.txt /Users/alexmccreight/xqtl-protocol/output/sample_meta/protocol_example.protein.sample_genotypes.txt
INFO: Workflow genotype_phenotype_sample_overlap (ID=w71b4e35979654867) is executed successfully with 1 completed step.
vi. Kinship QC#
To accuratly estimate the PCs for the genotype. We split participants based on their kinship coefficients, estimated by KING.
Variant level and sample level QC on unrelated individuals using missingness > 10%, and LD-prunning in preparation for PCA analysis.
There is no related samples in these ROSMAP samples, so there is an additional step to only keep those samples in
rosmap_pheno.sample_genotypes.txt
to do PCA.
Be aware:
If the message from king_2
shown as No related individuals detected from *.kin0
, this means no related individuals detected for the samples in --keep_samples
. In this case, there will be no output for related individuals from this step.
Timing: <2 min
sos run pipeline/GWAS_QC.ipynb king \
--cwd output/kinship \
--genoFile input/protocol_example.genotype.chr21_22.bed \
--name pQTL \
--keep-samples output/sample_meta/protocol_example.protein.sample_genotypes.txt
INFO: Running king_1: Inference of relationships in the sample to identify closely related individuals
INFO: king_1 is completed.
INFO: king_1 output: /Users/alexmccreight/xqtl-protocol/output/kinship/protocol_example.genotype.chr21_22.pQTL.kin0
INFO: Running king_2: Select a list of unrelated individual with an attempt to maximize the unrelated individuals selected from the data
INFO: king_2 is completed.
INFO: king_2 output: /Users/alexmccreight/xqtl-protocol/output/kinship/protocol_example.genotype.chr21_22.pQTL.related_id
INFO: Running king_3: Split genotype data into related and unrelated samples, if related individuals are detected
INFO: king_3 is completed.
INFO: king_3 output: /Users/alexmccreight/xqtl-protocol/output/kinship/protocol_example.genotype.chr21_22.pQTL.unrelated.bed /Users/alexmccreight/xqtl-protocol/output/kinship/protocol_example.genotype.chr21_22.pQTL.related.bed
INFO: Workflow king (ID=w7fad1d8b027ec781) is executed successfully with 3 completed steps.
Command Interface#
sos run GWAS_QC.ipynb -h
usage: sos run GWAS_QC.ipynb [workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
king
qc_no_prune
qc
genotype_phenotype_sample_overlap
Global Workflow Options:
--cwd output (as path)
the output directory for generated files
--name ''
A string to identify your analysis run
--genoFile paths
PLINK binary files
--remove-samples . (as path)
The path to the file that contains the list of samples
to remove (format FID, IID)
--keep-samples . (as path)
The path to the file that contains the list of samples
to keep (format FID, IID)
--keep-variants . (as path)
The path to the file that contains the list of variants
to keep
--exclude-variants . (as path)
The path to the file that contains the list of variants
to exclude
--kinship 0.0625 (as float)
Kinship coefficient threshold for related individuals
(e.g first degree above 0.25, second degree above 0.125,
third degree above 0.0625)
--job-size 1 (as int)
For cluster jobs, number commands to run per job
--walltime 5h
Wall clock time expected
--mem 16G
Memory expected
--numThreads 20 (as int)
Number of threads
--container ''
Software container option
--entrypoint ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
Sections
king_1: Inference of relationships in the sample to identify
closely related individuals
Workflow Options:
--kin-maf 0.01 (as float)
PLINK binary file
king_2: Select a list of unrelated individual with an attempt to
maximize the unrelated individuals selected from the
data
king_3: Split genotype data into related and unrelated samples,
if related individuals are detected
qc_no_prune, qc_1: Filter SNPs and select individuals
Workflow Options:
--maf-filter 0.0 (as float)
minimum MAF filter to use. 0 means do not apply this
filter.
--maf-max-filter 0.0 (as float)
maximum MAF filter to use. 0 means do not apply this
filter.
--mac-filter 0.0 (as float)
minimum MAC filter to use. 0 means do not apply this
filter.
--mac-max-filter 0.0 (as float)
maximum MAC filter to use. 0 means do not apply this
filter.
--geno-filter 0.1 (as float)
Maximum missingess per-variant
--mind-filter 0.1 (as float)
Maximum missingness per-sample
--hwe-filter 1e-15 (as float)
HWE filter -- a very lenient one
--other-args (as list)
Other PLINK arguments e.g snps_only, write-samples, etc
--[no-]meta-only (default to False)
Only output SNP and sample list, rather than the PLINK
binary format of subset data
--[no-]rm-dups (default to False)
Remove duplicate variants
qc_2: LD prunning and remove related individuals (both ind of
a pair) Plink2 has multi-threaded calculation for LD
prunning
Workflow Options:
--window 50 (as int)
Window size
--shift 10 (as int)
Shift window every 10 snps
--r2 0.1 (as float)
genotype_phenotype_sample_overlap: This workflow extracts overlapping samples
for genotype data with phenotype data, and output the
filtered sample genotype list as well as sample
phenotype list
Workflow Options:
--phenoFile VAL (as path, required)
A phenotype file, can be bed.gz or tsv
--sample-participant-lookup . (as path)
If this file is provided, a genotype/phenotype sample
name match will be performed It must contain two column
names: genotype_id, sample_id
[global]
# the output directory for generated files
parameter: cwd = path("output")
# A string to identify your analysis run
parameter: name = ""
# PLINK binary files (either BED/BIM/FAM or PGEN/PVAR/PSAM format)
parameter: genoFile = paths
# The path to the file that contains the list of samples to remove (format FID, IID)
parameter: remove_samples = path('.')
# The path to the file that contains the list of samples to keep (format FID, IID)
parameter: keep_samples = path('.')
# The path to the file that contains the list of variants to keep
parameter: keep_variants = path('.')
# The path to the file that contains the list of variants to exclude
parameter: exclude_variants = path('.')
# Kinship coefficient threshold for related individuals
# (e.g first degree above 0.25, second degree above 0.125, third degree above 0.0625)
parameter: kinship = 0.0625
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
# Software container option
parameter: container = ""
parameter: entrypoint= ""
# use this function to edit memory string for PLINK input
from sos.utils import expand_size
cwd = path(f"{cwd:a}")
# Determine if the file is in PLINK1 (BED/BIM/FAM) or PLINK2 (PGEN/PVAR/PSAM) format
def determine_plink_format(file_path):
"""
Determine the PLINK file format based on file extensions and companion files.
Args:
file_path (str or Path): Path to the input file
Returns:
str: 'plink1' or 'plink2'
"""
# Convert to string if it's a Path object
file_path = str(file_path)
# Check direct file extensions
if file_path.endswith('.bed'):
return 'plink1'
elif file_path.endswith('.pgen'):
return 'plink2'
# If the file doesn't have a standard extension, try to infer format
try:
# Remove the file extension if present
base_path = file_path.rsplit('.', 1)[0] if '.' in file_path else file_path
# Check for PLINK1 companion files
plink1_companion_files = [
f"{base_path}.bim",
f"{base_path}.fam"
]
# Check for PLINK2 companion files
plink2_companion_files = [
f"{base_path}.pvar",
f"{base_path}.psam"
]
# Check PLINK1 format
if all(os.path.exists(f) for f in plink1_companion_files):
return 'plink1'
# Check PLINK2 format
if all(os.path.exists(f) for f in plink2_companion_files):
return 'plink2'
except Exception as e:
print(f"Error determining PLINK format: {e}")
# Default to PLINK1 if can't determine
return 'plink1'
# Get the appropriate PLINK command based on the input file format
def get_plink_command_prefix(file_path):
format_type = determine_plink_format(file_path)
if format_type == 'plink1':
return "--bfile"
else: # plink2
return "--pfile"
# Generate the appropriate file extension based on the requested format
def get_output_extension(output_format, is_prune=False):
if output_format == 'plink1':
return '.bed' if not is_prune else '.prune.bed'
else: # plink2
return '.pgen' if not is_prune else '.prune.pgen'
# Choose the make-bed or make-pgen command based on desired output format
def get_make_command(output_format):
if output_format == 'plink1':
return '--make-bed'
else: # plink2
return '--make-pgen'
Estimate kinship in the sample#
The output is a list of related individuals, as well as the kinship matrix
# Inference of relationships in the sample to identify closely related individuals
[king_1]
# PLINK binary file
parameter: kin_maf = 0.01
input: genoFile
output: f'{cwd}/{_input:bn}{("."+name) if name else ""}.kin0'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
plink_command = get_plink_command_prefix(genoFile)
bash: container=container, expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', entrypoint=entrypoint
plink2 \
${plink_command} ${_input:n} \
--make-king-table \
--king-table-filter ${kinship} \
${('--keep %s' % keep_samples) if keep_samples.is_file() else ""} \
${('--remove %s' % remove_samples) if remove_samples.is_file() else ""} \
--min-af ${kin_maf} \
--max-af ${1-kin_maf} \
--out ${_output:n} \
--threads ${numThreads} \
--memory ${int(expand_size(mem) * 0.9)/1e6}
bash: expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container, entrypoint=entrypoint
i="${_output}"
output_size=$(ls -lh $i | cut -f 5 -d ' ')
output_rows=$(cat $i | wc -l | cut -f 1 -d ' ')
output_column=$(cat $i | head -1 | wc -w)
output_preview=$(cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6)
printf "output_info: %s\noutput_size: %s\noutput_rows: %s\noutput_column: %s\noutput_preview:\n%s\n" \
"$i" "$output_size" "$output_rows" "$output_column" "$output_preview"
# Select a list of unrelated individual with an attempt to maximize the unrelated individuals selected from the data
[king_2: shared = "related_id" ]
related_id = [x.strip() for x in open(_input).readlines() if not x.startswith("#")]
output: f'{_input:n}.related_id'
with open(_output, 'a'):
pass
done_if(len(related_id) == 0, msg = f"No related individuals detected from {_input}.")
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: container=container, expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', entrypoint=entrypoint
# By Rui Dong and Derek Lamb, Columbia Neurology
# This function enhances the standard plinkQC::relatednessFilter approach through:
#
# 1. Graph-based preprocessing:
# - For large relatedness networks, breaks down dense clusters of related individuals
# - Removes highly connected nodes (individuals with many relatives) first
# - Prevents memory issues when processing large components (>20 individuals)
# - Reduces the input size for the more intensive plinkQC filtering
#
# 2. Additional checks:
# - Verifies that all related pairs are properly handled
# - Runs up to 20 iterations to ensure complete filtering
# - Catches edge cases that might be missed in a single pass
#
# This implementation is particularly helpful for datasets with large family
# clusters or complex relatedness structures where the standard approach
# might struggle.
filter_relatedness <- function(
relatedness, # Path to the relatedness file (required)
relatednessTh = 0.0625, # Threshold for relatedness
analysis_type = "maximize_unrelated", # Type of analysis: "maximize_unrelated" or "maximize_cases"
output_prefix = "output_prefix", # Base name for output file
otherCriterion = NULL, # Optional additional criterion for filtering
otherCriterionTh = NULL, # Threshold for the additional criterion
otherCriterionThDirection = "ge", # Direction of the threshold
relatednessIID1 = "IID1", # Column name for first individual ID
relatednessIID2 = "IID2", # Column name for second individual ID
relatednessFID1 = NULL, # Column name for first family ID
relatednessFID2 = NULL, # Column name for second family ID
relatednessRelatedness = "PI_HAT", # Column name for relatedness value
pheno_file = NULL, # File for phenotype information
pheno_col = "pheno", # Column name for phenotype
otherCriterionIID = "IID", # Column name for additional criterion ID
otherCriterionMeasure = NULL, # Column name for additional criterion measure
verbose = FALSE # Whether to print verbose output
) {
# Load required packages
suppressMessages({
library(tidyverse)
library(igraph)
library(plinkQC)
library(data.table)
})
# Validate inputs
if (is.null(relatedness)) {
stop("Relatedness file path is required")
}
if (analysis_type == "maximize_cases" && is.null(pheno_file)) {
stop("Must provide phenotype file when analysis_type is 'maximize_cases'")
}
# Read the relatedness file
full_kin0 <- as.data.frame(relatedness)
# Initial pruning of highly related individuals
working_graph <- full_kin0 |>
filter(!!sym(relatednessRelatedness) >= relatednessTh) |>
select(!!sym(relatednessIID1), !!sym(relatednessIID2)) |>
graph_from_data_frame(directed = FALSE)
working_comp <- components(working_graph)
# Set parameters
graph_size_th = 20
reduce_fraction = 0.05
high_related_indiv <- c()
# Reduce components to max 20
while (max(working_comp$csize > graph_size_th)) {
if (verbose) {
print(paste0("The largest component graph has ", max(working_comp$csize),
" individuals. Removing the most connected ",
scales::percent(reduce_fraction), " of them."))
}
# Prepare for loop
large_comp_ids <- which(working_comp$csize > graph_size_th)
nodes_to_remove <- c()
for (comp_id in large_comp_ids) {
# Get nodes/degrees for component
comp_nodes <- V(working_graph)[working_comp$membership == comp_id]
comp_degrees <- degree(working_graph, v = comp_nodes)
# Remove highly related individuals
num_to_remove <- ceiling(length(comp_nodes) * reduce_fraction)
high_degree_nodes <- names(sort(comp_degrees, decreasing = TRUE))[1:num_to_remove]
# Append to removal list
nodes_to_remove <- c(nodes_to_remove, high_degree_nodes)
}
# Update for next loop
high_related_indiv <- c(high_related_indiv, nodes_to_remove)
working_graph <- delete_vertices(working_graph, nodes_to_remove)
working_comp <- components(working_graph)
}
kin0 <- full_kin0 |>
filter(!(!!sym(relatednessIID1) %in% high_related_indiv) &
!(!!sym(relatednessIID2) %in% high_related_indiv))
# Process based on analysis type
if (analysis_type == "maximize_unrelated") {
# Run the plinkQC relatedness Filter
rel <- plinkQC::relatednessFilter(
relatedness = kin0,
otherCriterion = otherCriterion,
relatednessTh = relatednessTh,
relatednessIID1 = relatednessIID1,
relatednessIID2 = relatednessIID2,
otherCriterionTh = otherCriterionTh,
otherCriterionThDirection = otherCriterionThDirection,
relatednessFID1 = relatednessFID1,
relatednessFID2 = relatednessFID2,
relatednessRelatedness = relatednessRelatedness,
otherCriterionIID = otherCriterionIID,
otherCriterionMeasure = otherCriterionMeasure,
verbose = verbose
)$failIDs
all_exclude <- rel$IID
} else if (analysis_type == "maximize_cases") {
# Load phenotype information
related_individuals <- unique(c(kin0 |> pull(relatednessIID1), kin0 |> pull(relatednessIID2)))
df_pheno <- fread(pheno_file) |>
drop_na(!!sym(pheno_col)) |>
filter(IID %in% related_individuals)
# Match kinship and phenotype information
related_pheno_individuals <- df_pheno |>
pull(IID)
related_cases <- df_pheno |>
filter(!!sym(pheno_col) == 1) |>
pull(IID)
related_controls <- df_pheno |>
filter(!!sym(pheno_col) == 0) |>
pull(IID)
kin0 <- kin0 |>
filter(!!sym(relatednessIID1) %in% related_pheno_individuals &
!!sym(relatednessIID2) %in% related_pheno_individuals)
# Optimize cases
df_case_kin <- kin0 |>
filter(!!sym(relatednessIID1) %in% related_cases &
!!sym(relatednessIID2) %in% related_cases)
rel_cases <- plinkQC::relatednessFilter(
relatedness = df_case_kin,
otherCriterion = otherCriterion,
relatednessTh = relatednessTh,
relatednessIID1 = relatednessIID1,
relatednessIID2 = relatednessIID2,
otherCriterionTh = otherCriterionTh,
otherCriterionThDirection = otherCriterionThDirection,
relatednessFID1 = relatednessFID1,
relatednessFID2 = relatednessFID2,
relatednessRelatedness = relatednessRelatedness,
otherCriterionIID = otherCriterionIID,
otherCriterionMeasure = otherCriterionMeasure,
verbose = verbose
)$failIDs
# Remove controls related to cases
related_cases_keep <- setdiff(related_cases, rel_cases$IID)
related_controls_exclude <- c()
for (i in 1:nrow(kin0)) {
iid1 <- kin0 |> pull(!!sym(relatednessIID1)) |> nth(i)
iid2 <- kin0 |> pull(!!sym(relatednessIID2)) |> nth(i)
if (iid1 %in% related_cases_keep & iid2 %in% related_controls) {
related_controls_exclude <- c(related_controls_exclude, iid2)
} else if(iid2 %in% related_cases_keep & iid1 %in% related_controls) {
related_controls_exclude <- c(related_controls_exclude, iid1)
}
}
# Subset controls to only unrelated controls
related_controls_keep <- setdiff(related_controls, related_controls_exclude)
df_control_kin <- kin0 |>
filter(!!sym(relatednessIID1) %in% related_controls_keep &
!!sym(relatednessIID2) %in% related_controls_keep)
rel_controls <- plinkQC::relatednessFilter(
relatedness = df_control_kin,
otherCriterion = otherCriterion,
relatednessTh = relatednessTh,
relatednessIID1 = relatednessIID1,
relatednessIID2 = relatednessIID2,
otherCriterionTh = otherCriterionTh,
otherCriterionThDirection = otherCriterionThDirection,
relatednessFID1 = relatednessFID1,
relatednessFID2 = relatednessFID2,
relatednessRelatedness = relatednessRelatedness,
otherCriterionIID = otherCriterionIID,
otherCriterionMeasure = otherCriterionMeasure,
verbose = verbose
)$failIDs
# Handle leftover individuals
all_exclude <- c(rel_cases$IID, related_controls_exclude, rel_controls$IID)
} else {
# Directly filter on kinship threshold
rel <- kin0 %>% dplyr::filter(!!sym(relatednessRelatedness) >= relatednessTh)
all_exclude <- unique(c(rel |> pull(relatednessIID1), rel |> pull(relatednessIID2)))
}
# Check for leftovers
df_related <- kin0 |>
filter(!(!!sym(relatednessIID1) %in% all_exclude) &
!(!!sym(relatednessIID2) %in% all_exclude)) |>
filter(!!sym(relatednessRelatedness) > relatednessTh)
print("df_related")
print(df_related)
if (nrow(df_related) > 0) {
if (verbose) {
print(paste0("Warning: After first plinkQC pass, there are still ",
nrow(df_related), " related subjects in dataset."))
}
# Iterate until all subjects are removed
iter <- 0
while (nrow(df_related) > 0 & iter < 20) {
# Rerun plinkQC
additional_exclude <- plinkQC::relatednessFilter(
relatedness = df_related,
otherCriterion = otherCriterion,
relatednessTh = relatednessTh,
relatednessIID1 = relatednessIID1,
relatednessIID2 = relatednessIID2,
otherCriterionTh = otherCriterionTh,
otherCriterionThDirection = otherCriterionThDirection,
relatednessFID1 = relatednessFID1,
relatednessFID2 = relatednessFID2,
relatednessRelatedness = relatednessRelatedness,
otherCriterionIID = otherCriterionIID,
otherCriterionMeasure = otherCriterionMeasure,
verbose = verbose
)$failIDs
# Add to exclude list
all_exclude <- c(all_exclude, additional_exclude$IID)
# Update df_related and iter
df_related <- kin0 |>
filter(!(!!sym(relatednessIID1) %in% all_exclude) &
!(!!sym(relatednessIID2) %in% all_exclude)) |>
filter(!!sym(relatednessRelatedness) > relatednessTh)
iter <- iter + 1
}
# Print status
if (verbose) {
if (nrow(df_related) == 0) {
print(paste0("All related subjects successfully removed after ", iter, " iterations."))
} else if (nrow(df_related) > 0) {
stop(paste0("Error: After 20 plinkQC passes, there are still ",
nrow(df_related), " subjects in dataset."))
}
}
}
# Add in high-related individuals
all_exclude <- c(all_exclude, high_related_indiv)
# List of all excluded subjects
dat <- data.frame(IID = all_exclude, FID = as.character(all_exclude)) # Because IID and FID are the same
# Output the related ids
output_related <- paste0(output_prefix, "_kinship_cutoff_", relatednessTh, ".related_id")
write.table(dat, output_related, quote = FALSE, row.names = FALSE, col.names = FALSE)
if (verbose) {
cat("\n[INFO] There are", nrow(dat), "excluded individuals identified using a kinship threshold of", relatednessTh, "\n")
cat("[OUTPUT] Excluded individual IDs have been written to the file:\n")
cat(" ", normalizePath(output_related), "\n")
}
# Return the excluded IDs
return(dat)
}
# main code
suppressMessages({
library(tidyverse)
library(data.table)
})
# Read input and set column names to match your original code
kin0 <- read.table(${_input:r}, header=FALSE, stringsAsFactors=FALSE)
colnames(kin0) <- c("FID1","ID1","FID2","ID2","NSNP","HETHET","IBS0","KINSHIP")
# Always use the filter_relatedness function with maximize_unrelated
result <- filter_relatedness(
relatedness = kin0,
relatednessTh = ${kinship},
analysis_type = "maximize_unrelated",
relatednessIID1 = "ID1",
relatednessIID2 = "ID2",
relatednessFID1 = "FID1",
relatednessFID2 = "FID2",
relatednessRelatedness = "KINSHIP",
verbose = FALSE
)
# Format output to match your original code
tmp1 <- kin0[,1:2]
tmp2 <- kin0[,3:4]
colnames(tmp1) = colnames(tmp2) = c("FID", "ID")
lookup <- dplyr::distinct(rbind(tmp1,tmp2))
dat <- lookup[which(lookup[,2] %in% result$IID),]
# Write output in the exact format of the original script
cat("There are", nrow(dat), "related individuals using a kinship threshold of ${kinship}\n")
write.table(dat, ${_output:r}, quote=FALSE, row.names=FALSE, col.names=FALSE)
bash: expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container, entrypoint=entrypoint
i="${_output}"
output_size=$(ls -lh $i | cut -f 5 -d ' ')
output_rows=$(cat $i | wc -l | cut -f 1 -d ' ')
output_column=$(cat $i | head -1 | wc -w)
output_preview=$(cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6)
printf "output_info: %s\noutput_size: %s\noutput_rows: %s\noutput_column: %s\noutput_preview:\n%s\n" \
"$i" "$output_size" "$output_rows" "$output_column" "$output_preview"
# Split genotype data into related and unrelated samples, if related individuals are detected
[king_3]
depends: sos_variable("related_id")
input: output_from(2), genoFile
plink_command = get_plink_command_prefix(_input[1])
output_format = determine_plink_format(_input[1])
make_command = get_make_command(output_format)
output_ext = get_output_extension(output_format)
output: unrelated_bed = f'{cwd}/{_input[0]:bn}.unrelated{output_ext}',
related_bed = f'{cwd}/{_input[0]:bn}.related{output_ext}'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
plink2 \
${plink_command} ${_input[1]:n} \
--remove ${_input[0]} \
${('--keep %s' % keep_samples) if keep_samples.is_file() else ""} \
${make_command} \
--out ${_output[0]:n} \
--threads ${numThreads} \
--memory ${int(expand_size(mem) * 0.9)/1e6} --new-id-max-allele-len 1000 --set-all-var-ids chr@:#_\$r_\$a
if [ ${len(related_id)} -ne 0 ] ; then
plink2 \
${plink_command} ${_input[1]:n} \
--keep ${_input[0]} \
${make_command} \
--out ${_output[1]:n} \
--threads ${numThreads} \
--memory ${int(expand_size(mem) * 0.9)/1e6} --new-id-max-allele-len 1000 --set-all-var-ids chr@:#_\$r_\$a
else
touch ${_output[1]}
fi
Genotype and sample QC#
QC the genetic data based on MAF, sample and variant missigness and Hardy-Weinberg Equilibrium (HWE).
In this step you may also provide a list of samples to keep, for example in the case when you would like to subset a sample based on their ancestries to perform independent analyses on each of these groups.
The default parameters are set to reflect some suggestions in Table 1 of this paper.
# Filter SNPs and select individuals
[qc_no_prune, qc_1 (basic QC filters)]
# minimum MAF filter to use. 0 means do not apply this filter.
parameter: maf_filter = 0.0
# maximum MAF filter to use. 0 means do not apply this filter.
parameter: maf_max_filter = 0.0
# minimum MAC filter to use. 0 means do not apply this filter.
parameter: mac_filter = 0.0
# maximum MAC filter to use. 0 means do not apply this filter.
parameter: mac_max_filter = 0.0
# Maximum missingess per-variant
parameter: geno_filter = 0.1
# Maximum missingness per-sample
parameter: mind_filter = 0.1
# HWE filter -- a very lenient one
parameter: hwe_filter = 1e-15
# Other PLINK arguments e.g snps_only, write-samples, etc
parameter: other_args = []
# Only output SNP and sample list, rather than the PLINK binary format of subset data
parameter: meta_only = False
# Remove duplicate variants
parameter: rm_dups = False
# Add option to process dosage
parameter: treat_dosage_missing = False
fail_if(not (keep_samples.is_file() or keep_samples == path('.')), msg = f'Cannot find ``{keep_samples}``')
fail_if(not (keep_variants.is_file() or keep_variants == path('.')), msg = f'Cannot find ``{keep_variants}``')
fail_if(not (remove_samples.is_file() or remove_samples == path('.')), msg = f'Cannot find ``{remove_samples}``')
input: genoFile, group_by=1
plink_command = get_plink_command_prefix(_input)
output_format = determine_plink_format(_input)
make_command = get_make_command(output_format) if not meta_only else "--write-snplist --write-samples"
output_ext = get_output_extension(output_format) if not meta_only else ".snplist"
output: f'{cwd}/{_input:bn}{("." + name) if name else ""}.plink_qc{".extracted" if keep_variants.is_file() else ""}{output_ext}'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', entrypoint=entrypoint
plink2 \
${plink_command} ${_input:n} \
--allow-extra-chr \
${('--maf %s' % maf_filter) if maf_filter > 0 else ''} \
${('--max-maf %s' % maf_max_filter) if maf_max_filter > 0 else ''} \
${('--mac %s' % mac_filter) if mac_filter > 0 else ''} \
${('--max-mac %s' % mac_max_filter) if mac_max_filter > 0 else ''} \
${('--geno %s' % geno_filter) if geno_filter > 0 else ''} ${('dosage' if treat_dosage_missing else '')} \
${('--hwe %s' % hwe_filter) if hwe_filter > 0 else ''} \
${('--mind %s' % mind_filter) if mind_filter > 0 else ''} ${('dosage' if treat_dosage_missing else '')} \
${('--keep %s' % keep_samples) if keep_samples.is_file() else ""} \
${('--remove %s' % remove_samples) if remove_samples.is_file() else ""} \
${('--exclude %s' % exclude_variants) if exclude_variants.is_file() else ""} \
${('--extract %s' % keep_variants) if keep_variants.is_file() else ""} \
${make_command} \
${("") if not rm_dups else "--rm-dup force-first 'list'"} \
${paths(["--%s" % x for x in other_args]) if other_args else ""} \
--out ${_output:n} \
--threads ${numThreads} \
--memory ${int(expand_size(mem) * 0.9)/1e6} --new-id-max-allele-len 1000 --set-all-var-ids chr@:#_\$r_\$a
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint=entrypoint
i="${_output}"
output_size=$(ls -lh $i | cut -f 5 -d ' ')
printf "output_info: %s\noutput_size: %s\n" "$i" "$output_size" >> ${_output:n}.stdout
# LD prunning and remove related individuals (both ind of a pair)
# Plink2 has multi-threaded calculation for LD prunning
[qc_2 (LD pruning)]
# Window size
parameter: window = 50
# Shift window every 10 snps
parameter: shift = 10
parameter: r2 = 0.1
stop_if(r2==0)
plink_command = get_plink_command_prefix(_input)
output_format = determine_plink_format(_input)
make_command = get_make_command(output_format)
output_ext = get_output_extension(output_format, is_prune=True)
output: bed=f'{cwd}/{_input:bn}{output_ext}', prune=f'{cwd}/{_input:bn}.prune.in'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
plink2 \
${plink_command} ${_input:n} \
--allow-extra-chr \
--rm-dup force-first \
--indep-pairwise ${window} ${shift} ${r2} \
--out ${_output["prune"]:nn} \
--threads ${numThreads} \
--memory ${int(expand_size(mem) * 0.9)/1e6}
plink2 \
${plink_command} ${_input:n} \
--allow-extra-chr \
--extract ${_output['prune']} \
${make_command} \
--out ${_output['bed']:n} \
--threads ${numThreads} \
--memory ${int(expand_size(mem) * 0.9)/1e6}
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
i="${_output[0]}"
output_size=$(ls -lh $i | cut -f 5 -d ' ')
printf "output_info: %s\noutput_size: %s\n" "$i" "$output_size" >> ${_output[0]:n}.stdout
i="${_output[1]}"
output_size=$(ls -lh $i | cut -f 5 -d ' ')
output_rows=$(zcat $i | wc -l | cut -f 1 -d ' ')
output_column=$(zcat $i | head -1 | wc -w)
output_preview=$(cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6)
printf "output_info: %s\noutput_size: %s\noutput_rows: %s\noutput_column: %s\noutput_preview:\n%s\n" \
"$i" "$output_size" "$output_rows" "$output_column" "$output_preview" >> ${_output[1]}.stdout
Extract genotype based on overlap with phenotype#
This is an auxiliary step to match genotype and phenotype based on the data and look-up table. The look up table should contain two columns: sample_id
, genotype_id
. If the look up table is not provided or look-up table file not found, then we will assume the names have already been matched.
# This workflow extracts overlapping samples for genotype data with phenotype data, and output the filtered sample genotype list as well as sample phenotype list
[genotype_phenotype_sample_overlap]
# A genotype fam file
parameter: genoFile = path
# A phenotype file, can be bed.gz or tsv
parameter: phenoFile = path
# If this file is provided, a genotype/phenotype sample name match will be performed
# It must contain two column names: genotype_id, sample_id
parameter: sample_participant_lookup = path(".")
depends: executable('tabix'), executable('bgzip')
input: genoFile, phenoFile
output: f'{cwd:a}/{path(_input[1]):bn}.sample_overlap.txt', f'{cwd:a}/{path(_input[1]):bn}.sample_genotypes.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
# Load required libraries
library(dplyr)
library(readr)
library(data.table)
# Read data files; let read_delim auto-determine the delimiter
input_file <- "${_input[0]}"
# Determine if the input is a FAM file (PLINK1) or a PSAM file (PLINK2)
if (grepl("\\.psam$", input_file)) {
# For PSAM file (PLINK2 format)
genoFam <- fread(${_input[0]:ar}, header=TRUE)
# Standardize column names (PSAM typically uses '#IID' or '#FID IID')
if ("#FID" %in% colnames(genoFam) && "IID" %in% colnames(genoFam)) {
genoFam <- genoFam %>% select(`#FID`, IID)
colnames(genoFam) <- c("V1", "V2") # Make column names match PLINK1 format for consistency
} else if ("#IID" %in% colnames(genoFam)) {
genoFam <- genoFam %>% mutate(V1 = `#IID`) %>% select(V1, `#IID`)
colnames(genoFam) <- c("V1", "V2") # Make column names match PLINK1 format for consistency
} else {
# Try to handle other possible PSAM formats
if (ncol(genoFam) >= 2) {
genoFam <- genoFam[, 1:2]
colnames(genoFam) <- c("V1", "V2") # Make column names match PLINK1 format for consistency
} else {
stop("Cannot determine the structure of the PSAM file. Expected columns not found.")
}
}
} else {
# For FAM file (PLINK1 format) - assume standard format with no header
genoFam <- fread(${_input[0]:ar}, header=FALSE)
}
phenoFile <- read_delim(${_input[1]:ar}, col_names=TRUE)
if (${"TRUE" if sample_participant_lookup.is_file() else "FALSE"}) {
sample_lookup <- fread(${sample_participant_lookup:ar}, header=TRUE)
colnames(sample_lookup) <- c("sample_id", "genotype_id") # FIXME: This is for the old lookup table with columns c("sample_id", "participant_id")
# rename phenotype file according to lookup file
colnames(phenoFile)[-c(1:4)] <- phenoFile %>%
colnames() %>%
.[-c(1:4)] %>%
match(sample_lookup$sample_id) %>%
sample_lookup$genotype_id[.]
output_file <- paste0(${phenoFile:nnr}, '.rename_sample.bed')
# rename phenotype file
phenoFile$start <- as.integer(phenoFile$start)
phenoFile$end <- as.integer(phenoFile$end)
fwrite(phenoFile, output_file, sep = '\t')
# Compress the file using bgzip
system(paste("bgzip -c", output_file, ">", paste0(output_file, ".gz")))
# Create the index file using tabix
system(paste("tabix -p bed", paste0(output_file, ".gz")))
system(paste("rm", output_file))
} else {
sample_lookup <- cbind(genoFam[,2], genoFam[,2])
colnames(sample_lookup) <- c("genotype_id", "sample_id")
}
sample_lookup <- sample_lookup %>%
filter(
genotype_id %in% genoFam$V2,
sample_id %in% colnames(phenoFile)
)
genoFam %>%
filter(
V2 %in% sample_lookup$genotype_id,
) %>%
select(V1, V2) %>%
fwrite(${_output[1]:r}, col.names=FALSE, sep="\t")
sample_lookup %>%
fwrite(${_output[0]:r}, sep="\t")