Independent list of variants using LD clumping#
Description#
This notebook builds a list of approximately independent (weakly linked) variants from a reference genotype panel by running PLINK LD clumping (--indep-pairwise) within each LD block, then merging the pruned variants into a single list. The resulting variant list is useful for downstream mashr analysis where a set of independent variants is required.
Synthetic-data note. This minimal working example runs on a small synthetic chr22 PLINK panel (
protocol_example.ld_genotype.chr22, 60 samples / ~18k variants) built from the toy genotype VCF. It is for demonstration only and is not real individual-level data.
Input#
--genotype-list: a tab-separated file (no header). Each row is an LD block label in the first column followed by one or more PLINK bfile prefixes (.bed, with matching.bim/.fam). Example:input/ld_prune/protocol_example.ld_genotype.list.--cwd: output directory (defaultoutput/ld_pruned).--window/--shift/--r2: PLINK--indep-pairwiseparameters (defaults50,10,0.001).
Output#
Per-block
*.prune.infiles (variants kept by LD clumping) under<cwd>/LD_pruning/.A merged list of independent variants
LD_pruned_variants.txtunder<cwd>/LD_pruning/.
Steps#
Step 1. Run the full LD_pruning workflow. It performs LD clumping per block (LD_pruning_1) and then merges the surviving variants into a single list (LD_pruning_2).
Timing: ~5-10 min on typical compute infrastructure.
sos run pipeline/ld_prune_reference.ipynb LD_pruning \
--genotype-list input/ld_prune/protocol_example.ld_genotype.list \
--cwd output/ld_pruned
Command interface#
sos run pipeline/ld_prune_reference.ipynb -h
Workflow implementation#
The [global] section declares parameters (output directory and resource options; the empty container and the container=container task option are kept for parity with the protocol). [LD_pruning_1] reads the genotype list and runs PLINK --indep-pairwise per block; [LD_pruning_2] merges all *.prune.in files and reformats the variant IDs into a single table.
[global]
parameter: cwd = path("output")
parameter: container = ''
parameter: job_size = 1
parameter: walltime = "1h"
parameter: mem = "80G"
parameter: numThreads = 10
[LD_pruning_1]
parameter: genotype_list = path
parameter: window = 50
parameter: shift = 10
parameter: r2 = 1e-3
import pandas as pd
geno_path = pd.read_csv(genotype, sep = "\t")
input_df = geno_path.values.tolist()
input_blocks = [x[0] for x in input_df]
input_files = [x[1:] for x in input_df]
input: input_files, group_by = 1, group_with = "input_blocks"
output: prune = f'{cwd}/{step_name[:-2]}/{_input_blocks}.prune.in'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
plink2 \
--bfile ${_input:anr} \
--indep-pairwise ${window} ${shift} ${r2} \
--rm-dup force-first \
--out ${_output["prune"]:annr} \
--threads ${numThreads}
[LD_pruning_2]
input: group_by = "all"
output: f'{cwd}/{step_name[:-2]}/LD_pruned_variants.txt'
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container
library(data.table)
library(dplyr)
library(stringr)
per_block_variants = c(${_input:ar,})
uncorrelated_variants = as.data.frame(rbindlist(lapply(per_block_variants, fread, header = FALSE)))
random_variants = str_split(gsub("_",":",uncorrelated_variants$V1),":",simplify=TRUE)%>%data.frame()%>%
rename("chrom" = "X1","pos" = "X2","alt" = "X3","ref"="X4")%>%
mutate(variant_id = paste(chrom, pos, alt, ref, sep = ":"))
fwrite(random_variants, ${_output:r}, row.names=F,sep="\t",quote=F)
Troubleshooting#
Problem |
Likely cause |
Fix |
|---|---|---|
|
bfile prefix in the genotype list is wrong or |
Ensure all three PLINK files exist next to the |
Empty |
|
Loosen |
|
|
Install the R packages in the active environment |
Anticipated Results#
The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.