Independent list of variants using LD clumping

Independent list of variants using LD clumping#

Description#

This notebook builds a list of approximately independent (weakly linked) variants from a reference genotype panel by running PLINK LD clumping (--indep-pairwise) within each LD block, then merging the pruned variants into a single list. The resulting variant list is useful for downstream mashr analysis where a set of independent variants is required.

Synthetic-data note. This minimal working example runs on a small synthetic chr22 PLINK panel (protocol_example.ld_genotype.chr22, 60 samples / ~18k variants) built from the toy genotype VCF. It is for demonstration only and is not real individual-level data.

Input#

--genotype-list: a tab-separated file (no header). Each row is an LD block label in the first column followed by one or more PLINK bfile prefixes (.bed, with matching .bim/.fam). Example: input/ld_prune/protocol_example.ld_genotype.list.
--cwd: output directory (default output/ld_pruned).
--window / --shift / --r2: PLINK --indep-pairwise parameters (defaults 50, 10, 0.001).

Output#

Per-block *.prune.in files (variants kept by LD clumping) under <cwd>/LD_pruning/.
A merged list of independent variants LD_pruned_variants.txt under <cwd>/LD_pruning/.

Steps#

Step 1. Run the full LD_pruning workflow. It performs LD clumping per block (LD_pruning_1) and then merges the surviving variants into a single list (LD_pruning_2).

Timing: ~5-10 min on typical compute infrastructure.

sos run pipeline/ld_prune_reference.ipynb LD_pruning \
    --genotype-list input/ld_prune/protocol_example.ld_genotype.list \
    --cwd output/ld_pruned

Command interface#

sos run pipeline/ld_prune_reference.ipynb -h

Workflow implementation#

The [global] section declares parameters (output directory and resource options; the empty container and the container=container task option are kept for parity with the protocol). [LD_pruning_1] reads the genotype list and runs PLINK --indep-pairwise per block; [LD_pruning_2] merges all *.prune.in files and reformats the variant IDs into a single table.

[global]
parameter: cwd = path("output")
parameter: container = ''
parameter: job_size = 1
parameter: walltime = "1h"
parameter: mem = "80G"
parameter: numThreads = 10

[LD_pruning_1]
parameter: genotype_list = path
parameter: window = 50
parameter: shift = 10
parameter: r2 = 1e-3

import pandas as pd
geno_path = pd.read_csv(genotype, sep = "\t")
input_df = geno_path.values.tolist()
input_blocks = [x[0] for x in input_df]
input_files = [x[1:] for x in input_df] 

input: input_files, group_by = 1, group_with = "input_blocks"
output: prune = f'{cwd}/{step_name[:-2]}/{_input_blocks}.prune.in'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: container=container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    plink2 \
        --bfile  ${_input:anr} \
        --indep-pairwise  ${window} ${shift} ${r2} \
        --rm-dup force-first \
        --out ${_output["prune"]:annr} \
        --threads ${numThreads}

[LD_pruning_2]
input: group_by = "all"
output: f'{cwd}/{step_name[:-2]}/LD_pruned_variants.txt'
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container
    library(data.table)
    library(dplyr)
    library(stringr)
    per_block_variants = c(${_input:ar,})
    uncorrelated_variants = as.data.frame(rbindlist(lapply(per_block_variants, fread, header = FALSE)))
    random_variants  = str_split(gsub("_",":",uncorrelated_variants$V1),":",simplify=TRUE)%>%data.frame()%>%
                    rename("chrom" = "X1","pos" = "X2","alt" = "X3","ref"="X4")%>%
                    mutate(variant_id = paste(chrom, pos, alt, ref, sep = ":"))
    fwrite(random_variants, ${_output:r}, row.names=F,sep="\t",quote=F)

Troubleshooting#

Problem	Likely cause	Fix
`Error: Failed to open ... .bed`	bfile prefix in the genotype list is wrong or `.bim`/`.fam` missing	Ensure all three PLINK files exist next to the `.bed` named in the list
Empty `*.prune.in`	`--r2` threshold too strict for the toy data	Loosen `--r2` (e.g. `0.2`) or widen `--window`
`LD_pruning_2` R error about packages	`data.table`/`dplyr`/`stringr` not installed	Install the R packages in the active environment

Anticipated Results#

The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.