RSS LD Sketch Pipeline#
Description#
This pipeline generates a stochastic genotype sample U = WᵀG from whole-genome sequencing VCF files and stores it as a PLINK2 pgen file for use as an LD reference panel with SuSiE-RSS fine-mapping.
Key idea: Rather than storing the full genotype matrix G (n × p), we compute U = WᵀG (B × p) using a random projection matrix \(W \sim N(0, 1/\sqrt{n})\). The approximate LD matrix \(R = U^T U / B \approx G^T G / n\) by the Johnson–Lindenstrauss lemma. G is never stored.
Matrix dimensions:
G : (n × p) — n individuals × p variants
W : (n × B) — projection matrix, generated once per cohort
U : (B × p) — stochastic genotype sample = WᵀG, stored in pgen
\(\hat{R}\) : (p × p) — approximate LD matrix, computed on-the-fly by SuSiE-RSS from U
The workflow has three steps run in order: generate_W (build the projection matrix), process_block (read VCF per LD block and write per-block dosage sketches), and merge_chrom (merge per-block dosages into one per-chromosome pgen).
Note on data: This example runs on a clearly-labeled synthetic toy dataset — a chr22 VCF (protocol_example.genotype.chr22.bgz, 60 individuals) and a 3-block LD-block BED (protocol_example.ld_blocks.bed). No access-controlled individual-level human genomic data is used.
Input#
LD-block BED (
--ld-block-file): tab-separated file with columnschr,start,end(0-based half-open) defining the regions to sketch. Toy file:input/rss_ld_sketch/protocol_example.ld_blocks.bed(3 chr22 blocks).VCF directory (
--vcf-base) and prefix (--vcf-prefix): bgzipped (.bgz) + tabix-indexed VCF(s) named{vcf_prefix}{chr}.*.bgz. Toy file:input/rss_ld_sketch/protocol_example.genotype.chr22.bgz(60 individuals), discovered with--vcf-base input/rss_ld_sketch --vcf-prefix protocol_example.genotype.--n-samples: number of individuals in the VCF (here 60). Must match the VCF sample count.--B: number of sketch (pseudo-)samples / projection dimension (the toy uses a small B for speed; production uses ~10000).--chrom: chromosome to process (e.g. 22; 0 = all autosomes found).--cohort-id: label used to name output files.
Filter thresholds (defaults shown in the implementation): --maf-min 0.0005, --mac-min 5, --msng-min 0.05.
Steps#
Run the three workflows in order. generate_W builds the shared projection matrix once; process_block sketches each LD block; merge_chrom assembles the per-chromosome pgen.
Timing: ~10-20 min (chr22) on typical compute infrastructure.
Step 1. Generate the projection matrix W (run once per cohort; --n-samples must equal the VCF sample count).#
sos run pipeline/rss_ld_sketch.ipynb generate_W \
--n-samples 60 \
--output-dir output/rss_ld_sketch \
--B 50 \
--seed 123 \
--cwd output/rss_ld_sketch
/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
INFO: Running generate_W:
INFO: generate_W is completed.
INFO: generate_W output: output/rss_ld_sketch/W_B50.npy
INFO: Workflow generate_W (ID=w68f63c60d8da4b5e) is executed successfully with 1 completed step.
Step 2. Process all LD blocks for the chromosome — read the VCF, filter variants, and write per-block dosage sketches U = WᵀG.#
sos run pipeline/rss_ld_sketch.ipynb process_block \
--ld-block-file input/rss_ld_sketch/protocol_example.ld_blocks.bed \
--chrom 22 \
--vcf-base input/rss_ld_sketch \
--vcf-prefix protocol_example.genotype. \
--output-dir output/rss_ld_sketch \
--W-matrix output/rss_ld_sketch/W_B50.npy \
--B 50 \
--cohort-id protocol_example. \
--cwd output/rss_ld_sketch
/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
INFO: Running process_block:
3 LD blocks queued
INFO: process_block (index=0) is completed.
INFO: process_block (index=1) is completed.
INFO: process_block (index=2) is completed.
INFO: process_block output: output/rss_ld_sketch/chr22/chr22_16000000_20000000/protocol_example..chr22_16000000_20000000.dosage.gz output/rss_ld_sketch/chr22/chr22_30000000_34000000/protocol_example..chr22_30000000_34000000.dosage.gz... (3 items in 3 groups)
INFO: Workflow process_block (ID=w8c1e759d62203ef6) is executed successfully with 1 completed step and 3 completed substeps.
Step 3. Merge the per-block dosage sketches into one per-chromosome PLINK2 pgen.#
sos run pipeline/rss_ld_sketch.ipynb merge_chrom \
--output-dir output/rss_ld_sketch \
--cohort-id protocol_example. \
--chrom 22 \
--cwd output/rss_ld_sketch
/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
INFO: Running merge_chrom:
PLINK v2.0.0-a.6.9LM 64-bit Intel (29 Jan 2025) cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to output/rss_ld_sketch/chr22/protocol_example..chr22.log.
Options in effect:
--make-pgen
--out output/rss_ld_sketch/chr22/protocol_example..chr22
--pmerge-list output/rss_ld_sketch/chr22/protocol_example..chr22_pmerge_list.txt pfile
--sort-vars
Start time: Tue Jun 23 09:52:54 2026
191527 MiB RAM detected, ~187643 available; reserving 95763 MiB for main
workspace.
Using up to 32 threads (change this with --threads).
--pmerge-list: 3 filesets specified.
--pmerge-list: 50 samples present.
--pmerge-list: Merged .psam written to
output/rss_ld_sketch/chr22/protocol_example..chr22-merge.psam .
--pmerge-list: 3 .pvar files scanned.
Concatenation job detected.
Concatenating... 673/673 variants complete.
Results written to
output/rss_ld_sketch/chr22/protocol_example..chr22-merge.pgen +
output/rss_ld_sketch/chr22/protocol_example..chr22-merge.pvar .
50 samples (0 females, 0 males, 50 ambiguous; 50 founders) loaded from
output/rss_ld_sketch/chr22/protocol_example..chr22-merge.psam.
673 variants loaded from
output/rss_ld_sketch/chr22/protocol_example..chr22-merge.pvar.
Note: No phenotype data present.
Writing output/rss_ld_sketch/chr22/protocol_example..chr22.pvar ... done.
Writing output/rss_ld_sketch/chr22/protocol_example..chr22.psam ... done.
Writing output/rss_ld_sketch/chr22/protocol_example..chr22.pgen ... done.
End time: Tue Jun 23 09:52:54 2026
=== Filter Summary for chr22 ===
value
n_total 6157.0
n_passed 673.0
n_multiallelic 0.0
n_monomorphic 4701.0
n_all_na 0.0
n_high_msng 101.0
n_low_maf 0.0
n_low_mac 682.0
pct_dropped 89.1
INFO: merge_chrom is completed.
INFO: merge_chrom output: output/rss_ld_sketch/chr22/protocol_example..chr22.pgen
INFO: Workflow merge_chrom (ID=w8e5a670551e06660) is executed successfully with 1 completed step.
Command interface#
List every workflow and its parameters:
sos run pipeline/rss_ld_sketch.ipynb -h
Workflow implementation#
[global]
parameter: cwd = path("output")
parameter: job_size = 1
parameter: walltime = "24:00:00"
parameter: mem = "32G"
parameter: numThreads = 8
cwd = path(f'{cwd:a}')
[generate_W]
# Generate projection matrix $W \sim N(0, 1/\sqrt{n})$, shape (n x B).
# Run ONCE before processing any chromosome.
#
# W depends only on n (total sample size) and B -- not on any variant data.
# n_samples is passed directly as a parameter; no VCF reading is needed.
# All 22 chromosomes reuse the same W so that per-chromosome stochastic
# genotype samples can be arithmetically merged for meta-analysis.
parameter: n_samples = int
parameter: output_dir = str
parameter: B = 10000
parameter: seed = 123
import os
input: []
output: f'{output_dir}/W_B{B}.npy'
task: trunk_workers = 1, trunk_size = 1, walltime = '00:05:00', mem = '4G', cores = 1
python: expand = "${ }", stdout = f'{_output:n}.stdout', stderr = f'{_output:n}.stderr'
import numpy as np
import os
n = ${n_samples}
B = ${B}
seed = ${seed}
W_out = "${_output}"
# -- Generate $W \sim N(0, 1/\sqrt{n})$ -----------------------------
# Convention: W = np.random.normal(0, 1/np.sqrt(n), size=(n, B))
# W is shared across all chromosomes -- do not regenerate per chromosome.
print(f"Generating W ~ N(0, 1/sqrt({n})), shape ({n}, {B}), seed={seed}")
np.random.seed(seed)
W = np.random.normal(0, 1.0 / np.sqrt(n), size=(n, B)).astype(np.float32)
os.makedirs(os.path.dirname(os.path.abspath(W_out)), exist_ok=True)
np.save(W_out, W)
print(f"Saved: {W_out}")
print(f"Shape: {W.shape}, size: {os.path.getsize(W_out)/1e9:.2f} GB")
[process_block]
parameter: ld_block_file = str
parameter: chrom = 0
parameter: vcf_base = str
parameter: vcf_prefix = str
parameter: cohort_id = "ADSP.R5.EUR"
parameter: output_dir = str
parameter: W_matrix = str
parameter: B = 10000
parameter: maf_min = 0.0005
parameter: mac_min = 5
parameter: msng_min = 0.05
parameter: sample_list = ""
import os
def _read_blocks(bed, chrom_filter):
blocks = []
with open(bed) as fh:
for line in fh:
if line.startswith("#") or not line.strip():
continue
parts = line.split()
c = parts[0]
if not (c.startswith("chr") and c[3:].isdigit()):
continue
cnum = int(c[3:])
if not (1 <= cnum <= 22):
continue
if chrom_filter != 0 and cnum != chrom_filter:
continue
blocks.append({"chr": c, "start": int(parts[1]), "end": int(parts[2])})
if not blocks:
raise ValueError(f"No blocks found for chrom={chrom_filter} in {bed}")
return blocks
blocks = _read_blocks(ld_block_file, chrom)
print(f" {len(blocks)} LD blocks queued")
input: for_each = "blocks"
output: f'{output_dir}/{_blocks["chr"]}/{_blocks["chr"]}_{_blocks["start"]}_{_blocks["end"]}/{cohort_id}.{_blocks["chr"]}_{_blocks["start"]}_{_blocks["end"]}.dosage.gz'
task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads
python: expand = "${ }"
import numpy as np
import os
import gzip
import sys
import atexit
from math import nan
from cyvcf2 import VCF
from os import listdir
# Block coordinates from for_each loop
chrm_str = "${_blocks['chr']}"
block_start = ${_blocks["start"]}
block_end = ${_blocks["end"]}
vcf_base = "${vcf_base}"
vcf_prefix = "${vcf_prefix}"
W_path = "${W_matrix}"
B = ${B}
maf_min = ${maf_min}
mac_min = ${mac_min}
msng_min = ${msng_min}
sample_list = "${sample_list}"
cohort_id = "${cohort_id}"
base_dir = "${output_dir}"
block_tag = f"{chrm_str}_{block_start}_{block_end}"
output_dir = os.path.join(base_dir, chrm_str, block_tag)
os.makedirs(output_dir, exist_ok=True)
log_path = os.path.join(output_dir, f"{block_tag}.log")
log_fh = open(log_path, "w")
sys.stdout = log_fh
sys.stderr = log_fh
atexit.register(log_fh.close)
# -- Load sample subset (optional) -----------------------------
sample_subset = None
if sample_list:
if not os.path.exists(sample_list):
raise FileNotFoundError(f"sample_list not found: {sample_list}")
with open(sample_list) as fh:
sample_subset = set(line.strip() for line in fh if line.strip())
print(f" Sample subset: {len(sample_subset):,} samples")
# -- Helpers ---------------------------------------------------
def get_vcf_files(chrm_str):
files = sorted([
os.path.join(vcf_base, x)
for x in listdir(vcf_base)
if x.endswith(".bgz") and (
x.startswith(vcf_prefix + chrm_str + ":") or
x.startswith(vcf_prefix + chrm_str + ".")
)
])
if not files:
raise FileNotFoundError(f"No VCF files for {chrm_str} in {vcf_base}")
return files
def open_vcf(vf, sample_subset):
"""Open a VCF file, applying sample subset if provided."""
vcf = VCF(vf)
if sample_subset is not None:
vcf_samples = vcf.samples
keep = [s for s in vcf_samples if s in sample_subset]
if not keep:
raise ValueError(f"No sample_list samples in {os.path.basename(vf)}")
vcf.set_samples(keep)
return vcf
def extract_dosage(var):
"""Extract diploid dosage from a cyvcf2 variant. Returns list of floats (nan for missing)."""
return [sum(x[0:2]) for x in [[nan if v == -1 else v for v in gt] for gt in var.genotypes]]
def fill_missing_col_means(G):
col_means = np.nanmean(G, axis=0)
return np.where(np.isnan(G), col_means, G)
# -- Single-pass: scan variants, filter, and collect dosages ---
# BED is 0-based half-open [start, end); VCF is 1-based.
print(f"[1/3] Scanning {chrm_str} [{block_start:,}, {block_end:,}) ...")
vcf_files = get_vcf_files(chrm_str)
region = f"{chrm_str}:{block_start+1}-{block_end}"
var_info = []
dosage_matrix = []
n_samples = None
# Filter counters
n_total = 0
n_multiallelic = 0
n_monomorphic = 0
n_all_na = 0
n_low_maf = 0
n_low_mac = 0
n_high_msng = 0
for vf in vcf_files:
vcf = open_vcf(vf, sample_subset)
if n_samples is None:
n_samples = len(vcf.samples)
for var in vcf(region):
if not (block_start <= var.POS - 1 < block_end):
continue
n_total += 1
if len(var.ALT) != 1:
n_multiallelic += 1
continue
dosage = extract_dosage(var)
if np.nanvar(dosage) == 0:
n_monomorphic += 1
continue
nan_count = int(np.sum(np.isnan(dosage)))
n_non_na = len(dosage) - nan_count
if n_non_na == 0:
n_all_na += 1
continue
alt_sum = float(np.nansum(dosage))
mac = min(2 * n_non_na - alt_sum, alt_sum)
maf = mac / (2 * n_non_na)
af = alt_sum / (2 * n_non_na)
msng_rate = nan_count / len(dosage)
if msng_rate > msng_min:
n_high_msng += 1
continue
if maf < maf_min:
n_low_maf += 1
continue
if mac < mac_min:
n_low_mac += 1
continue
var_info.append({
"chr": var.CHROM, "pos": var.POS,
"ref": var.REF, "alt": var.ALT[0],
"af": round(float(af), 6),
"id": f"{var.CHROM}:{var.POS}:{var.REF}:{var.ALT[0]}",
"obs_ct": 2 * n_non_na,
})
dosage_matrix.append(dosage)
vcf.close()
n_passed = len(var_info)
print(f" {n_total:,} total variants in region")
print(f" {n_passed:,} passed filters (n={n_samples:,})")
print(f" Filtered: {n_multiallelic:,} multiallelic, "
f"{n_monomorphic:,} monomorphic, {n_all_na:,} all-NA, "
f"{n_high_msng:,} high-missingness, "
f"{n_low_maf:,} low-MAF, {n_low_mac:,} low-MAC")
if not var_info:
raise ValueError(f"No passing variants in {chrm_str} [{block_start:,}, {block_end:,})")
# -- Load W ----------------------------------------------------
print(f"[2/3] Loading W ...")
W = np.load(W_path)
if W.shape != (n_samples, B):
raise ValueError(f"W shape mismatch: {W.shape} vs ({n_samples},{B})")
W = W.astype(np.float32)
print(f" W: {W.shape}")
# -- Compute U = $W^T G$ and write output files --------------------
print(f"[3/3] Computing U and writing output files ...")
dosage_path = os.path.join(output_dir, f"{cohort_id}.{block_tag}.dosage.gz")
map_path = os.path.join(output_dir, f"{cohort_id}.{block_tag}.map")
afreq_path = os.path.join(output_dir, f"{cohort_id}.{block_tag}.afreq")
meta_path = os.path.join(output_dir, f"{cohort_id}.{block_tag}.meta")
# Write .map
with open(map_path, "w") as fh:
for v in var_info:
fh.write(f"{v['chr']}\t{v['id']}\t0\t{v['pos']}\n")
# Write .meta
with open(meta_path, "w") as fh:
fh.write(f"source_n_samples={n_samples}\nB={B}\n")
fh.write(f"chrom={chrm_str}\nblock_start={block_start}\nblock_end={block_end}\n")
fh.write(f"n_total={n_total}\nn_passed={n_passed}\n")
fh.write(f"n_multiallelic={n_multiallelic}\nn_monomorphic={n_monomorphic}\n")
fh.write(f"n_all_na={n_all_na}\nn_high_msng={n_high_msng}\n")
fh.write(f"n_low_maf={n_low_maf}\nn_low_mac={n_low_mac}\n")
# Build G from collected dosages, compute U = $W^T G$, write dosage.gz
# Dosage format=1: ID ALT REF val_S1 ... val_SB
# Min-max scaling to [0, 2] makes the output plink2-compatible as dosage.
# This preserves correlation structure (cor is scale-invariant) which is
# what matters for LD computation downstream.
G = np.array(dosage_matrix, dtype=np.float32).T # (n_samples, n_variants)
del dosage_matrix
G = fill_missing_col_means(G)
# variant-wise scaling
col_mean = G.mean(axis=0, keepdims=True)
col_std = G.std(axis=0, keepdims=True)
# avoid division by zero
col_std[col_std == 0] = 1.0
G = (G - col_mean) / col_std
U = W.T @ G # (B, n_variants)
del G
col_min = U.min(axis=0)
col_max = U.max(axis=0)
denom = col_max - col_min
denom[denom == 0] = 1.0
U = 2.0 * (U - col_min) / denom
U = np.round(U, 4)
# Record the col min and max for U
with open(afreq_path, "w") as fh:
# Add column headers
fh.write("#CHROM\tID\tREF\tALT\tALT_FREQS\tOBS_CT\tU_MIN\tU_MAX\n")
for j, v in enumerate(var_info):
fh.write(f"{v['chr']}\t{v['id']}\t{v['ref']}\t{v['alt']}\t"
f"{v['af']:.6f}\t{v['obs_ct']}\t"
f"{col_min[j]:.6f}\t{col_max[j]:.6f}\n")
with gzip.open(dosage_path, "wt", compresslevel=4) as gz:
for j, v in enumerate(var_info):
vals = " ".join(f"{x:.4f}" for x in U[:, j])
gz.write(f"{v['id']} {v['alt']} {v['ref']} {vals}\n")
del U
print(f" Written: {len(var_info):,} variants -> {os.path.basename(dosage_path)}")
print(f" Written: {os.path.basename(map_path)}, {os.path.basename(afreq_path)}")
print(f"\nDone: {chrm_str} [{block_start:,}, {block_end:,})")
[merge_chrom]
parameter: chrom = 0
parameter: output_dir = str
parameter: cohort_id = str
parameter: plink2_bin = "plink2"
import os, glob
def chromsto_process(output_dir, chrom_filter):
if chrom_filter != 0:
return [f"chr{chrom_filter}"]
return sorted(set(
os.path.basename(d)
for d in glob.glob(os.path.join(output_dir, "chr*"))
if os.path.isdir(d)
))
chroms = chromsto_process(output_dir, chrom)
input: for_each = "chroms"
output: f"{output_dir}/{_chroms}/{cohort_id}.{_chroms}.pgen"
task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads
bash: expand = "$[ ]"
set -euo pipefail
shopt -s nullglob
chrom_dir="$[output_dir]/$[_chroms]"
final_prefix="${chrom_dir}/$[cohort_id].$[_chroms]"
merge_list="${chrom_dir}/$[cohort_id].$[_chroms]_pmerge_list.txt"
# Step 1: Convert each block dosage.gz -> sorted per-block pgen
> "${merge_list}"
files=("${chrom_dir}"/*/*.dosage.gz)
if [ ${#files[@]} -eq 0 ]; then
echo "No dosage files found in ${chrom_dir}" >&2
exit 1
fi
for dosage_gz in "${files[@]}"; do
block_dir=$(dirname "${dosage_gz}")
block_tag=$(basename "${block_dir}")
prefix="${block_dir}/$[cohort_id].${block_tag}_tmp"
map_file="${block_dir}/$[cohort_id].${block_tag}.map"
psam_file="${block_dir}/$[cohort_id].${block_tag}.psam"
meta_file="${block_dir}/$[cohort_id].${block_tag}.meta"
B=$(grep "^B=" "${meta_file}" | cut -d= -f2)
printf '#FID\tIID\n' > "${psam_file}"
for i in $(seq 1 ${B}); do
printf 'S%d\tS%d\n' ${i} ${i} >> "${psam_file}"
done
$[plink2_bin] \
--import-dosage "${dosage_gz}" format=1 noheader \
--psam "${psam_file}" \
--map "${map_file}" \
--make-pgen \
--out "${prefix}_unsorted" \
--silent
$[plink2_bin] \
--pfile "${prefix}_unsorted" \
--make-pgen \
--sort-vars \
--out "${prefix}" \
--silent
rm -f "${prefix}_unsorted.pgen" "${prefix}_unsorted.pvar" "${prefix}_unsorted.psam"
echo "${prefix}" >> "${merge_list}"
done
# Step 2: Merge all per-block pgens -> one per-chrom pgen
$[plink2_bin] \
--pmerge-list "${merge_list}" pfile \
--make-pgen \
--sort-vars \
--out "${final_prefix}"
# Remove PLINK merge intermediates immediately after merge
rm -f "${final_prefix}-merge.pgen" "${final_prefix}-merge.pvar" "${final_prefix}-merge.psam"
# Step 3: Concatenate .afreq
first=1
for f in "${chrom_dir}"/*/*.afreq; do
if [ "${first}" -eq 1 ]; then
cat "${f}" > "${final_prefix}.afreq"
first=0
else
tail -n +2 "${f}" >> "${final_prefix}.afreq"
fi
done
R: expand = "$[ ]"
library(data.table)
meta_files <- list.files("$[output_dir]/$[_chroms]",
pattern = "[.]meta$", recursive = TRUE,
full.names = TRUE)
if (length(meta_files) > 0) {
fields <- c("n_total", "n_passed", "n_multiallelic", "n_monomorphic",
"n_all_na", "n_high_msng", "n_low_maf", "n_low_mac")
stats <- rbindlist(lapply(meta_files, function(f) {
lines <- grep("^n_", readLines(f), value = TRUE)
kv <- strsplit(lines, "=")
vals <- setNames(as.integer(sapply(kv, `[`, 2)), sapply(kv, `[`, 1))
as.data.table(as.list(vals[fields]))
}))
totals <- colSums(stats, na.rm = TRUE)
summary <- data.frame(t(totals))
summary$pct_dropped <- round(100 * (1 - summary$n_passed / summary$n_total), 1)
cat("\n=== Filter Summary for $[_chroms] ===\n")
print(data.frame(value = unlist(summary), row.names = names(summary)))
}
bash: expand = "$[ ]"
# Step 5: Cleanup block intermediates
chrom_dir="$[output_dir]/$[_chroms]"
final_prefix="${chrom_dir}/$[cohort_id].$[_chroms]"
rm -f "${final_prefix}_pmerge_list.txt"
for block_dir in "${chrom_dir}"/*/; do
rm -rf "${block_dir}"
done
Troubleshooting#
Symptom |
Cause |
Fix |
|---|---|---|
|
VCF naming or extension mismatch |
Files must end in |
|
|
Re-run |
|
Filters removed everything (small toy cohort) |
Widen |
|
|
Ensure the BED |
Region query returns nothing |
Missing tabix index |
Run |
Output#
Per chromosome, under --cwd:
{cohort_id}.chr{N}.pgen— binary genotype-sketch data (B pseudo-samples × p variants){cohort_id}.chr{N}.pvar— variant information{cohort_id}.chr{N}.psam— sample (sketch) information{cohort_id}.chr{N}.afreq— allele frequencies
These feed SuSiE-RSS fine-mapping: load with a metadata TSV (one row per chromosome, columns #chrom start end path, path = pgen prefix). Use the X (genotype) interface for susie_rss(z, X=X) or the R (correlation) interface for susie_rss(z, R=R).
Anticipated Results#
The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.