RSS LD Sketch Pipeline

RSS LD Sketch Pipeline#

Description#

This pipeline generates a stochastic genotype sample U = WᵀG from whole-genome sequencing VCF files and stores it as a PLINK2 pgen file for use as an LD reference panel with SuSiE-RSS fine-mapping.

Key idea: Rather than storing the full genotype matrix G (n × p), we compute U = WᵀG (B × p) using a random projection matrix \(W \sim N(0, 1/\sqrt{n})\). The approximate LD matrix \(R = U^T U / B \approx G^T G / n\) by the Johnson–Lindenstrauss lemma. G is never stored.

Matrix dimensions:

G : (n × p) — n individuals × p variants
W : (n × B) — projection matrix, generated once per cohort
U : (B × p) — stochastic genotype sample = WᵀG, stored in pgen
\(\hat{R}\) : (p × p) — approximate LD matrix, computed on-the-fly by SuSiE-RSS from U

The workflow has three steps run in order: generate_W (build the projection matrix), process_block (read VCF per LD block and write per-block dosage sketches), and merge_chrom (merge per-block dosages into one per-chromosome pgen).

Note on data: This example runs on a clearly-labeled synthetic toy dataset — a chr22 VCF (protocol_example.genotype.chr22.bgz, 60 individuals) and a 3-block LD-block BED (protocol_example.ld_blocks.bed). No access-controlled individual-level human genomic data is used.

Input#

LD-block BED (--ld-block-file): tab-separated file with columns chr, start, end (0-based half-open) defining the regions to sketch. Toy file: input/rss_ld_sketch/protocol_example.ld_blocks.bed (3 chr22 blocks).
VCF directory (--vcf-base) and prefix (--vcf-prefix): bgzipped (.bgz) + tabix-indexed VCF(s) named {vcf_prefix}{chr}.*.bgz. Toy file: input/rss_ld_sketch/protocol_example.genotype.chr22.bgz (60 individuals), discovered with --vcf-base input/rss_ld_sketch --vcf-prefix protocol_example.genotype.
--n-samples: number of individuals in the VCF (here 60). Must match the VCF sample count.
--B: number of sketch (pseudo-)samples / projection dimension (the toy uses a small B for speed; production uses ~10000).
--chrom: chromosome to process (e.g. 22; 0 = all autosomes found).
--cohort-id: label used to name output files.

Filter thresholds (defaults shown in the implementation): --maf-min 0.0005, --mac-min 5, --msng-min 0.05.

Steps#

Run the three workflows in order. generate_W builds the shared projection matrix once; process_block sketches each LD block; merge_chrom assembles the per-chromosome pgen.

Timing: ~10-20 min (chr22) on typical compute infrastructure.

Step 1. Generate the projection matrix W (run once per cohort; `--n-samples` must equal the VCF sample count).#

sos run pipeline/rss_ld_sketch.ipynb generate_W \
    --n-samples 60 \
    --output-dir output/rss_ld_sketch \
    --B 50 \
    --seed 123 \
    --cwd output/rss_ld_sketch

/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
INFO: Running generate_W: 
INFO: generate_W is completed.
INFO: generate_W output:   output/rss_ld_sketch/W_B50.npy
INFO: Workflow generate_W (ID=w68f63c60d8da4b5e) is executed successfully with 1 completed step.

Step 2. Process all LD blocks for the chromosome — read the VCF, filter variants, and write per-block dosage sketches U = WᵀG.#

sos run pipeline/rss_ld_sketch.ipynb process_block \
    --ld-block-file input/rss_ld_sketch/protocol_example.ld_blocks.bed \
    --chrom 22 \
    --vcf-base input/rss_ld_sketch \
    --vcf-prefix protocol_example.genotype. \
    --output-dir output/rss_ld_sketch \
    --W-matrix output/rss_ld_sketch/W_B50.npy \
    --B 50 \
    --cohort-id protocol_example. \
    --cwd output/rss_ld_sketch

/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
INFO: Running process_block: 
  3 LD blocks queued
INFO: process_block (index=0) is completed.
INFO: process_block (index=1) is completed.
INFO: process_block (index=2) is completed.
INFO: process_block output:   output/rss_ld_sketch/chr22/chr22_16000000_20000000/protocol_example..chr22_16000000_20000000.dosage.gz output/rss_ld_sketch/chr22/chr22_30000000_34000000/protocol_example..chr22_30000000_34000000.dosage.gz... (3 items in 3 groups)
INFO: Workflow process_block (ID=w8c1e759d62203ef6) is executed successfully with 1 completed step and 3 completed substeps.

Step 3. Merge the per-block dosage sketches into one per-chromosome PLINK2 pgen.#

sos run pipeline/rss_ld_sketch.ipynb merge_chrom \
    --output-dir output/rss_ld_sketch \
    --cohort-id protocol_example. \
    --chrom 22 \
    --cwd output/rss_ld_sketch

/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
INFO: Running merge_chrom: 
PLINK v2.0.0-a.6.9LM 64-bit Intel (29 Jan 2025)    cog-genomics.org/plink/2.0/
(C) 2005-2025 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to output/rss_ld_sketch/chr22/protocol_example..chr22.log.
Options in effect:
  --make-pgen
  --out output/rss_ld_sketch/chr22/protocol_example..chr22
  --pmerge-list output/rss_ld_sketch/chr22/protocol_example..chr22_pmerge_list.txt pfile
  --sort-vars

Start time: Tue Jun 23 09:52:54 2026
191527 MiB RAM detected, ~187643 available; reserving 95763 MiB for main
workspace.
Using up to 32 threads (change this with --threads).
--pmerge-list: 3 filesets specified.
--pmerge-list: 50 samples present.
--pmerge-list: Merged .psam written to
output/rss_ld_sketch/chr22/protocol_example..chr22-merge.psam .
--pmerge-list: 3 .pvar files scanned.
Concatenation job detected.
Concatenating... 673/673 variants complete.
Results written to
output/rss_ld_sketch/chr22/protocol_example..chr22-merge.pgen +
output/rss_ld_sketch/chr22/protocol_example..chr22-merge.pvar .
50 samples (0 females, 0 males, 50 ambiguous; 50 founders) loaded from
output/rss_ld_sketch/chr22/protocol_example..chr22-merge.psam.
673 variants loaded from
output/rss_ld_sketch/chr22/protocol_example..chr22-merge.pvar.
Note: No phenotype data present.
Writing output/rss_ld_sketch/chr22/protocol_example..chr22.pvar ... done.
Writing output/rss_ld_sketch/chr22/protocol_example..chr22.psam ... done.
Writing output/rss_ld_sketch/chr22/protocol_example..chr22.pgen ... done.
End time: Tue Jun 23 09:52:54 2026

=== Filter Summary for chr22 ===
                value
n_total        6157.0
n_passed        673.0
n_multiallelic    0.0
n_monomorphic  4701.0
n_all_na          0.0
n_high_msng     101.0
n_low_maf         0.0
n_low_mac       682.0
pct_dropped      89.1
INFO: merge_chrom is completed.
INFO: merge_chrom output:   output/rss_ld_sketch/chr22/protocol_example..chr22.pgen
INFO: Workflow merge_chrom (ID=w8e5a670551e06660) is executed successfully with 1 completed step.

Command interface#

List every workflow and its parameters:

sos run pipeline/rss_ld_sketch.ipynb -h

Workflow implementation#

[global]
parameter: cwd        = path("output")
parameter: job_size   = 1
parameter: walltime   = "24:00:00"
parameter: mem        = "32G"
parameter: numThreads = 8

cwd = path(f'{cwd:a}')

[generate_W]
# Generate projection matrix $W \sim N(0, 1/\sqrt{n})$, shape (n x B).
# Run ONCE before processing any chromosome.
#
# W depends only on n (total sample size) and B -- not on any variant data.
# n_samples is passed directly as a parameter; no VCF reading is needed.
# All 22 chromosomes reuse the same W so that per-chromosome stochastic
# genotype samples can be arithmetically merged for meta-analysis.
parameter: n_samples = int
parameter: output_dir    = str
parameter: B         = 10000
parameter: seed      = 123

import os
input:  []
output: f'{output_dir}/W_B{B}.npy'
task: trunk_workers = 1, trunk_size = 1, walltime = '00:05:00', mem = '4G', cores = 1
python: expand = "${ }", stdout = f'{_output:n}.stdout', stderr = f'{_output:n}.stderr'

    import numpy as np
    import os

    n      = ${n_samples}
    B      = ${B}
    seed   = ${seed}
    W_out  = "${_output}"

    # -- Generate $W \sim N(0, 1/\sqrt{n})$ -----------------------------
    # Convention: W = np.random.normal(0, 1/np.sqrt(n), size=(n, B))
    # W is shared across all chromosomes -- do not regenerate per chromosome.
    print(f"Generating W ~ N(0, 1/sqrt({n})),  shape ({n}, {B}),  seed={seed}")
    np.random.seed(seed)
    W = np.random.normal(0, 1.0 / np.sqrt(n), size=(n, B)).astype(np.float32)

    os.makedirs(os.path.dirname(os.path.abspath(W_out)), exist_ok=True)
    np.save(W_out, W)
    print(f"Saved: {W_out}")
    print(f"Shape: {W.shape},  size: {os.path.getsize(W_out)/1e9:.2f} GB")

[process_block]
parameter: ld_block_file = str
parameter: chrom         = 0
parameter: vcf_base      = str
parameter: vcf_prefix    = str
parameter: cohort_id     = "ADSP.R5.EUR"
parameter: output_dir    = str
parameter: W_matrix      = str
parameter: B             = 10000
parameter: maf_min       = 0.0005
parameter: mac_min       = 5
parameter: msng_min      = 0.05
parameter: sample_list   = ""

import os

def _read_blocks(bed, chrom_filter):
    blocks = []
    with open(bed) as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue
            parts = line.split()
            c = parts[0]
            if not (c.startswith("chr") and c[3:].isdigit()):
                continue
            cnum = int(c[3:])
            if not (1 <= cnum <= 22):
                continue
            if chrom_filter != 0 and cnum != chrom_filter:
                continue
            blocks.append({"chr": c, "start": int(parts[1]), "end": int(parts[2])})
    if not blocks:
        raise ValueError(f"No blocks found for chrom={chrom_filter} in {bed}")
    return blocks

blocks = _read_blocks(ld_block_file, chrom)
print(f"  {len(blocks)} LD blocks queued")

input: for_each = "blocks"
output: f'{output_dir}/{_blocks["chr"]}/{_blocks["chr"]}_{_blocks["start"]}_{_blocks["end"]}/{cohort_id}.{_blocks["chr"]}_{_blocks["start"]}_{_blocks["end"]}.dosage.gz'
task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads
python: expand = "${ }"

    import numpy as np
    import os
    import gzip
    import sys
    import atexit
    from math import nan
    from cyvcf2 import VCF
    from os import listdir

    # Block coordinates from for_each loop
    chrm_str    = "${_blocks['chr']}"
    block_start = ${_blocks["start"]}
    block_end   = ${_blocks["end"]}

    vcf_base    = "${vcf_base}"
    vcf_prefix  = "${vcf_prefix}"
    W_path      = "${W_matrix}"
    B           = ${B}
    maf_min     = ${maf_min}
    mac_min     = ${mac_min}
    msng_min    = ${msng_min}
    sample_list = "${sample_list}"
    cohort_id   = "${cohort_id}"
    base_dir    = "${output_dir}"

    block_tag   = f"{chrm_str}_{block_start}_{block_end}"
    output_dir  = os.path.join(base_dir, chrm_str, block_tag)
    os.makedirs(output_dir, exist_ok=True)

    log_path = os.path.join(output_dir, f"{block_tag}.log")
    log_fh   = open(log_path, "w")
    sys.stdout = log_fh
    sys.stderr = log_fh
    atexit.register(log_fh.close)

    # -- Load sample subset (optional) -----------------------------
    sample_subset = None
    if sample_list:
        if not os.path.exists(sample_list):
            raise FileNotFoundError(f"sample_list not found: {sample_list}")
        with open(sample_list) as fh:
            sample_subset = set(line.strip() for line in fh if line.strip())
        print(f"  Sample subset: {len(sample_subset):,} samples")

    # -- Helpers ---------------------------------------------------
    def get_vcf_files(chrm_str):
        files = sorted([
            os.path.join(vcf_base, x)
            for x in listdir(vcf_base)
            if x.endswith(".bgz") and (
                x.startswith(vcf_prefix + chrm_str + ":") or
                x.startswith(vcf_prefix + chrm_str + ".")
            )
        ])
        if not files:
            raise FileNotFoundError(f"No VCF files for {chrm_str} in {vcf_base}")
        return files

    def open_vcf(vf, sample_subset):
        """Open a VCF file, applying sample subset if provided."""
        vcf = VCF(vf)
        if sample_subset is not None:
            vcf_samples = vcf.samples
            keep = [s for s in vcf_samples if s in sample_subset]
            if not keep:
                raise ValueError(f"No sample_list samples in {os.path.basename(vf)}")
            vcf.set_samples(keep)
        return vcf

    def extract_dosage(var):
        """Extract diploid dosage from a cyvcf2 variant. Returns list of floats (nan for missing)."""
        return [sum(x[0:2]) for x in [[nan if v == -1 else v for v in gt] for gt in var.genotypes]]

    def fill_missing_col_means(G):
        col_means = np.nanmean(G, axis=0)
        return np.where(np.isnan(G), col_means, G)

    # -- Single-pass: scan variants, filter, and collect dosages ---
    # BED is 0-based half-open [start, end); VCF is 1-based.
    print(f"[1/3] Scanning {chrm_str} [{block_start:,}, {block_end:,}) ...")
    vcf_files = get_vcf_files(chrm_str)
    region    = f"{chrm_str}:{block_start+1}-{block_end}"
    var_info  = []
    dosage_matrix = []
    n_samples = None
    # Filter counters
    n_total = 0
    n_multiallelic = 0
    n_monomorphic = 0
    n_all_na = 0
    n_low_maf = 0
    n_low_mac = 0
    n_high_msng = 0

    for vf in vcf_files:
        vcf = open_vcf(vf, sample_subset)
        if n_samples is None:
            n_samples = len(vcf.samples)
        for var in vcf(region):
            if not (block_start <= var.POS - 1 < block_end):
                continue
            n_total += 1
            if len(var.ALT) != 1:
                n_multiallelic += 1
                continue
            dosage = extract_dosage(var)
            if np.nanvar(dosage) == 0:
                n_monomorphic += 1
                continue
            nan_count = int(np.sum(np.isnan(dosage)))
            n_non_na  = len(dosage) - nan_count
            if n_non_na == 0:
                n_all_na += 1
                continue
            alt_sum   = float(np.nansum(dosage))
            mac       = min(2 * n_non_na - alt_sum, alt_sum)
            maf       = mac / (2 * n_non_na)
            af        = alt_sum / (2 * n_non_na)
            msng_rate = nan_count / len(dosage)
            if msng_rate > msng_min:
                n_high_msng += 1
                continue
            if maf < maf_min:
                n_low_maf += 1
                continue
            if mac < mac_min:
                n_low_mac += 1
                continue
            var_info.append({
                "chr": var.CHROM, "pos": var.POS,
                "ref": var.REF,   "alt": var.ALT[0],
                "af":  round(float(af), 6),
                "id":  f"{var.CHROM}:{var.POS}:{var.REF}:{var.ALT[0]}",
                "obs_ct": 2 * n_non_na,
            })
            dosage_matrix.append(dosage)
        vcf.close()

    n_passed = len(var_info)
    print(f"  {n_total:,} total variants in region")
    print(f"  {n_passed:,} passed filters (n={n_samples:,})")
    print(f"  Filtered: {n_multiallelic:,} multiallelic, "
          f"{n_monomorphic:,} monomorphic, {n_all_na:,} all-NA, "
          f"{n_high_msng:,} high-missingness, "
          f"{n_low_maf:,} low-MAF, {n_low_mac:,} low-MAC")

    if not var_info:
        raise ValueError(f"No passing variants in {chrm_str} [{block_start:,}, {block_end:,})")

    # -- Load W ----------------------------------------------------
    print(f"[2/3] Loading W ...")
    W = np.load(W_path)
    if W.shape != (n_samples, B):
        raise ValueError(f"W shape mismatch: {W.shape} vs ({n_samples},{B})")
    W = W.astype(np.float32)
    print(f"  W: {W.shape}")

    # -- Compute U = $W^T G$ and write output files --------------------
    print(f"[3/3] Computing U and writing output files ...")

    dosage_path = os.path.join(output_dir, f"{cohort_id}.{block_tag}.dosage.gz")
    map_path    = os.path.join(output_dir, f"{cohort_id}.{block_tag}.map")
    afreq_path  = os.path.join(output_dir, f"{cohort_id}.{block_tag}.afreq")
    meta_path   = os.path.join(output_dir, f"{cohort_id}.{block_tag}.meta")

    # Write .map
    with open(map_path, "w") as fh:
        for v in var_info:
            fh.write(f"{v['chr']}\t{v['id']}\t0\t{v['pos']}\n")

    # Write .meta
    with open(meta_path, "w") as fh:
        fh.write(f"source_n_samples={n_samples}\nB={B}\n")
        fh.write(f"chrom={chrm_str}\nblock_start={block_start}\nblock_end={block_end}\n")
        fh.write(f"n_total={n_total}\nn_passed={n_passed}\n")
        fh.write(f"n_multiallelic={n_multiallelic}\nn_monomorphic={n_monomorphic}\n")
        fh.write(f"n_all_na={n_all_na}\nn_high_msng={n_high_msng}\n")
        fh.write(f"n_low_maf={n_low_maf}\nn_low_mac={n_low_mac}\n")


    # Build G from collected dosages, compute U = $W^T G$, write dosage.gz
    # Dosage format=1: ID ALT REF val_S1 ... val_SB
    # Min-max scaling to [0, 2] makes the output plink2-compatible as dosage.
    # This preserves correlation structure (cor is scale-invariant) which is
    # what matters for LD computation downstream.
    G = np.array(dosage_matrix, dtype=np.float32).T  # (n_samples, n_variants)
    del dosage_matrix
    G = fill_missing_col_means(G)

    # variant-wise scaling
    col_mean = G.mean(axis=0, keepdims=True)
    col_std  = G.std(axis=0, keepdims=True)
    # avoid division by zero
    col_std[col_std == 0] = 1.0
    G = (G - col_mean) / col_std

    U = W.T @ G  # (B, n_variants)
    del G

    col_min = U.min(axis=0)
    col_max = U.max(axis=0)
    denom   = col_max - col_min
    denom[denom == 0] = 1.0
    U = 2.0 * (U - col_min) / denom
    U = np.round(U, 4)

    # Record the col min and max for U

    with open(afreq_path, "w") as fh:
        # Add column headers
        fh.write("#CHROM\tID\tREF\tALT\tALT_FREQS\tOBS_CT\tU_MIN\tU_MAX\n")
        for j, v in enumerate(var_info):
            fh.write(f"{v['chr']}\t{v['id']}\t{v['ref']}\t{v['alt']}\t"
                     f"{v['af']:.6f}\t{v['obs_ct']}\t"
                     f"{col_min[j]:.6f}\t{col_max[j]:.6f}\n")

    with gzip.open(dosage_path, "wt", compresslevel=4) as gz:
        for j, v in enumerate(var_info):
            vals = " ".join(f"{x:.4f}" for x in U[:, j])
            gz.write(f"{v['id']} {v['alt']} {v['ref']} {vals}\n")

    del U
    print(f"  Written: {len(var_info):,} variants -> {os.path.basename(dosage_path)}")
    print(f"  Written: {os.path.basename(map_path)}, {os.path.basename(afreq_path)}")
    print(f"\nDone: {chrm_str} [{block_start:,}, {block_end:,})")

[merge_chrom]
parameter: chrom      = 0
parameter: output_dir = str
parameter: cohort_id  = str
parameter: plink2_bin = "plink2"

import os, glob

def chromsto_process(output_dir, chrom_filter):
    if chrom_filter != 0:
        return [f"chr{chrom_filter}"]
    return sorted(set(
        os.path.basename(d)
        for d in glob.glob(os.path.join(output_dir, "chr*"))
        if os.path.isdir(d)
    ))

chroms = chromsto_process(output_dir, chrom)

input: for_each = "chroms"
output: f"{output_dir}/{_chroms}/{cohort_id}.{_chroms}.pgen"
task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads
bash: expand = "$[ ]"

    set -euo pipefail
    shopt -s nullglob

    chrom_dir="$[output_dir]/$[_chroms]"
    final_prefix="${chrom_dir}/$[cohort_id].$[_chroms]"
    merge_list="${chrom_dir}/$[cohort_id].$[_chroms]_pmerge_list.txt"

    # Step 1: Convert each block dosage.gz -> sorted per-block pgen
    > "${merge_list}"
    files=("${chrom_dir}"/*/*.dosage.gz)
    if [ ${#files[@]} -eq 0 ]; then
        echo "No dosage files found in ${chrom_dir}" >&2
        exit 1
    fi
    for dosage_gz in "${files[@]}"; do
        block_dir=$(dirname "${dosage_gz}")
        block_tag=$(basename "${block_dir}")
        prefix="${block_dir}/$[cohort_id].${block_tag}_tmp"
        map_file="${block_dir}/$[cohort_id].${block_tag}.map"
        psam_file="${block_dir}/$[cohort_id].${block_tag}.psam"
        meta_file="${block_dir}/$[cohort_id].${block_tag}.meta"

        B=$(grep "^B=" "${meta_file}" | cut -d= -f2)
        printf '#FID\tIID\n' > "${psam_file}"
        for i in $(seq 1 ${B}); do
            printf 'S%d\tS%d\n' ${i} ${i} >> "${psam_file}"
        done

        $[plink2_bin] \
            --import-dosage "${dosage_gz}" format=1 noheader \
            --psam "${psam_file}" \
            --map  "${map_file}" \
            --make-pgen \
            --out  "${prefix}_unsorted" \
            --silent

        $[plink2_bin] \
            --pfile "${prefix}_unsorted" \
            --make-pgen \
            --sort-vars \
            --out  "${prefix}" \
            --silent

        rm -f "${prefix}_unsorted.pgen" "${prefix}_unsorted.pvar" "${prefix}_unsorted.psam"
        echo "${prefix}" >> "${merge_list}"
    done

    # Step 2: Merge all per-block pgens -> one per-chrom pgen
    $[plink2_bin] \
        --pmerge-list "${merge_list}" pfile \
        --make-pgen \
        --sort-vars \
        --out  "${final_prefix}"

    # Remove PLINK merge intermediates immediately after merge
    rm -f "${final_prefix}-merge.pgen" "${final_prefix}-merge.pvar" "${final_prefix}-merge.psam"

    # Step 3: Concatenate .afreq
    first=1
    for f in "${chrom_dir}"/*/*.afreq; do
        if [ "${first}" -eq 1 ]; then
            cat "${f}" > "${final_prefix}.afreq"
            first=0
        else
            tail -n +2 "${f}" >> "${final_prefix}.afreq"
        fi
    done

R: expand = "$[ ]"

    library(data.table)
    meta_files <- list.files("$[output_dir]/$[_chroms]",
                             pattern = "[.]meta$", recursive = TRUE,
                             full.names = TRUE)
    if (length(meta_files) > 0) {
      fields <- c("n_total", "n_passed", "n_multiallelic", "n_monomorphic",
                  "n_all_na", "n_high_msng", "n_low_maf", "n_low_mac")
      stats <- rbindlist(lapply(meta_files, function(f) {
        lines <- grep("^n_", readLines(f), value = TRUE)
        kv <- strsplit(lines, "=")
        vals <- setNames(as.integer(sapply(kv, `[`, 2)), sapply(kv, `[`, 1))
        as.data.table(as.list(vals[fields]))
      }))
      totals <- colSums(stats, na.rm = TRUE)
      summary <- data.frame(t(totals))
      summary$pct_dropped <- round(100 * (1 - summary$n_passed / summary$n_total), 1)
      cat("\n=== Filter Summary for $[_chroms] ===\n")
      print(data.frame(value = unlist(summary), row.names = names(summary)))
    }

bash: expand = "$[ ]"

    # Step 5: Cleanup block intermediates
    chrom_dir="$[output_dir]/$[_chroms]"
    final_prefix="${chrom_dir}/$[cohort_id].$[_chroms]"

    rm -f "${final_prefix}_pmerge_list.txt"
    for block_dir in "${chrom_dir}"/*/; do
        rm -rf "${block_dir}"
    done

Troubleshooting#

Symptom	Cause	Fix
`No VCF files for chrXX in {vcf_base}`	VCF naming or extension mismatch	Files must end in `.bgz` and be named `{vcf_prefix}{chr}.*.bgz`; check `--vcf-base` and `--vcf-prefix`.
`W shape mismatch`	`--n-samples` or `--B` differs from the W used	Re-run `generate_W` with the same `--n-samples` and `--B`, and pass that `W_B{B}.npy` to `process_block`.
`No passing variants in chrXX`	Filters removed everything (small toy cohort)	Widen `--maf-min` / `--mac-min` / `--msng-min`, or choose blocks with more variants.
`No blocks found for chrom=XX`	`--chrom` does not match any BED rows	Ensure the BED `chr` column matches (e.g. `chr22`) and `--chrom` is the matching number.
Region query returns nothing	Missing tabix index	Run `tabix -p vcf file.bgz` to create the `.tbi`.

Output#

Per chromosome, under --cwd:

{cohort_id}.chr{N}.pgen — binary genotype-sketch data (B pseudo-samples × p variants)
{cohort_id}.chr{N}.pvar — variant information
{cohort_id}.chr{N}.psam — sample (sketch) information
{cohort_id}.chr{N}.afreq — allele frequencies

These feed SuSiE-RSS fine-mapping: load with a metadata TSV (one row per chromosome, columns #chrom start end path, path = pgen prefix). Use the X (genotype) interface for susie_rss(z, X=X) or the R (correlation) interface for susie_rss(z, R=R).

Anticipated Results#

The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.