Steps

Steps#

Univariate Fine-Mapping and TWAS with SuSiE#

Description#

This pipeline performs per-region univariate fine-mapping with SuSiE and computes TWAS weights, via the susie_twas and mnm workflows of pipeline/mnm_regression.ipynb.

qtl_dataset_construct+susie_twas — builds a QtlDataset RDS from genotype/phenotype/covariate files, then for each region fits SuSiE fine-mapping and trains TWAS prediction weights with 10 methods (mrash, susie, susie_inf, enet, lasso, mcp, scad, l0learn, bayes_r, bayes_c), plus a CV-based ensemble weight. Produces *.univariate_bvsr.rds (fine-mapping) and *.univariate_twas_weights.rds (TWAS weights).
mnm — reads all susie_twas outputs (via --fine-mapping-meta), pools CV predictions across multiple contexts, solves a constrained QP to find the optimal method/context mixture, and produces *.multicontext_twas_weights.rds.

SuSiE searches for up to L=5 independent signals per region. Covariates are regressed out of both phenotype and genotype matrices before fitting (SuSiE itself does not accept covariates).

What is a “region”?#

A region is one row in the phenotype manifest (--phenoFile) plus its association window (from --customized-association-windows). One susie_twas call fits one SuSiE + TWAS model per region.

Gene-based phenotypes. Manifest row = one gene; window = gene’s TSS ± cis window (or from a pre-built TADB BED).
Peak-based phenotypes (caQTL, mQTL). Manifest row = one peak/CpG; window = the peak’s containing extended-TAD.

Input Files#

File	Description
`<out_dir>/pheno_manifest.tsv`	Phenotype manifest mapping each region to its phenotype/covariate files
`<out_dir>/association_windows.bed`	Association-test windows (one row per region)
`<out_dir>/peaks_split/<ID>.bed.gz{,.tbi}`	Per-region phenotype data, bgzipped and tabix-indexed

Step 1 – Prepare the phenotype manifest and association-windows BED#

The pipeline expects two derived files per experiment:

pheno_manifest.tsv — 5 columns (#chr start end ID path), one row per region. path points to a bgzipped+tabixed phenotype BED restricted to that region.
association_windows.bed — 4 columns (chr start end ID), one row per region. start/end define the cis search window; ID must match a row in the manifest.

For gene-based phenotypes these files are produced by the standard phenotype preprocessing + TADB generation steps.

For peak-based phenotypes (caQTL, mQTL) you can derive both from the raw peak BED with the helper below; it renames the peak-ID header, splits the multi-sample BED into one bgz+tbi per peak, writes the manifest, and annotates each peak with its smallest containing extended-TAD.

bash code/misc/preprocess/prepare_peak_inputs.sh \
    <raw_peaks.bed.gz> \
    <extended_TADB.bed> \
    <out_dir>

# Produces:
#   <out_dir>/pheno_manifest.tsv
#   <out_dir>/association_windows.bed
#   <out_dir>/peaks_split/<ID>.bed.gz{,.tbi}

Step 2 – Run `susie_twas` + `mnm`#

Gene-based example (single gene, 2-context toy dataset):

# Step 1: build QtlDataset + run SuSiE fine-mapping + TWAS weights (all-in-one)
sos run pipeline/mnm_regression.ipynb qtl_dataset_construct+susie_twas \
    --name protocol_example --cwd output_uni \
    --genoFile input/colocboost/example.chr22.bed \
    --phenoFile input/colocboost/pheno_manifest_multicontext.tsv \
    --covFile input/colocboost/example_covariates.tsv \
    --customized-association-windows input/colocboost/association_windows.bed \
    --region-name ENSG00000130538 --transpose-covariates --save-data -j1

# Step 2: multi-context ensemble TWAS weights (mnm)
sos run pipeline/mnm_regression.ipynb mnm \
    --name protocol_example --cwd output_mnm \
    --genoFile input/colocboost/example.chr22.bed \
    --phenoFile input/colocboost/pheno_manifest_multicontext.tsv \
    --covFile input/colocboost/example_covariates.tsv \
    --customized-association-windows input/colocboost/association_windows.bed \
    --fine-mapping-meta input/colocboost/fine_mapping_meta.tsv \
    --transpose-covariates --save-data -j1

Peak-based example (same pipeline, swap phenotype/covariate inputs):

# Peak/chromatin example (same pipeline, different input)
sos run pipeline/mnm_regression.ipynb susie_twas \
    --name protocol_example \
    --cwd output_10peaks \
    --genoFile input/genotype/protocol_example.genotype.chr22.bed \
    --phenoFile input/phenotype/protocol_example.pheno_manifest.tsv \
    --covFile input/covariate/protocol_example.covariates.tsv \
    --customized-association-windows input/finemapping/protocol_example.association_windows.bed \
    -j1

Expected output files after susie_twas:

output_uni/fine_mapping/protocol_example.ENSG00000130538.univariate_bvsr.rds
output_uni/twas_weights/protocol_example.ENSG00000130538.univariate_twas_weights.rds
output_uni/fine_mapping_meta.tsv   ← auto-generated, used by mnm as --fine-mapping-meta

After mnm:

output_mnm/twas_weights/protocol_example.ENSG00000130538.multicontext_twas_weights.rds

Command Interface#

sos run pipeline/mnm_regression.ipynb -h

Key parameters for qtl_dataset_construct+susie_twas:

Parameter	Description
`--name`	Study name prefix; all output files are named `{name}.{region_id}.*`
`--cwd`	Output root directory; subdirs `qtl_dataset/`, `fine_mapping/`, `twas_weights/` are created automatically
`--genoFile`	PLINK1 `.bed` file path (`.bim`/`.fam` auto-located); FID/IID must match phenotype/covariate sample IDs
`--phenoFile`	Phenotype manifest TSV — columns `ID`, `path`, `cond` (one row per context); `path` points to a bgzipped+tabix-indexed phenotype BED
`--covFile`	Covariate TSV — rows = covariates, columns = samples (use `--transpose-covariates` if rows = samples)
`--customized-association-windows`	4-column `chr start end ID` BED defining the cis search window per region
`--region-name`	Restrict analysis to one or more region IDs (space-separated); omit to run all regions
`--transpose-covariates`	Transpose covariate file before reading (needed when rows = samples)
`--save-data`	Also emit `.univariate_data.rds` with residualized X/Y matrices; required* for downstream `mnm` step
`-j1`	Number of parallel jobs (set to 1 for debugging)

Additional parameter for mnm:

Parameter	Description
`--fine-mapping-meta`	TSV auto-generated by `susie_twas` (at `{cwd}/fine_mapping_meta.tsv`); maps each region to its `univariate_bvsr.rds` and `univariate_twas_weights.rds` paths

Output Format#

All outputs are written under --cwd, one set per region:

Fine-mapping (output_uni/fine_mapping/):

{name}.{region_id}.univariate_bvsr.rds — QtlFineMappingResult S4 object, one entry per context

TWAS weights (output_uni/twas_weights/):

{name}.{region_id}.univariate_twas_weights.rds — TwasWeights S4 object (per-method + ensemble)

Ensemble TWAS weights (after mnm step, output_mnm/twas_weights/):

{name}.{region_id}.multicontext_twas_weights.rds — multi-context ensemble TwasWeights S4

`QtlFineMappingResult` S4 structure#

Top-level via attr(rds, "listData"):

Field	Description
`study`	Study name (from `--name`)
`context`	Context label per entry
`trait`	Gene/trait ID
`method`	Fine-mapping method (`"susie"`)
`entry`	List of `FineMappingEntry` S4 objects (one per context)

Each FineMappingEntry has these attributes:

Attribute	Description
`variantIds`	SNP IDs (`chr:pos:ref:alt`) for all variants in the region
`susieFit`	Raw `susie()` output — `sets$cs` (credible sets after purity filter), `lbf` (log Bayes factors per effect component), `alpha` (posterior inclusion per effect)
`topLoci`	Per-variant summary data.frame (see column reference in Anticipated Results)
`cvResult`	Cross-validation result used in TWAS weight computation

`TwasWeights` S4 structure#

Top-level via attr(tw, "listData"):

Field	Description
`study`	Study name
`context`	Context label per entry (repeated per method)
`trait`	Gene/trait ID
`method`	Method name: `mrash`, `susie`, `susie_inf`, `enet`, `lasso`, `mcp`, `scad`, `l0learn`, `bayes_r`, `bayes_c`, `ensemble`
`entry`	List of `TwasWeightsEntry` S4 objects (N_contexts × 11 methods total)

Each TwasWeightsEntry accessed via attr(entry, "<name>"):

Attribute	Description
`variantIds`	SNP IDs in the region
`weights`	Per-SNP TWAS weights (β from genotype→expression model)
`fits`	Fitted model object (class-specific)
`cvPerformance`	CV results: `$metrics` (corr, rsq, adj_rsq, pval, RMSE, MAE) for per-method entries; `$methodCoef` + `$methodPerformance` for ensemble
`standardized`	Whether X/Y were standardized (default `FALSE`)

Anticipated Results#

Quick sanity check — inspect cross-validation performance for the first region:

Fine-Mapping Result: `QtlFineMappingResult` S4#

The fine-mapping RDS holds all contexts for one gene in a single object. Access the structured data via attr(rds, "listData"):

Field	Description
`study`	Study name string (from `--study`)
`context`	Context label for each entry (one entry per context)
`trait`	Gene/trait ID (Ensembl)
`method`	Fine-mapping method used (`"susie"`)
`entry`	List of `FineMappingEntry` S4 objects, one per context

Each FineMappingEntry has these attributes:

Attribute	Class	Description
`variantIds`	character	SNP IDs in the region (`chr:pos:ref:alt`)
`susieFit`	susie	Raw `susie()` output — contains `sets$cs` (credible sets), `lbf` (log Bayes factors), `alpha` (posterior inclusion probabilities per effect)
`topLoci`	data.frame	Per-variant summary table (see column reference below)
`cvResult`	list	Cross-validation result used to compute TWAS weights

topLoci column reference (1024 rows × 23 columns for this example):

Column	Description
`variant_id`	SNP ID (`chr:pos:ref:alt`)
`chrom`, `pos`	Chromosome and position
`A1`, `A2`	Effect and reference alleles
`N`	Sample size
`af`	Allele frequency of A1
`marginal_beta`, `marginal_se`	Marginal effect estimate and standard error
`marginal_z`, `marginal_p`	Marginal z-score and p-value
`pip`	Posterior Inclusion Probability — probability this variant is causal (0–1)
`posterior_mean`, `posterior_sd`	Posterior effect size estimate and uncertainty
`cs_95`, `cs_70`, `cs_50`	Which credible set the variant belongs to at each coverage level; empty string = not in any CS
`cs_95_purity`	Min pairwise LD r² within the 95% CS; ≥ 0.5 required for a credible CS
`method`	Fine-mapping method (`"susie"`)
`gene`, `event`	Gene ID and molecular event type
`grange_start`, `grange_end`	Association window boundaries

suppressPackageStartupMessages(library(pecotmr))

# Load QtlFineMappingResult
rds <- readRDS("output_uni/fine_mapping/protocol_example.ENSG00000130538.univariate_bvsr.rds")
class(rds)   # "QtlFineMappingResult"

ld <- attr(rds, "listData")
ld$context   # "context1"  "context2"
ld$method    # "susie"  "susie"

# --- topLoci: dimensions and column names ---
e1 <- ld$entry[[1]]          # context1
tl <- attr(e1, "topLoci")

dim(tl)        # 1024 x 23  (one row per variant in the region)
names(tl)
# [1] "variant_id"     "chrom"          "pos"            "A1"             "A2"
# [6] "N"              "af"             "marginal_beta"  "marginal_se"    "marginal_z"
# [11] "marginal_p"     "pip"            "posterior_mean" "posterior_sd"   "cs_95"
# [16] "cs_70"          "cs_50"          "cs_95_purity"   "method"         "gene"
# [21] "event"          "grange_start"   "grange_end"

# --- Step 1: check for valid credible sets ---
sf <- attr(e1, "susieFit")
sf$sets$cs    # NULL -> no credible set passed purity filter (toy data n=49)
sf$lbf        # 0.027 0.027 0.027 0.028 0.028 -> all near zero -> no detectable signal

# --- Step 2: top variants by PIP ---
tl_s <- tl[order(tl$pip, decreasing = TRUE), ]
head(tl_s[, c("variant_id", "af", "marginal_z", "marginal_p",
               "pip", "cs_95", "cs_95_purity")], 5)
#             variant_id        af  marginal_z  marginal_p    pip     cs_95  cs_95_purity
# chr22:16445263:A:T  0.837    3.482    4.99e-04  0.032  susie_0             0
# chr22:16418860:A:T  0.704   -3.436    5.91e-04  0.031  susie_0             0
# chr22:16214125:A:T  0.612    3.307    9.42e-04  0.027  susie_0             0

# cs_95 = "susie_0" but cs_95_purity = 0 on every row:
#   SuSiE found one CS spanning the whole region (all 1024 SNPs),
#   purity = 0 -> not a meaningful credible set.
# In real data (n >= 500): expect PIP > 0.5 lead variants
#   and cs_95_purity >= 0.5 with 1-10 SNPs per CS.

TWAS Weights Result: `TwasWeights` S4#

The TWAS weights RDS holds weights from all methods × all contexts for one gene. Top-level structure via attr(tw, "listData"):

Field	Description
`study`	Study name
`context`	Context label for each entry (repeated per method)
`trait`	Gene/trait ID
`method`	Method name: `mrash`, `susie`, `susie_inf`, `enet`, `lasso`, `mcp`, `scad`, `l0learn`, `bayes_r`, `bayes_c`, `ensemble`
`entry`	List of `TwasWeightsEntry` S4 objects (N_contexts × 11 methods total)

Each TwasWeightsEntry is accessed via attr(entry, "<name>") — not $ notation:

Attribute	Description
`variantIds`	SNP IDs (`chr:pos:ref:alt`) for all variants in the region
`weights`	Per-SNP TWAS weights — the β vector used in Z_TWAS = wᵀz / √(wᵀRw)
`fits`	Fitted model object (class-specific: `mr.ash`, `susie`, `glmnet`, etc.)
`cvPerformance`	CV results — for per-method entries: `$metrics` with `corr`, `rsq`, `adj_rsq`, `pval`, `RMSE`, `MAE`; for ensemble entry: `$methodCoef` (ζ per method) and `$methodPerformance` (CV R² per method)
`standardized`	Whether X/Y were z-score standardized before fitting (default: `FALSE`)

CV performance metrics tell you how well each method predicts expression using out-of-fold cross-validation:

rsq / adj_rsq: cross-validated R² — the primary metric for method quality
corr: Pearson correlation between predicted and observed expression
pval: p-value of the cross-validated prediction

Ensemble methodCoef (ζ): non-negative weights summing to 1 that determine how much each method contributes to the final ensemble prediction. w_ensemble = Σ ζ_k × w_k.

suppressPackageStartupMessages(library(pecotmr))

tw  <- readRDS("output_uni/twas_weights/protocol_example.ENSG00000130538.univariate_twas_weights.rds")
class(tw)     # "TwasWeights"

ld  <- attr(tw, "listData")
unique(ld$context)   # "context1"  "context2"
unique(ld$method)    # "mrash" "susie" "susie_inf" "enet" "lasso" "mcp" "scad" "l0learn" "bayes_r" "bayes_c" "ensemble"
length(ld$entry)     # 22

# Inspect one entry (context1 / susie)
e_susie <- ld$entry[ld$context == "context1" & ld$method == "susie"][[1]]
wt <- attr(e_susie, "weights")
cat("non-zero weights:", sum(wt != 0), "/", length(wt), "\n")

m <- attr(e_susie, "cvPerformance")$metrics
round(m[c("corr","rsq","adj_rsq","pval")], 4)
#    corr     rsq adj_rsq    pval
#  0.9798  0.9599  0.9591  0.0000

# Method summary for context1
idx1 <- which(ld$context == "context1")
for (i in idx1) {
  e <- ld$entry[[i]]
  m <- tryCatch(attr(e,"cvPerformance")$metrics, error=function(e) NULL)
  rsq <- if (!is.null(m) && is.numeric(m)) round(m["rsq"], 3) else NA
  cat(sprintf("%-12s  rsq=%s\n", ld$method[i], rsq))
}
# mrash         rsq=0.466
# susie         rsq=0.960
# susie_inf     rsq=0.305
# enet          rsq=0.060
# lasso         rsq=0.160
# mcp           rsq=NA
# scad          rsq=0.005
# l0learn       rsq=0.031
# bayes_r       rsq=0.520
# bayes_c       rsq=0.577
# ensemble      rsq=NA

# Ensemble coefficients
ens <- ld$entry[ld$context == "context1" & ld$method == "ensemble"][[1]]
cv  <- attr(ens, "cvPerformance")
round(cv$methodCoef[cv$methodCoef > 0.001], 3)   # bayes_r=0.802, bayes_c=0.198
round(cv$methodPerformance, 3)

Interpretation:

twas_weights$<method> — use for TWAS (Z ≈ weights · LD · GWAS z-scores).
twas_cv_result$performance — pick the method with highest adj_rsq before applying.
top_loci in univariate_bvsr.rds — lead variants and credible-set coverages.
susie_result_trimmed$sets$cs — indices of variants in each 95% credible set.

Common pitfalls#

Sample-ID drift. The phenotype BED’s header columns (5+), the bfile FID/IID, and the covariate TSV header must all be drawn from the same set. Sos does not warn on missing samples — it silently drops them and may report zero overlap.
Empty regions. If a region has no variants in its association window (e.g. the manifest points at a window with no bfile coverage) sos marks it completed with an empty output. Verify the expected file count in the INFO: susie_twas output: line.
--cis-window vs --customized-association-windows. Either/or. If both are passed the customized windows win. Pass --cis-window -1 to force the customized path explicitly.