Steps#

Univariate Fine-Mapping and TWAS with SuSiE#

Description#

This pipeline performs per-region univariate fine-mapping with SuSiE and computes TWAS weights, via the susie_twas and mnm workflows of pipeline/mnm_regression.ipynb.

  • qtl_dataset_construct+susie_twas — builds a QtlDataset RDS from genotype/phenotype/covariate files, then for each region fits SuSiE fine-mapping and trains TWAS prediction weights with 10 methods (mrash, susie, susie_inf, enet, lasso, mcp, scad, l0learn, bayes_r, bayes_c), plus a CV-based ensemble weight. Produces *.univariate_bvsr.rds (fine-mapping) and *.univariate_twas_weights.rds (TWAS weights).

  • mnm — reads all susie_twas outputs (via --fine-mapping-meta), pools CV predictions across multiple contexts, solves a constrained QP to find the optimal method/context mixture, and produces *.multicontext_twas_weights.rds.

SuSiE searches for up to L=5 independent signals per region. Covariates are regressed out of both phenotype and genotype matrices before fitting (SuSiE itself does not accept covariates).

What is a “region”?#

A region is one row in the phenotype manifest (--phenoFile) plus its association window (from --customized-association-windows). One susie_twas call fits one SuSiE + TWAS model per region.

  • Gene-based phenotypes. Manifest row = one gene; window = gene’s TSS ± cis window (or from a pre-built TADB BED).

  • Peak-based phenotypes (caQTL, mQTL). Manifest row = one peak/CpG; window = the peak’s containing extended-TAD.

Input Files#

File

Description

<out_dir>/pheno_manifest.tsv

Phenotype manifest mapping each region to its phenotype/covariate files

<out_dir>/association_windows.bed

Association-test windows (one row per region)

<out_dir>/peaks_split/<ID>.bed.gz{,.tbi}

Per-region phenotype data, bgzipped and tabix-indexed

Step 1 – Prepare the phenotype manifest and association-windows BED#

The pipeline expects two derived files per experiment:

  • pheno_manifest.tsv — 5 columns (#chr start end ID path), one row per region. path points to a bgzipped+tabixed phenotype BED restricted to that region.

  • association_windows.bed — 4 columns (chr start end ID), one row per region. start/end define the cis search window; ID must match a row in the manifest.

For gene-based phenotypes these files are produced by the standard phenotype preprocessing + TADB generation steps.

For peak-based phenotypes (caQTL, mQTL) you can derive both from the raw peak BED with the helper below; it renames the peak-ID header, splits the multi-sample BED into one bgz+tbi per peak, writes the manifest, and annotates each peak with its smallest containing extended-TAD.

bash code/misc/preprocess/prepare_peak_inputs.sh \
    <raw_peaks.bed.gz> \
    <extended_TADB.bed> \
    <out_dir>

# Produces:
#   <out_dir>/pheno_manifest.tsv
#   <out_dir>/association_windows.bed
#   <out_dir>/peaks_split/<ID>.bed.gz{,.tbi}

Step 2 – Run susie_twas + mnm#

Gene-based example (single gene, 2-context toy dataset):

# Step 1: build QtlDataset + run SuSiE fine-mapping + TWAS weights (all-in-one)
sos run pipeline/mnm_regression.ipynb qtl_dataset_construct+susie_twas \
    --name protocol_example --cwd output_uni \
    --genoFile input/colocboost/example.chr22.bed \
    --phenoFile input/colocboost/pheno_manifest_multicontext.tsv \
    --covFile input/colocboost/example_covariates.tsv \
    --customized-association-windows input/colocboost/association_windows.bed \
    --region-name ENSG00000130538 --transpose-covariates --save-data -j1

# Step 2: multi-context ensemble TWAS weights (mnm)
sos run pipeline/mnm_regression.ipynb mnm \
    --name protocol_example --cwd output_mnm \
    --genoFile input/colocboost/example.chr22.bed \
    --phenoFile input/colocboost/pheno_manifest_multicontext.tsv \
    --covFile input/colocboost/example_covariates.tsv \
    --customized-association-windows input/colocboost/association_windows.bed \
    --fine-mapping-meta input/colocboost/fine_mapping_meta.tsv \
    --transpose-covariates --save-data -j1

Peak-based example (same pipeline, swap phenotype/covariate inputs):

# Peak/chromatin example (same pipeline, different input)
sos run pipeline/mnm_regression.ipynb susie_twas \
    --name protocol_example \
    --cwd output_10peaks \
    --genoFile input/genotype/protocol_example.genotype.chr22.bed \
    --phenoFile input/phenotype/protocol_example.pheno_manifest.tsv \
    --covFile input/covariate/protocol_example.covariates.tsv \
    --customized-association-windows input/finemapping/protocol_example.association_windows.bed \
    -j1

Expected output files after susie_twas:

output_uni/fine_mapping/protocol_example.ENSG00000130538.univariate_bvsr.rds
output_uni/twas_weights/protocol_example.ENSG00000130538.univariate_twas_weights.rds
output_uni/fine_mapping_meta.tsv   ← auto-generated, used by mnm as --fine-mapping-meta

After mnm:

output_mnm/twas_weights/protocol_example.ENSG00000130538.multicontext_twas_weights.rds

Command Interface#

sos run pipeline/mnm_regression.ipynb -h

Key parameters for qtl_dataset_construct+susie_twas:

Parameter

Description

--name

Study name prefix; all output files are named {name}.{region_id}.*

--cwd

Output root directory; subdirs qtl_dataset/, fine_mapping/, twas_weights/ are created automatically

--genoFile

PLINK1 .bed file path (.bim/.fam auto-located); FID/IID must match phenotype/covariate sample IDs

--phenoFile

Phenotype manifest TSV — columns ID, path, cond (one row per context); path points to a bgzipped+tabix-indexed phenotype BED

--covFile

Covariate TSV — rows = covariates, columns = samples (use --transpose-covariates if rows = samples)

--customized-association-windows

4-column chr start end ID BED defining the cis search window per region

--region-name

Restrict analysis to one or more region IDs (space-separated); omit to run all regions

--transpose-covariates

Transpose covariate file before reading (needed when rows = samples)

--save-data

Also emit *.univariate_data.rds with residualized X/Y matrices; required for downstream mnm step

-j1

Number of parallel jobs (set to 1 for debugging)

Additional parameter for mnm:

Parameter

Description

--fine-mapping-meta

TSV auto-generated by susie_twas (at {cwd}/fine_mapping_meta.tsv); maps each region to its univariate_bvsr.rds and univariate_twas_weights.rds paths

Output Format#

All outputs are written under --cwd, one set per region:

Fine-mapping (output_uni/fine_mapping/):

  • {name}.{region_id}.univariate_bvsr.rdsQtlFineMappingResult S4 object, one entry per context

TWAS weights (output_uni/twas_weights/):

  • {name}.{region_id}.univariate_twas_weights.rdsTwasWeights S4 object (per-method + ensemble)

Ensemble TWAS weights (after mnm step, output_mnm/twas_weights/):

  • {name}.{region_id}.multicontext_twas_weights.rds — multi-context ensemble TwasWeights S4

QtlFineMappingResult S4 structure#

Top-level via attr(rds, "listData"):

Field

Description

study

Study name (from --name)

context

Context label per entry

trait

Gene/trait ID

method

Fine-mapping method ("susie")

entry

List of FineMappingEntry S4 objects (one per context)

Each FineMappingEntry has these attributes:

Attribute

Description

variantIds

SNP IDs (chr:pos:ref:alt) for all variants in the region

susieFit

Raw susie() output — sets$cs (credible sets after purity filter), lbf (log Bayes factors per effect component), alpha (posterior inclusion per effect)

topLoci

Per-variant summary data.frame (see column reference in Anticipated Results)

cvResult

Cross-validation result used in TWAS weight computation

TwasWeights S4 structure#

Top-level via attr(tw, "listData"):

Field

Description

study

Study name

context

Context label per entry (repeated per method)

trait

Gene/trait ID

method

Method name: mrash, susie, susie_inf, enet, lasso, mcp, scad, l0learn, bayes_r, bayes_c, ensemble

entry

List of TwasWeightsEntry S4 objects (N_contexts × 11 methods total)

Each TwasWeightsEntry accessed via attr(entry, "<name>"):

Attribute

Description

variantIds

SNP IDs in the region

weights

Per-SNP TWAS weights (β from genotype→expression model)

fits

Fitted model object (class-specific)

cvPerformance

CV results: $metrics (corr, rsq, adj_rsq, pval, RMSE, MAE) for per-method entries; $methodCoef + $methodPerformance for ensemble

standardized

Whether X/Y were standardized (default FALSE)

Anticipated Results#

Quick sanity check — inspect cross-validation performance for the first region:

Fine-Mapping Result: QtlFineMappingResult S4#

The fine-mapping RDS holds all contexts for one gene in a single object. Access the structured data via attr(rds, "listData"):

Field

Description

study

Study name string (from --study)

context

Context label for each entry (one entry per context)

trait

Gene/trait ID (Ensembl)

method

Fine-mapping method used ("susie")

entry

List of FineMappingEntry S4 objects, one per context

Each FineMappingEntry has these attributes:

Attribute

Class

Description

variantIds

character

SNP IDs in the region (chr:pos:ref:alt)

susieFit

susie

Raw susie() output — contains sets$cs (credible sets), lbf (log Bayes factors), alpha (posterior inclusion probabilities per effect)

topLoci

data.frame

Per-variant summary table (see column reference below)

cvResult

list

Cross-validation result used to compute TWAS weights

topLoci column reference (1024 rows × 23 columns for this example):

Column

Description

variant_id

SNP ID (chr:pos:ref:alt)

chrom, pos

Chromosome and position

A1, A2

Effect and reference alleles

N

Sample size

af

Allele frequency of A1

marginal_beta, marginal_se

Marginal effect estimate and standard error

marginal_z, marginal_p

Marginal z-score and p-value

pip

Posterior Inclusion Probability — probability this variant is causal (0–1)

posterior_mean, posterior_sd

Posterior effect size estimate and uncertainty

cs_95, cs_70, cs_50

Which credible set the variant belongs to at each coverage level; empty string = not in any CS

cs_95_purity

Min pairwise LD r² within the 95% CS; ≥ 0.5 required for a credible CS

method

Fine-mapping method ("susie")

gene, event

Gene ID and molecular event type

grange_start, grange_end

Association window boundaries

suppressPackageStartupMessages(library(pecotmr))

# Load QtlFineMappingResult
rds <- readRDS("output_uni/fine_mapping/protocol_example.ENSG00000130538.univariate_bvsr.rds")
class(rds)   # "QtlFineMappingResult"

ld <- attr(rds, "listData")
ld$context   # "context1"  "context2"
ld$method    # "susie"  "susie"

# --- topLoci: dimensions and column names ---
e1 <- ld$entry[[1]]          # context1
tl <- attr(e1, "topLoci")

dim(tl)        # 1024 x 23  (one row per variant in the region)
names(tl)
# [1] "variant_id"     "chrom"          "pos"            "A1"             "A2"
# [6] "N"              "af"             "marginal_beta"  "marginal_se"    "marginal_z"
# [11] "marginal_p"     "pip"            "posterior_mean" "posterior_sd"   "cs_95"
# [16] "cs_70"          "cs_50"          "cs_95_purity"   "method"         "gene"
# [21] "event"          "grange_start"   "grange_end"

# --- Step 1: check for valid credible sets ---
sf <- attr(e1, "susieFit")
sf$sets$cs    # NULL -> no credible set passed purity filter (toy data n=49)
sf$lbf        # 0.027 0.027 0.027 0.028 0.028 -> all near zero -> no detectable signal

# --- Step 2: top variants by PIP ---
tl_s <- tl[order(tl$pip, decreasing = TRUE), ]
head(tl_s[, c("variant_id", "af", "marginal_z", "marginal_p",
               "pip", "cs_95", "cs_95_purity")], 5)
#             variant_id        af  marginal_z  marginal_p    pip     cs_95  cs_95_purity
# chr22:16445263:A:T  0.837    3.482    4.99e-04  0.032  susie_0             0
# chr22:16418860:A:T  0.704   -3.436    5.91e-04  0.031  susie_0             0
# chr22:16214125:A:T  0.612    3.307    9.42e-04  0.027  susie_0             0

# cs_95 = "susie_0" but cs_95_purity = 0 on every row:
#   SuSiE found one CS spanning the whole region (all 1024 SNPs),
#   purity = 0 -> not a meaningful credible set.
# In real data (n >= 500): expect PIP > 0.5 lead variants
#   and cs_95_purity >= 0.5 with 1-10 SNPs per CS.

TWAS Weights Result: TwasWeights S4#

The TWAS weights RDS holds weights from all methods × all contexts for one gene. Top-level structure via attr(tw, "listData"):

Field

Description

study

Study name

context

Context label for each entry (repeated per method)

trait

Gene/trait ID

method

Method name: mrash, susie, susie_inf, enet, lasso, mcp, scad, l0learn, bayes_r, bayes_c, ensemble

entry

List of TwasWeightsEntry S4 objects (N_contexts × 11 methods total)

Each TwasWeightsEntry is accessed via attr(entry, "<name>")not $ notation:

Attribute

Description

variantIds

SNP IDs (chr:pos:ref:alt) for all variants in the region

weights

Per-SNP TWAS weights — the β vector used in Z_TWAS = wz / √(wᵀRw)

fits

Fitted model object (class-specific: mr.ash, susie, glmnet, etc.)

cvPerformance

CV results — for per-method entries: $metrics with corr, rsq, adj_rsq, pval, RMSE, MAE; for ensemble entry: $methodCoef (ζ per method) and $methodPerformance (CV R² per method)

standardized

Whether X/Y were z-score standardized before fitting (default: FALSE)

CV performance metrics tell you how well each method predicts expression using out-of-fold cross-validation:

  • rsq / adj_rsq: cross-validated R² — the primary metric for method quality

  • corr: Pearson correlation between predicted and observed expression

  • pval: p-value of the cross-validated prediction

Ensemble methodCoef (ζ): non-negative weights summing to 1 that determine how much each method contributes to the final ensemble prediction. w_ensemble = Σ ζ_k × w_k.

suppressPackageStartupMessages(library(pecotmr))

tw  <- readRDS("output_uni/twas_weights/protocol_example.ENSG00000130538.univariate_twas_weights.rds")
class(tw)     # "TwasWeights"

ld  <- attr(tw, "listData")
unique(ld$context)   # "context1"  "context2"
unique(ld$method)    # "mrash" "susie" "susie_inf" "enet" "lasso" "mcp" "scad" "l0learn" "bayes_r" "bayes_c" "ensemble"
length(ld$entry)     # 22

# Inspect one entry (context1 / susie)
e_susie <- ld$entry[ld$context == "context1" & ld$method == "susie"][[1]]
wt <- attr(e_susie, "weights")
cat("non-zero weights:", sum(wt != 0), "/", length(wt), "\n")

m <- attr(e_susie, "cvPerformance")$metrics
round(m[c("corr","rsq","adj_rsq","pval")], 4)
#    corr     rsq adj_rsq    pval
#  0.9798  0.9599  0.9591  0.0000

# Method summary for context1
idx1 <- which(ld$context == "context1")
for (i in idx1) {
  e <- ld$entry[[i]]
  m <- tryCatch(attr(e,"cvPerformance")$metrics, error=function(e) NULL)
  rsq <- if (!is.null(m) && is.numeric(m)) round(m["rsq"], 3) else NA
  cat(sprintf("%-12s  rsq=%s\n", ld$method[i], rsq))
}
# mrash         rsq=0.466
# susie         rsq=0.960
# susie_inf     rsq=0.305
# enet          rsq=0.060
# lasso         rsq=0.160
# mcp           rsq=NA
# scad          rsq=0.005
# l0learn       rsq=0.031
# bayes_r       rsq=0.520
# bayes_c       rsq=0.577
# ensemble      rsq=NA

# Ensemble coefficients
ens <- ld$entry[ld$context == "context1" & ld$method == "ensemble"][[1]]
cv  <- attr(ens, "cvPerformance")
round(cv$methodCoef[cv$methodCoef > 0.001], 3)   # bayes_r=0.802, bayes_c=0.198
round(cv$methodPerformance, 3)

Interpretation:

  • twas_weights$<method> — use for TWAS (Z weights · LD · GWAS z-scores).

  • twas_cv_result$performance — pick the method with highest adj_rsq before applying.

  • top_loci in univariate_bvsr.rds — lead variants and credible-set coverages.

  • susie_result_trimmed$sets$cs — indices of variants in each 95% credible set.

Common pitfalls#

  • Sample-ID drift. The phenotype BED’s header columns (5+), the bfile FID/IID, and the covariate TSV header must all be drawn from the same set. Sos does not warn on missing samples — it silently drops them and may report zero overlap.

  • Empty regions. If a region has no variants in its association window (e.g. the manifest points at a window with no bfile coverage) sos marks it completed with an empty output. Verify the expected file count in the INFO: susie_twas output: line.

  • --cis-window vs --customized-association-windows. Either/or. If both are passed the customized windows win. Pass --cis-window -1 to force the customized path explicitly.