Steps#
Univariate Fine-Mapping and TWAS with SuSiE#
Description#
This pipeline performs per-region univariate fine-mapping with SuSiE and computes TWAS
weights, via the susie_twas and mnm workflows of pipeline/mnm_regression.ipynb.
qtl_dataset_construct+susie_twas— builds aQtlDatasetRDS from genotype/phenotype/covariate files, then for each region fits SuSiE fine-mapping and trains TWAS prediction weights with 10 methods (mrash, susie, susie_inf, enet, lasso, mcp, scad, l0learn, bayes_r, bayes_c), plus a CV-based ensemble weight. Produces*.univariate_bvsr.rds(fine-mapping) and*.univariate_twas_weights.rds(TWAS weights).mnm— reads allsusie_twasoutputs (via--fine-mapping-meta), pools CV predictions across multiple contexts, solves a constrained QP to find the optimal method/context mixture, and produces*.multicontext_twas_weights.rds.
SuSiE searches for up to L=5 independent signals per region. Covariates are regressed out of both phenotype and genotype matrices before fitting (SuSiE itself does not accept covariates).
What is a “region”?#
A region is one row in the phenotype manifest (--phenoFile) plus its association
window (from --customized-association-windows). One susie_twas call fits one SuSiE + TWAS model per region.
Gene-based phenotypes. Manifest row = one gene; window = gene’s TSS ± cis window (or from a pre-built TADB BED).
Peak-based phenotypes (caQTL, mQTL). Manifest row = one peak/CpG; window = the peak’s containing extended-TAD.
Input Files#
File |
Description |
|---|---|
|
Phenotype manifest mapping each region to its phenotype/covariate files |
|
Association-test windows (one row per region) |
|
Per-region phenotype data, bgzipped and tabix-indexed |
Step 1 – Prepare the phenotype manifest and association-windows BED#
The pipeline expects two derived files per experiment:
pheno_manifest.tsv— 5 columns (#chr start end ID path), one row per region.pathpoints to a bgzipped+tabixed phenotype BED restricted to that region.association_windows.bed— 4 columns (chr start end ID), one row per region.start/enddefine the cis search window;IDmust match a row in the manifest.
For gene-based phenotypes these files are produced by the standard phenotype preprocessing + TADB generation steps.
For peak-based phenotypes (caQTL, mQTL) you can derive both from the raw peak BED with the helper below; it renames the peak-ID header, splits the multi-sample BED into one bgz+tbi per peak, writes the manifest, and annotates each peak with its smallest containing extended-TAD.
bash code/misc/preprocess/prepare_peak_inputs.sh \
<raw_peaks.bed.gz> \
<extended_TADB.bed> \
<out_dir>
# Produces:
# <out_dir>/pheno_manifest.tsv
# <out_dir>/association_windows.bed
# <out_dir>/peaks_split/<ID>.bed.gz{,.tbi}
Step 2 – Run susie_twas + mnm#
Gene-based example (single gene, 2-context toy dataset):
# Step 1: build QtlDataset + run SuSiE fine-mapping + TWAS weights (all-in-one)
sos run pipeline/mnm_regression.ipynb qtl_dataset_construct+susie_twas \
--name protocol_example --cwd output_uni \
--genoFile input/colocboost/example.chr22.bed \
--phenoFile input/colocboost/pheno_manifest_multicontext.tsv \
--covFile input/colocboost/example_covariates.tsv \
--customized-association-windows input/colocboost/association_windows.bed \
--region-name ENSG00000130538 --transpose-covariates --save-data -j1
# Step 2: multi-context ensemble TWAS weights (mnm)
sos run pipeline/mnm_regression.ipynb mnm \
--name protocol_example --cwd output_mnm \
--genoFile input/colocboost/example.chr22.bed \
--phenoFile input/colocboost/pheno_manifest_multicontext.tsv \
--covFile input/colocboost/example_covariates.tsv \
--customized-association-windows input/colocboost/association_windows.bed \
--fine-mapping-meta input/colocboost/fine_mapping_meta.tsv \
--transpose-covariates --save-data -j1
Peak-based example (same pipeline, swap phenotype/covariate inputs):
# Peak/chromatin example (same pipeline, different input)
sos run pipeline/mnm_regression.ipynb susie_twas \
--name protocol_example \
--cwd output_10peaks \
--genoFile input/genotype/protocol_example.genotype.chr22.bed \
--phenoFile input/phenotype/protocol_example.pheno_manifest.tsv \
--covFile input/covariate/protocol_example.covariates.tsv \
--customized-association-windows input/finemapping/protocol_example.association_windows.bed \
-j1
Expected output files after susie_twas:
output_uni/fine_mapping/protocol_example.ENSG00000130538.univariate_bvsr.rds
output_uni/twas_weights/protocol_example.ENSG00000130538.univariate_twas_weights.rds
output_uni/fine_mapping_meta.tsv ← auto-generated, used by mnm as --fine-mapping-meta
After mnm:
output_mnm/twas_weights/protocol_example.ENSG00000130538.multicontext_twas_weights.rds
Command Interface#
sos run pipeline/mnm_regression.ipynb -h
Key parameters for qtl_dataset_construct+susie_twas:
Parameter |
Description |
|---|---|
|
Study name prefix; all output files are named |
|
Output root directory; subdirs |
|
PLINK1 |
|
Phenotype manifest TSV — columns |
|
Covariate TSV — rows = covariates, columns = samples (use |
|
4-column |
|
Restrict analysis to one or more region IDs (space-separated); omit to run all regions |
|
Transpose covariate file before reading (needed when rows = samples) |
|
Also emit |
|
Number of parallel jobs (set to 1 for debugging) |
Additional parameter for mnm:
Parameter |
Description |
|---|---|
|
TSV auto-generated by |
Output Format#
All outputs are written under --cwd, one set per region:
Fine-mapping (output_uni/fine_mapping/):
{name}.{region_id}.univariate_bvsr.rds—QtlFineMappingResultS4 object, one entry per context
TWAS weights (output_uni/twas_weights/):
{name}.{region_id}.univariate_twas_weights.rds—TwasWeightsS4 object (per-method + ensemble)
Ensemble TWAS weights (after mnm step, output_mnm/twas_weights/):
{name}.{region_id}.multicontext_twas_weights.rds— multi-context ensembleTwasWeightsS4
QtlFineMappingResult S4 structure#
Top-level via attr(rds, "listData"):
Field |
Description |
|---|---|
|
Study name (from |
|
Context label per entry |
|
Gene/trait ID |
|
Fine-mapping method ( |
|
List of |
Each FineMappingEntry has these attributes:
Attribute |
Description |
|---|---|
|
SNP IDs ( |
|
Raw |
|
Per-variant summary data.frame (see column reference in Anticipated Results) |
|
Cross-validation result used in TWAS weight computation |
TwasWeights S4 structure#
Top-level via attr(tw, "listData"):
Field |
Description |
|---|---|
|
Study name |
|
Context label per entry (repeated per method) |
|
Gene/trait ID |
|
Method name: |
|
List of |
Each TwasWeightsEntry accessed via attr(entry, "<name>"):
Attribute |
Description |
|---|---|
|
SNP IDs in the region |
|
Per-SNP TWAS weights (β from genotype→expression model) |
|
Fitted model object (class-specific) |
|
CV results: |
|
Whether X/Y were standardized (default |
Anticipated Results#
Quick sanity check — inspect cross-validation performance for the first region:
Fine-Mapping Result: QtlFineMappingResult S4#
The fine-mapping RDS holds all contexts for one gene in a single object.
Access the structured data via attr(rds, "listData"):
Field |
Description |
|---|---|
|
Study name string (from |
|
Context label for each entry (one entry per context) |
|
Gene/trait ID (Ensembl) |
|
Fine-mapping method used ( |
|
List of |
Each FineMappingEntry has these attributes:
Attribute |
Class |
Description |
|---|---|---|
|
character |
SNP IDs in the region ( |
|
susie |
Raw |
|
data.frame |
Per-variant summary table (see column reference below) |
|
list |
Cross-validation result used to compute TWAS weights |
topLoci column reference (1024 rows × 23 columns for this example):
Column |
Description |
|---|---|
|
SNP ID ( |
|
Chromosome and position |
|
Effect and reference alleles |
|
Sample size |
|
Allele frequency of A1 |
|
Marginal effect estimate and standard error |
|
Marginal z-score and p-value |
|
Posterior Inclusion Probability — probability this variant is causal (0–1) |
|
Posterior effect size estimate and uncertainty |
|
Which credible set the variant belongs to at each coverage level; empty string = not in any CS |
|
Min pairwise LD r² within the 95% CS; ≥ 0.5 required for a credible CS |
|
Fine-mapping method ( |
|
Gene ID and molecular event type |
|
Association window boundaries |
suppressPackageStartupMessages(library(pecotmr))
# Load QtlFineMappingResult
rds <- readRDS("output_uni/fine_mapping/protocol_example.ENSG00000130538.univariate_bvsr.rds")
class(rds) # "QtlFineMappingResult"
ld <- attr(rds, "listData")
ld$context # "context1" "context2"
ld$method # "susie" "susie"
# --- topLoci: dimensions and column names ---
e1 <- ld$entry[[1]] # context1
tl <- attr(e1, "topLoci")
dim(tl) # 1024 x 23 (one row per variant in the region)
names(tl)
# [1] "variant_id" "chrom" "pos" "A1" "A2"
# [6] "N" "af" "marginal_beta" "marginal_se" "marginal_z"
# [11] "marginal_p" "pip" "posterior_mean" "posterior_sd" "cs_95"
# [16] "cs_70" "cs_50" "cs_95_purity" "method" "gene"
# [21] "event" "grange_start" "grange_end"
# --- Step 1: check for valid credible sets ---
sf <- attr(e1, "susieFit")
sf$sets$cs # NULL -> no credible set passed purity filter (toy data n=49)
sf$lbf # 0.027 0.027 0.027 0.028 0.028 -> all near zero -> no detectable signal
# --- Step 2: top variants by PIP ---
tl_s <- tl[order(tl$pip, decreasing = TRUE), ]
head(tl_s[, c("variant_id", "af", "marginal_z", "marginal_p",
"pip", "cs_95", "cs_95_purity")], 5)
# variant_id af marginal_z marginal_p pip cs_95 cs_95_purity
# chr22:16445263:A:T 0.837 3.482 4.99e-04 0.032 susie_0 0
# chr22:16418860:A:T 0.704 -3.436 5.91e-04 0.031 susie_0 0
# chr22:16214125:A:T 0.612 3.307 9.42e-04 0.027 susie_0 0
# cs_95 = "susie_0" but cs_95_purity = 0 on every row:
# SuSiE found one CS spanning the whole region (all 1024 SNPs),
# purity = 0 -> not a meaningful credible set.
# In real data (n >= 500): expect PIP > 0.5 lead variants
# and cs_95_purity >= 0.5 with 1-10 SNPs per CS.
TWAS Weights Result: TwasWeights S4#
The TWAS weights RDS holds weights from all methods × all contexts for one gene.
Top-level structure via attr(tw, "listData"):
Field |
Description |
|---|---|
|
Study name |
|
Context label for each entry (repeated per method) |
|
Gene/trait ID |
|
Method name: |
|
List of |
Each TwasWeightsEntry is accessed via attr(entry, "<name>") — not $ notation:
Attribute |
Description |
|---|---|
|
SNP IDs ( |
|
Per-SNP TWAS weights — the β vector used in Z_TWAS = wᵀz / √(wᵀRw) |
|
Fitted model object (class-specific: |
|
CV results — for per-method entries: |
|
Whether X/Y were z-score standardized before fitting (default: |
CV performance metrics tell you how well each method predicts expression using out-of-fold cross-validation:
rsq/adj_rsq: cross-validated R² — the primary metric for method qualitycorr: Pearson correlation between predicted and observed expressionpval: p-value of the cross-validated prediction
Ensemble methodCoef (ζ): non-negative weights summing to 1 that determine how much each method contributes to the final ensemble prediction. w_ensemble = Σ ζ_k × w_k.
suppressPackageStartupMessages(library(pecotmr))
tw <- readRDS("output_uni/twas_weights/protocol_example.ENSG00000130538.univariate_twas_weights.rds")
class(tw) # "TwasWeights"
ld <- attr(tw, "listData")
unique(ld$context) # "context1" "context2"
unique(ld$method) # "mrash" "susie" "susie_inf" "enet" "lasso" "mcp" "scad" "l0learn" "bayes_r" "bayes_c" "ensemble"
length(ld$entry) # 22
# Inspect one entry (context1 / susie)
e_susie <- ld$entry[ld$context == "context1" & ld$method == "susie"][[1]]
wt <- attr(e_susie, "weights")
cat("non-zero weights:", sum(wt != 0), "/", length(wt), "\n")
m <- attr(e_susie, "cvPerformance")$metrics
round(m[c("corr","rsq","adj_rsq","pval")], 4)
# corr rsq adj_rsq pval
# 0.9798 0.9599 0.9591 0.0000
# Method summary for context1
idx1 <- which(ld$context == "context1")
for (i in idx1) {
e <- ld$entry[[i]]
m <- tryCatch(attr(e,"cvPerformance")$metrics, error=function(e) NULL)
rsq <- if (!is.null(m) && is.numeric(m)) round(m["rsq"], 3) else NA
cat(sprintf("%-12s rsq=%s\n", ld$method[i], rsq))
}
# mrash rsq=0.466
# susie rsq=0.960
# susie_inf rsq=0.305
# enet rsq=0.060
# lasso rsq=0.160
# mcp rsq=NA
# scad rsq=0.005
# l0learn rsq=0.031
# bayes_r rsq=0.520
# bayes_c rsq=0.577
# ensemble rsq=NA
# Ensemble coefficients
ens <- ld$entry[ld$context == "context1" & ld$method == "ensemble"][[1]]
cv <- attr(ens, "cvPerformance")
round(cv$methodCoef[cv$methodCoef > 0.001], 3) # bayes_r=0.802, bayes_c=0.198
round(cv$methodPerformance, 3)
Interpretation:
twas_weights$<method>— use for TWAS (Z ≈ weights · LD · GWAS z-scores).twas_cv_result$performance— pick the method with highestadj_rsqbefore applying.top_lociinunivariate_bvsr.rds— lead variants and credible-set coverages.susie_result_trimmed$sets$cs— indices of variants in each 95% credible set.
Common pitfalls#
Sample-ID drift. The phenotype BED’s header columns (5+), the bfile FID/IID, and the covariate TSV header must all be drawn from the same set. Sos does not warn on missing samples — it silently drops them and may report zero overlap.
Empty regions. If a region has no variants in its association window (e.g. the manifest points at a window with no bfile coverage) sos marks it completed with an empty output. Verify the expected file count in the
INFO: susie_twas output:line.--cis-windowvs--customized-association-windows. Either/or. If both are passed the customized windows win. Pass--cis-window -1to force the customized path explicitly.