Fine-mapping QTLs and GWAS with the pecotmr SuSiE Wrapper
pecotmr authors
2026-06-11
Source:vignettes/fine-mapping.Rmd
fine-mapping.RmdOverview
Fine-mapping reduces a region with one or more associations to a
small set of credible causal variants. pecotmr wraps SuSiE (Sum of Single
Effects) and SuSiE-RSS (its summary-statistics analog) and adds:
- A two-stage fit — SuSiE-inf first, then SuSiE initialized from the SuSiE-inf result — that improves credible-set purity in regions with background polygenic signal.
- Unified post-processing producing the same
22-column
top_locitable regardless of whether the input was individual-level data or summary statistics, and regardless of whether you fit SuSiE, SuSiE-inf, SuSiE-RSS, fSuSiE, or mvSuSiE. - Pipeline-level entry points that bundle loading, QC, fine-mapping, and (for individual-level data) TWAS-weight learning.
This vignette demonstrates all three workflows:
| Data you have | Entry point | Example dataset |
|---|---|---|
| Individual-level genotype + phenotype (QTL studies) | univariate_analysis_pipeline() |
eqtl_region_example |
| QTL summary statistics + LD reference | rss_analysis_pipeline() |
placeholder (real QTL nominal output) |
| GWAS summary statistics + LD reference | rss_analysis_pipeline() |
placeholder (real GWAS sumstats) |
For SuSiE theory, see the susieR
vignettes; this vignette focuses on how to drive it through
pecotmr.
Quick start: individual-level fine-mapping
eqtl_region_example contains a genotype matrix
X (415 samples × 2,828 variants) and a residualized
molecular phenotype y_res. Below we run the full pipeline —
initial PIP-screening, SuSiE-inf, SuSiE initialised from SuSiE-inf,
post-processing, and (optionally) TWAS-weight learning — in a single
call.
data(eqtl_region_example)
X <- eqtl_region_example$X
y <- as.numeric(eqtl_region_example$y_res)
maf <- apply(X, 2, function(g) {
f <- mean(g, na.rm = TRUE) / 2
min(f, 1 - f)
})
# Subset to a smaller window for a fast vignette build; the same call works
# on the full region.
set.seed(42)
sub <- sort(sample(ncol(X), 500))
fit <- univariate_analysis_pipeline(
X = X[, sub], Y = y, maf = maf[sub],
region = "chr22:32119000-33000000",
twas_weights = FALSE,
finemapping_extra_opts = list(refine = FALSE),
verbose = 0
)The return value is a list with the fitted models and the unified
top_loci table:
names(fit)## [1] "susie_inf_fitted" "susie_fitted"
## [3] "finemapping_result" "sumstats"
## [5] "other_quantities" "top_loci"
## [7] "susie_inf_result_trimmed" "total_time_elapsed"
dim(fit$top_loci)## [1] 22 22
## variant pip cs_95 cs_70 cs_95_purity method
## 1 chr22:32526384:T:G 0.02736065 susie_0 susie_0 0 susie
## 2 chr22:32538691:A:G 0.02736065 susie_0 susie_0 0 susie
## 3 chr22:32595330:A:G 0.12089319 susie_0 susie_0 0 susie
## 4 chr22:32607002:T:C 0.03388883 susie_0 susie_0 0 susie
## 5 chr22:32607795:T:C 0.03324444 susie_0 susie_0 0 susie
The methods column shows which model each row came from — the pipeline runs both SuSiE-inf and SuSiE by default and row-binds their top loci. For most downstream consumers, the SuSiE rows are what you want; SuSiE-inf rows are useful as a sanity check on inferred non-sparse effects.
Choosing among SuSiE variants
pecotmr exposes five SuSiE-family fine-mapping variants.
Pick which to fit via the methods parameter on
univariate_analysis_pipeline() (individual-level) or
rss_analysis_pipeline() / susie_rss_pipeline()
(summary statistics):
| Variant | Individual-level | Summary statistics | What it models |
|---|---|---|---|
| SuSiE alone | methods = "susie" |
methods = "susie_rss" |
Sparse causal effects only. No infinitesimal/background term. Lightest weight; can fold polygenic background into noisy “extra” credible sets. |
| SuSiE with SuSiE-inf init (default) |
methods = c("susie_inf", "susie") +
add_susie_inf = TRUE
|
methods = c("susie_inf_rss", "susie_rss") +
add_susie_inf = TRUE
|
Two-stage: SuSiE-inf absorbs background polygenicity first, SuSiE then refines sparse signals from that initialisation. Recommended default for most QTL and GWAS fine-mapping. |
| SuSiE-inf alone | methods = "susie_inf" |
methods = "susie_inf_rss" |
Sparse + infinitesimal joint fit. Useful for regions with strong polygenic background where the SuSiE refinement step is undesired. |
| SuSiE-ash alone | methods = "susie_ash" |
methods = "susie_ash_rss" |
Sparse + adaptive shrinkage (ASH) mixture. Useful when you suspect a mix of medium-effect and small-effect variants beyond a small handful of large ones. |
| SuSiE-ash with SuSiE-inf init |
methods = c("susie_inf", "susie_ash") +
add_susie_inf = TRUE
|
methods = c("susie_inf_rss", "susie_ash_rss") +
add_susie_inf = TRUE
|
Same chained two-stage idea as the SuSiE variant but with ASH on the downstream fit: SuSiE-inf absorbs polygenic background first, then SuSiE-ash refines using the mixture prior. |
The chained SuSiE-inf → SuSiE two-stage is the default both for backward compatibility and because it produces clean credible sets in most settings: SuSiE alone can fold polygenic background into noisy “extra” credible sets; SuSiE-inf absorbs that background first so the SuSiE refinement concentrates on sparse signals.
add_susie_inf = TRUE chains SuSiE-inf into
any of "susie" or "susie_ash"
that appear in methods. Listing all three
(methods = c("susie_inf", "susie", "susie_ash") with
add_susie_inf = TRUE) fits SuSiE-inf once, then both SuSiE
and SuSiE-ash initialised from it — useful for direct head-to-head
comparison on the same warm start.
All five populate the same 22-column top_loci table.
When you fit multiple methods in one call, the rows are row-bound and
labelled by the method column.
Note: SuSiE-inf chain init only carries information when the SuSiE-inf fit finds non-zero mappable effects. On a region where SuSiE-inf explains everything via its infinitesimal component (all
V == 0, allmu == 0), upstreamsusieRcorrectly treats the init as empty, and the chained downstream fit collapses to the independent run. This is the right behaviour, not a bug — but it means head-to-head differences only show up on regions with at least one sparse signal.
Backward compatibility with add_susie_inf
The pre-existing add_susie_inf boolean still works
exactly as before when methods is NULL (the
default):
-
add_susie_inf = TRUE(default) → fit SuSiE-inf, then chain it into SuSiE. Equivalent tomethods = c("susie_inf", "susie"). -
add_susie_inf = FALSE→ plain SuSiE alone. Equivalent tomethods = "susie".
When methods is set explicitly,
add_susie_inf controls whether the chained-init shortcut is
applied to any "susie" or "susie_ash" method
that appears alongside "susie_inf". Set
add_susie_inf = FALSE to fit each requested method
independently rather than chained.
Fitting multiple variants in one call
multi <- univariate_analysis_pipeline(
X = X[, sub], Y = y, maf = maf[sub],
methods = c("susie_inf", "susie", "susie_ash"),
twas_weights = FALSE,
signal_cutoff = 0
)
# top_loci has rows for all three methods; filter on `method` to inspect each
table(multi$top_loci$method)Lower-level entry points
If you want the raw fits without pecotmr’s
post-processing:
# Two-stage convenience for the recommended default:
fits <- fit_susie_inf_then_susie(X[, sub], y) # list(susie, susie_inf)
fits_rss <- fit_susie_inf_then_susie_rss(z, R, n)
# Or call susieR directly with the right unmappable_effects flag:
inf_fit <- susieR::susie(X, y, unmappable_effects = "inf",
convergence_method = "pip", refine = FALSE)
ash_fit <- susieR::susie(X, y, unmappable_effects = "ash",
convergence_method = "pip")
# Then post-process any combination through the unified pipeline:
post <- postprocess_finemapping_fits(
fits = list(susie_inf = inf_fit, susie_ash = ash_fit),
data_x = X[, sub], data_y = y, maf = maf[sub],
coverage = 0.95, secondary_coverage = c(0.7, 0.5)
)
out <- format_finemapping_output(post, primary_method = "susie_inf")The unified top_loci table
postprocess_finemapping_fits() produces a single table
with the following 22 columns, in this order:
| # | Column | Source |
|---|---|---|
| 1 | #chr |
parsed from variant
|
| 2 | start |
pos - 1 (BED 0-based) |
| 3 | end |
pos (BED half-open) |
| 4 | a1 |
from parsed variant
|
| 5 | a2 |
from parsed variant
|
| 6 | variant |
the row’s variant id (chr:pos:A2:A1) |
| 7 | gene |
phenotype identifier (colnames(data_y)[1]) |
| 8 | event |
paste(condition_id, gene, sep = "_") |
| 9 | n |
analyzed sample count |
| 10 | maf |
from the maf argument |
| 11 | beta |
univariate effect (sumstats$betahat) |
| 12 | se |
univariate SE (sumstats$sebetahat) |
| 13 | pip |
per-variant SuSiE PIP |
| 14 | posterior_effect_mean |
colSums(alpha * mu) |
| 15 | posterior_effect_se |
sqrt(pmax(colSums(alpha * mu2) - mean^2, 0)) |
| 16 | cs_95 |
CS membership at coverage 0.95 (see below) |
| 17 | cs_70 |
CS membership at coverage 0.70 |
| 18 | cs_50 |
CS membership at coverage 0.50 |
| 19 | cs_95_purity |
0.95-coverage CS purity for this row |
| 20 | method |
"susie", "susie_inf",
"susie_rss", … |
| 21 | grange_start |
parsed from the region argument |
| 22 | grange_end |
parsed from the region argument |
CS membership strings
cs_95, cs_70, cs_50 are
character strings, not integers. Each value is
<method>_<cs_index> where
<cs_index> is the per-method 1-based credible-set
number:
-
susie_1,susie_2, … for SuSiE credible sets -
susie_inf_1, … for SuSiE-inf credible sets -
<method>_0is the sentinel for “PIP-only retention, not in any CS at this coverage”
CS indices are not sequenced across methods — each method numbers its credible sets from 1.
Row uniqueness
The row identity is
(variant, gene, method, cs_membership):
- A variant in CS 1 and CS 2 of the same method (overlapping CS) gives two rows.
- A variant in both
susieCS 1 andsusie_infCS 1 gives two rows with the same(variant, gene)but differentmethod. - A PIP-retained variant not in any CS gives one row
per method with
cs_* = "<method>_0"andcs_95_purity = 0.
This shape preserves overlapping-CS evidence both within and across methods. Downstream consumers (xqtl-protocol, transMap, ColocBoost, TWAS) read the columns above; the schema is the contract.
QTL summary-statistics fine-mapping
When you have z-scores instead of X and Y,
use rss_analysis_pipeline(). It bundles:
-
load_rss_data()— tabix-region-extracts the sumstats and standardises column names via a YAML mapping (see the RSS QC vignette for the YAML schema). -
summary_stats_qc()— runs allele harmonization, SLALOM (default) or DENTIST QC, and optional RAISS imputation. The vignette referenced above covers QC in detail. -
susie_rss_pipeline()— fits the two-stage SuSiE-RSS model and post-processes into the unifiedtop_locitable.
Pipeline call (file-based, production shape)
# TODO(data): replace with real QTL sumstats and LD-metadata paths
qtl_sumstat_path <- "<path/to/qtl_nominal.tsv.gz>"
qtl_column_map <- "<path/to/qtl_column_map.yml>"
ld_meta_path <- "<path/to/ld_meta.tsv>"
region <- "<chr:start-end>"
gene_id <- "<ENSG...>"
n_qtl_sample <- 0 # 0 = read from sumstats
ld_data <- load_LD_matrix(ld_meta_path, region = region)
fm <- rss_analysis_pipeline(
sumstat_path = qtl_sumstat_path,
column_file_path = qtl_column_map,
LD_data = ld_data,
n_sample = n_qtl_sample,
region = region,
extract_region_name = gene_id,
region_name_col = 4L,
qc_method = "slalom",
finemapping_method = "susie_rss",
finemapping_opts = list(L = 20, L_greedy = 5,
coverage = c(0.95, 0.7, 0.5),
signal_cutoff = 0.025,
min_abs_corr = 0.8),
impute = TRUE
)
head(fm$rss_data_analyzed[[1]]$top_loci)Lower-level alternative (in-memory, runnable here)
For demonstration on the shipped example data, we skip the file-based
loading and use susie_rss_pipeline() directly with the
sumstats and LD already in memory. This is what
rss_analysis_pipeline() calls internally after QC.
data(gwas_sumstats_example)
data(eqtl_region_example)
# We treat the shipped GWAS-style sumstats as if they were QTL z-scores
# for this demonstration. In a real workflow you would load QTL nominal
# output via load_rss_data().
sumstats <- gwas_sumstats_example
X_ref <- eqtl_region_example$X
R <- compute_LD(X_ref, method = "sample")
R_sub <- R[sumstats$variant_id, sumstats$variant_id]
fm_rss <- susie_rss_pipeline(
sumstats = sumstats,
LD_mat = R_sub,
n = nrow(X_ref),
L = 10, L_greedy = 5,
analysis_method = "susie_rss",
coverage = 0.95,
secondary_coverage = c(0.7, 0.5)
)
names(fm_rss)## [1] "finemapping_result" "sumstats" "top_loci"
dim(fm_rss$top_loci)## [1] 20 22
## variant pip cs_95 cs_95_purity method
## 1 chr22:32337604:C:T 0.19820547 susie_rss_3 0.9846607 susie_rss
## 2 chr22:32350609:A:G 0.09322469 susie_rss_3 0.9846607 susie_rss
## 3 chr22:32354393:A:G 0.14380650 susie_rss_3 0.9846607 susie_rss
## 4 chr22:32362470:T:C 0.37141549 susie_rss_3 0.9846607 susie_rss
## 5 chr22:32362793:A:G 0.14992585 susie_rss_3 0.9846607 susie_rss
Note that the output schema is the same 22 columns as the
individual-level pipeline — method is
"susie_rss" instead of "susie", but everything
downstream (ColocBoost, xqtl-enrichment, TWAS) consumes the same
table.
GWAS summary-statistics fine-mapping
GWAS fine-mapping uses the same
rss_analysis_pipeline() / susie_rss_pipeline()
call. The only differences are at the loading layer:
-
Column mapping. GWAS conventions
(effect/non-effect, OR vs beta, per-variant
n) differ from QTL nominal output. The RSS QC vignette shows the YAML schema; common GWAS alternatives areeffect_allele/other_alleleforA1/A2,OR(you’ll needlog(OR)forbeta), andNforn_sample. -
No
extract_region_name. GWAS sumstats are not gene-keyed, so you slice by physical region only. Passregion = "chr:start-end"and omitextract_region_name/region_name_col. -
Per-variant sample size. GWAS sumstats often have a
per-variant
n_samplecolumn (some variants were imputed in fewer cohorts);load_rss_data()reads that automatically when the column is mapped.
# TODO(data): replace with real GWAS sumstats and LD-metadata paths
gwas_sumstat_path <- "<path/to/gwas.tsv.gz>"
gwas_column_map <- "<path/to/gwas_column_map.yml>"
ld_meta_path <- "<path/to/ld_meta.tsv>"
region <- "<chr:start-end>"
ld_data <- load_LD_matrix(ld_meta_path, region = region)
fm_gwas <- rss_analysis_pipeline(
sumstat_path = gwas_sumstat_path,
column_file_path = gwas_column_map,
LD_data = ld_data,
n_case = 0, # set if known; 0 reads from sumstats
n_control = 0,
region = region,
qc_method = "slalom",
finemapping_method = "susie_rss",
impute = TRUE
)The shipped gwas_finemapping_example dataset shows what
the output looks like for a real GWAS region:
## List of 1
## $ :List of 4
## ..$ alpha: num [1:10, 1:2828] 6.65e-29 0.00 1.08e-11 0.00 0.00 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ pip : Named num [1:2828] 1.08e-11 1.08e-11 3.36e-11 1.04e-11 1.05e-11 ...
## .. ..- attr(*, "names")= chr [1:2828] "chr22:32119788:T:C" "chr22:32119867:T:G" "chr22:32119961:T:G" "chr22:32120053:T:C" ...
## ..$ V : num [1:10] 3.16e-04 1.28e+01 1.19e-04 1.29e+01 1.29e+01 ...
## ..$ sets :List of 5
length(gwas_finemapping_example[[1]]$pip)## [1] 2828
sum(gwas_finemapping_example[[1]]$pip > 0.1)## [1] 10
gwas_finemapping_example ships in the legacy
susie_result_trimmed shape used by older pecotmr pipelines
(PIPs, alpha matrix, credible sets). New workflows should consume the
22-column top_loci table produced by
rss_analysis_pipeline() instead.
Downstream consumers
The top_loci table feeds directly into several other
pecotmr workflows:
-
Enrichment (
xqtl_enrichment_wrapper): computes prior enrichment of QTL fine-mapping signals at GWAS PIPs. See the enrichment vignette. -
Colocalization (
colocboost_pipeline): uses the trimmed SuSiE fits —finemapping_resultslot — as colocalization input. See the ColocBoost vignette. -
TWAS (
twas_weights_pipeline,twas_pipeline): uses the SuSiE effect estimates as weight learners for transcriptome-wide association testing. See the TWAS weights vignette and the TWAS Z-score vignette.
Summary
| Function | Use it when |
|---|---|
univariate_analysis_pipeline() |
Individual-level data: you have X (samples × variants),
Y (phenotype). Pick variant(s) via
methods = c("susie", "susie_inf", "susie_ash"). |
rss_analysis_pipeline() |
Summary statistics (QTL or GWAS): you have z-scores in a tabix file
and an LD reference. Pick variant(s) via
methods = c("susie_rss", "susie_inf_rss", "susie_ash_rss"). |
susie_rss_pipeline() |
Lower-level: you already have in-memory z,
R (or X), n. Same
methods argument as
rss_analysis_pipeline(). |
fit_susie_inf_then_susie() /
fit_susie_inf_then_susie_rss()
|
Lowest-level convenience for the two-stage chained default (SuSiE-inf init → SuSiE). |
postprocess_finemapping_fits() +
format_finemapping_output()
|
Post-process any combination of susie,
susie_inf, susie_ash, susie_rss,
susie_inf_rss, susie_ash_rss,
mvsusie, fsusie fits into the unified
22-column top_loci. |
All five entry points produce — or can be post-processed into — the
same 22-column top_loci table documented above, which is
the contract all downstream pecotmr consumers read.
## R version 4.5.3 (2026-03-11)
## Platform: x86_64-conda-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS/LAPACK: /home/runner/work/pecotmr/pecotmr/.pixi/envs/default/lib/libopenblasp-r0.3.33.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] pecotmr_0.5.3
##
## loaded via a namespace (and not attached):
## [1] DBI_1.3.0 coloc_5.2.3
## [3] bitops_1.0-9 gridExtra_2.3
## [5] rlang_1.2.0 magrittr_2.0.5
## [7] otel_0.2.0 matrixStats_1.5.0
## [9] susieR_0.16.4 compiler_4.5.3
## [11] RSQLite_3.53.1 GenomicFeatures_1.62.0
## [13] colocboost_1.0.9 png_0.1-9
## [15] systemfonts_1.3.2 vctrs_0.7.3
## [17] quadprog_1.5-8 stringr_1.6.0
## [19] pkgconfig_2.0.3 crayon_1.5.3
## [21] fastmap_1.2.0 XVector_0.50.0
## [23] Rsamtools_2.26.0 rmarkdown_2.31
## [25] tzdb_0.5.0 UCSC.utils_1.6.1
## [27] ragg_1.5.2 purrr_1.2.2
## [29] bit_4.6.0 xfun_0.57
## [31] Rfast_2.1.5.2 cachem_1.1.0
## [33] cigarillo_1.0.0 GenomeInfoDb_1.46.2
## [35] jsonlite_2.0.0 blob_1.3.0
## [37] reshape_0.8.10 DelayedArray_0.36.0
## [39] tictoc_1.2.1 BiocParallel_1.44.0
## [41] irlba_2.3.7 parallel_4.5.3
## [43] R6_2.6.1 VariantAnnotation_1.56.0
## [45] bslib_0.11.0 stringi_1.8.7
## [47] RColorBrewer_1.1-3 rtracklayer_1.70.1
## [49] GenomicRanges_1.62.1 jquerylib_0.1.4
## [51] Rcpp_1.1.1-1.1 Seqinfo_1.0.0
## [53] SummarizedExperiment_1.40.0 knitr_1.51
## [55] R.utils_2.13.0 readr_2.2.0
## [57] IRanges_2.44.0 Matrix_1.7-5
## [59] tidyselect_1.2.1 viridis_0.6.5
## [61] abind_1.4-8 yaml_2.3.12
## [63] codetools_0.2-20 curl_7.1.0
## [65] plyr_1.8.9 lattice_0.22-9
## [67] tibble_3.3.1 Biobase_2.70.0
## [69] KEGGREST_1.50.0 S7_0.2.2
## [71] evaluate_1.0.5 desc_1.4.3
## [73] RcppParallel_5.1.11-2 Biostrings_2.78.0
## [75] pillar_1.11.1 MatrixGenerics_1.22.0
## [77] stats4_4.5.3 ieugwasr_1.1.0
## [79] generics_0.1.4 vroom_1.7.1
## [81] RCurl_1.98-1.19 hms_1.1.4
## [83] S4Vectors_0.48.0 ggplot2_4.0.3
## [85] scales_1.4.0 glue_1.8.1
## [87] tools_4.5.3 MungeSumstats_1.18.1
## [89] BiocIO_1.20.0 data.table_1.17.8
## [91] BSgenome_1.78.0 GenomicAlignments_1.46.0
## [93] fs_2.1.0 XML_3.99-0.22
## [95] grid_4.5.3 tidyr_1.3.2
## [97] AnnotationDbi_1.72.0 restfulr_0.0.16
## [99] cli_3.6.6 zigg_0.0.2
## [101] textshaping_1.0.5 viridisLite_0.4.3
## [103] mixsqp_0.3-54 S4Arrays_1.10.1
## [105] dplyr_1.2.1 gtable_0.3.6
## [107] R.methodsS3_1.8.2 sass_0.4.10
## [109] digest_0.6.39 BiocGenerics_0.56.0
## [111] SparseArray_1.10.8 rjson_0.2.23
## [113] htmlwidgets_1.6.4 farver_2.1.2
## [115] memoise_2.0.1 htmltools_0.5.9
## [117] pkgdown_2.2.0 R.oo_1.27.1
## [119] lifecycle_1.0.5 httr_1.4.8
## [121] bit64_4.8.2 MASS_7.3-65