Univariate Fine-Mapping and TWAS with SuSiE#
Overview#
This pipeline performs per-region univariate fine-mapping with SuSiE and computes TWAS
weights, via the susie_twas workflow of pipeline/mnm_regression.ipynb. For each
region the workflow fits SuSiE on the covariate-residualized phenotype × genotype
matrix and then cross-validates five prediction models (enet, lasso,
bayes_r, mrash, susie) to pick TWAS weights.
SuSiE searches for up to five independent signals per region; if all five are found, the cap is raised. SuSiE itself does not accept covariates, so covariates are regressed out of both the phenotype and the genotype matrix beforehand.
What is a “region”?#
A region is one row in the phenotype manifest (--phenoFile) plus its association
window (from --customized-association-windows, or a symmetric --cis-window around
start/end). One susie_twas call fits one SuSiE + TWAS model per region.
Gene-based phenotypes. Manifest row = one gene; window = gene’s containing TAD (or ± cis-window around the TSS).
Peak-based phenotypes (caQTL, mQTL). Manifest row = one peak/CpG; window = the peak’s containing extended-TAD. A peak is 2–5 kb and cannot serve as its own window.
Step 1 – Prepare the phenotype manifest and association-windows BED#
The pipeline expects two derived files per experiment:
pheno_manifest.tsv— 5 columns (#chr start end ID path), one row per region.pathpoints to a bgzipped+tabixed phenotype BED restricted to that region.association_windows.bed— 4 columns (chr start end ID), one row per region.start/enddefine the cis search window;IDmust match a row in the manifest.
For gene-based phenotypes these files are produced by the standard phenotype preprocessing + TADB generation steps.
For peak-based phenotypes (caQTL, mQTL) you can derive both from the raw peak BED with the helper below; it renames the peak-ID header, splits the multi-sample BED into one bgz+tbi per peak, writes the manifest, and annotates each peak with its smallest containing extended-TAD.
bash code/misc/preprocess/prepare_peak_inputs.sh \
<raw_peaks.bed.gz> \
<extended_TADB.bed> \
<out_dir>
# Produces:
# <out_dir>/pheno_manifest.tsv
# <out_dir>/association_windows.bed
# <out_dir>/peaks_split/<ID>.bed.gz{,.tbi}
Step 2 – Run susie_twas#
Gene-based example (3 genes, default bulk-RNA-seq preprocessing):
sos run pipeline/mnm_regression.ipynb susie_twas \
--name test_susie_twas \
--cwd output/mnm_regression/susie_twas \
--genoFile output/genotype_by_chrom/wgs.merged.plink_qc.1.bed \
--phenoFile output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.region_list.txt \
--covFile output/covariate/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.covariates.wgs.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \
--customized-association-windows reference_data/TAD/TADB_enhanced_cis.bed \
--phenotype-names test_pheno \
--max-cv-variants 5000 \
--ld_reference_meta_file data/ld_meta_file_with_bim.tsv \
--region-name ENSG00000049246 ENSG00000054116 ENSG00000116678 \
--save-data
Peak-based example (caQTL; 10 chr22 peaks, 1,287 samples):
sos run pipeline/mnm_regression.ipynb susie_twas \
--name example_10peaks_twas \
--cwd output/susie_twas_out \
--genoFile input/example.chr22.bed \
--phenoFile output/inputs_ready/pheno_manifest.tsv \
--covFile input/example_covariates.tsv \
--customized-association-windows output/inputs_ready/association_windows.bed \
--phenotype-names caQTL \
--max-cv-variants 5000 \
--save-data
Expected log (both cases):
INFO: Running get_analysis_regions:
Loading customized association analysis window from <windows_bed>
INFO: get_analysis_regions is completed.
INFO: get_analysis_regions output: regional_data
INFO: Running susie_twas:
INFO: susie_twas (index=0) is completed.
INFO: susie_twas (index=1) is completed.
...
INFO: susie_twas output: <cwd>/fine_mapping/<name>.<chrom>_<ID>.univariate_bvsr.rds
<cwd>/twas_weights/<name>.<chrom>_<ID>.univariate_twas_weights.rds
... (2 × N_regions items)
INFO: Workflow susie_twas is executed successfully.
Command Interface#
sos run pipeline/mnm_regression.ipynb susie_twas -h
Flags consumed by susie_twas:
--genoFile— path to a PLINK1.bedfile (pass the.bedpath, not the prefix). FID/IID in the.fammust match the phenotype/covariate sample IDs.--phenoFile— 5-column#chr start end ID pathmanifest. One row per region;pathpoints at a bgzipped+tabixed phenotype BED restricted to that region. Pass multiple files for multi-condition analyses.--covFile— tab-delimited, literalIDheader in column 1; covariate names in rows, sample IDs across the header row.--customized-association-windows— 4-columnchr start end IDBED. Required unless--cis-windowis non-negative.--cis-window— symmetric cis radius (bp) around each region’sstart/end. Default-1(forces the customized-windows path).--phenotype-names— condition name(s), one per--phenoFile.--max-cv-variants— cap on variants used in cross-validation (default 5,000).--ld_reference_meta_file— summary-stats / RSS mode only; omit for individual-level.--region-name— restrict to one or more region IDs for sanity checks.--save-data— also emit*.univariate_data.rdswith the residualized X/Y matrices.
Output format#
Per region the workflow writes two RDS files (three with --save-data), keyed by
region ID and phenotype condition.
<name>.<chrom>_<ID>.univariate_bvsr.rds#
variant_names— variant IDs kept after QC.analysis_script— the R script that wassource()d (provenance).other_quantities$dropped_samples— per-matrix sample drops.sumstats—betahat,sebetahat,z_scores,p_values,q_values.sample_names— samples actually used.top_loci— data frame:variant_id, betahat, sebetahat, z, maf, pip, cs_coverage_0.95 / 0.7 / 0.5.susie_result_trimmed— trimmed SuSiE fit:pip, sets, cs_corr, sets_secondary, alpha, lbf_variable, V, niter, X_column_scale_factors, mu, mu2.total_time_elapsed—proc_time.region_info—region_coord(chrom/start/end),grange(expanded window),region_name(the manifestID).preset_variants_result— SuSiE fit on the preset-variant subset used for TWAS.
<name>.<chrom>_<ID>.univariate_twas_weights.rds#
susie_weights_intermediate—mu, lbf_variable, X_column_scale_factors, pipon the preset variant set.twas_weights— per-method weights:enet_weights, lasso_weights, bayes_r_weights, bayes_l_weights, mrash_weights, susie_weights.twas_predictions— same six methods, each a (sample × condition) matrix.twas_cv_result—sample_partition(fold assignment),prediction,performance(per method:corr, rsq, adj_rsq, pval, ...),time_elapsed.total_time_elapsed,variant_names,region_info.
<name>.<chrom>_<ID>.univariate_data.rds (--save-data)#
residual_Y,residual_X— phenotype/genotype residualized for covariates.residual_Y_scalar,residual_X_scalar— rescaling factors.dropped_sample— per-matrix drops.maf— named numeric (variant_id → maf).X— raw genotype matrix (sample × variant).chrom,grange,X_variance.
See load_regional_association_data in pecotmr
for the full struct.
Expected Results#
Quick sanity check — inspect cross-validation performance for the first region:
Rscript -e '
fp <- "output/mnm_regression/susie_twas/twas_weights/test_susie_twas.chr1_ENSG00000049246.univariate_twas_weights.rds"
x <- readRDS(fp)
cat("methods:", names(x[[1]][[1]]$twas_weights), "\n")
print(x[[1]][[1]]$twas_cv_result$performance)
'
Interpretation:
twas_weights$<method>— use for TWAS (Z ≈ weights · LD · GWAS z-scores).twas_cv_result$performance— pick the method with highestadj_rsqbefore applying.top_lociinunivariate_bvsr.rds— lead variants and credible-set coverages.susie_result_trimmed$sets$cs— indices of variants in each 95% credible set.
Common pitfalls#
Sample-ID drift. The phenotype BED’s header columns (5+), the bfile FID/IID, and the covariate TSV header must all be drawn from the same set. Sos does not warn on missing samples — it silently drops them and may report zero overlap.
Empty regions. If a region has no variants in its association window (e.g. the manifest points at a window with no bfile coverage) sos marks it completed with an empty output. Verify the expected file count in the
INFO: susie_twas output:line.--cis-windowvs--customized-association-windows. Either/or. If both are passed the customized windows win. Pass--cis-window -1to force the customized path explicitly.