Univariate Fine-Mapping and TWAS with SuSiE#

Overview#

This pipeline performs per-region univariate fine-mapping with SuSiE and computes TWAS weights, via the susie_twas workflow of pipeline/mnm_regression.ipynb. For each region the workflow fits SuSiE on the covariate-residualized phenotype × genotype matrix and then cross-validates five prediction models (enet, lasso, bayes_r, mrash, susie) to pick TWAS weights.

SuSiE searches for up to five independent signals per region; if all five are found, the cap is raised. SuSiE itself does not accept covariates, so covariates are regressed out of both the phenotype and the genotype matrix beforehand.

What is a “region”?#

A region is one row in the phenotype manifest (--phenoFile) plus its association window (from --customized-association-windows, or a symmetric --cis-window around start/end). One susie_twas call fits one SuSiE + TWAS model per region.

  • Gene-based phenotypes. Manifest row = one gene; window = gene’s containing TAD (or ± cis-window around the TSS).

  • Peak-based phenotypes (caQTL, mQTL). Manifest row = one peak/CpG; window = the peak’s containing extended-TAD. A peak is 2–5 kb and cannot serve as its own window.

Step 1 – Prepare the phenotype manifest and association-windows BED#

The pipeline expects two derived files per experiment:

  • pheno_manifest.tsv — 5 columns (#chr start end ID path), one row per region. path points to a bgzipped+tabixed phenotype BED restricted to that region.

  • association_windows.bed — 4 columns (chr start end ID), one row per region. start/end define the cis search window; ID must match a row in the manifest.

For gene-based phenotypes these files are produced by the standard phenotype preprocessing + TADB generation steps.

For peak-based phenotypes (caQTL, mQTL) you can derive both from the raw peak BED with the helper below; it renames the peak-ID header, splits the multi-sample BED into one bgz+tbi per peak, writes the manifest, and annotates each peak with its smallest containing extended-TAD.

bash code/misc/preprocess/prepare_peak_inputs.sh \
    <raw_peaks.bed.gz> \
    <extended_TADB.bed> \
    <out_dir>

# Produces:
#   <out_dir>/pheno_manifest.tsv
#   <out_dir>/association_windows.bed
#   <out_dir>/peaks_split/<ID>.bed.gz{,.tbi}

Step 2 – Run susie_twas#

Gene-based example (3 genes, default bulk-RNA-seq preprocessing):

sos run pipeline/mnm_regression.ipynb susie_twas \
    --name        test_susie_twas \
    --cwd         output/mnm_regression/susie_twas \
    --genoFile    output/genotype_by_chrom/wgs.merged.plink_qc.1.bed \
    --phenoFile   output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.region_list.txt \
    --covFile     output/covariate/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.covariates.wgs.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \
    --customized-association-windows reference_data/TAD/TADB_enhanced_cis.bed \
    --phenotype-names test_pheno \
    --max-cv-variants 5000 \
    --ld_reference_meta_file data/ld_meta_file_with_bim.tsv \
    --region-name ENSG00000049246 ENSG00000054116 ENSG00000116678 \
    --save-data

Peak-based example (caQTL; 10 chr22 peaks, 1,287 samples):

sos run pipeline/mnm_regression.ipynb susie_twas \
    --name        example_10peaks_twas \
    --cwd         output/susie_twas_out \
    --genoFile    input/example.chr22.bed \
    --phenoFile   output/inputs_ready/pheno_manifest.tsv \
    --covFile     input/example_covariates.tsv \
    --customized-association-windows output/inputs_ready/association_windows.bed \
    --phenotype-names caQTL \
    --max-cv-variants 5000 \
    --save-data

Expected log (both cases):

INFO: Running get_analysis_regions:
Loading customized association analysis window from <windows_bed>
INFO: get_analysis_regions is completed.
INFO: get_analysis_regions output:   regional_data
INFO: Running susie_twas:
INFO: susie_twas (index=0) is completed.
INFO: susie_twas (index=1) is completed.
...
INFO: susie_twas output:   <cwd>/fine_mapping/<name>.<chrom>_<ID>.univariate_bvsr.rds
                           <cwd>/twas_weights/<name>.<chrom>_<ID>.univariate_twas_weights.rds
                           ...  (2 × N_regions items)
INFO: Workflow susie_twas is executed successfully.

Command Interface#

sos run pipeline/mnm_regression.ipynb susie_twas -h

Flags consumed by susie_twas:

  • --genoFile — path to a PLINK1 .bed file (pass the .bed path, not the prefix). FID/IID in the .fam must match the phenotype/covariate sample IDs.

  • --phenoFile — 5-column #chr start end ID path manifest. One row per region; path points at a bgzipped+tabixed phenotype BED restricted to that region. Pass multiple files for multi-condition analyses.

  • --covFile — tab-delimited, literal ID header in column 1; covariate names in rows, sample IDs across the header row.

  • --customized-association-windows — 4-column chr start end ID BED. Required unless --cis-window is non-negative.

  • --cis-window — symmetric cis radius (bp) around each region’s start/end. Default -1 (forces the customized-windows path).

  • --phenotype-names — condition name(s), one per --phenoFile.

  • --max-cv-variants — cap on variants used in cross-validation (default 5,000).

  • --ld_reference_meta_file — summary-stats / RSS mode only; omit for individual-level.

  • --region-name — restrict to one or more region IDs for sanity checks.

  • --save-data — also emit *.univariate_data.rds with the residualized X/Y matrices.

Output format#

Per region the workflow writes two RDS files (three with --save-data), keyed by region ID and phenotype condition.

<name>.<chrom>_<ID>.univariate_bvsr.rds#

  • variant_names — variant IDs kept after QC.

  • analysis_script — the R script that was source()d (provenance).

  • other_quantities$dropped_samples — per-matrix sample drops.

  • sumstatsbetahat, sebetahat, z_scores, p_values, q_values.

  • sample_names — samples actually used.

  • top_loci — data frame: variant_id, betahat, sebetahat, z, maf, pip, cs_coverage_0.95 / 0.7 / 0.5.

  • susie_result_trimmed — trimmed SuSiE fit: pip, sets, cs_corr, sets_secondary, alpha, lbf_variable, V, niter, X_column_scale_factors, mu, mu2.

  • total_time_elapsedproc_time.

  • region_inforegion_coord (chrom/start/end), grange (expanded window), region_name (the manifest ID).

  • preset_variants_result — SuSiE fit on the preset-variant subset used for TWAS.

<name>.<chrom>_<ID>.univariate_twas_weights.rds#

  • susie_weights_intermediatemu, lbf_variable, X_column_scale_factors, pip on the preset variant set.

  • twas_weights — per-method weights: enet_weights, lasso_weights, bayes_r_weights, bayes_l_weights, mrash_weights, susie_weights.

  • twas_predictions — same six methods, each a (sample × condition) matrix.

  • twas_cv_resultsample_partition (fold assignment), prediction, performance (per method: corr, rsq, adj_rsq, pval, ...), time_elapsed.

  • total_time_elapsed, variant_names, region_info.

<name>.<chrom>_<ID>.univariate_data.rds (--save-data)#

  • residual_Y, residual_X — phenotype/genotype residualized for covariates.

  • residual_Y_scalar, residual_X_scalar — rescaling factors.

  • dropped_sample — per-matrix drops.

  • maf — named numeric (variant_id maf).

  • X — raw genotype matrix (sample × variant).

  • chrom, grange, X_variance.

See load_regional_association_data in pecotmr for the full struct.

Expected Results#

Quick sanity check — inspect cross-validation performance for the first region:

Rscript -e '
fp <- "output/mnm_regression/susie_twas/twas_weights/test_susie_twas.chr1_ENSG00000049246.univariate_twas_weights.rds"
x  <- readRDS(fp)
cat("methods:", names(x[[1]][[1]]$twas_weights), "\n")
print(x[[1]][[1]]$twas_cv_result$performance)
'

Interpretation:

  • twas_weights$<method> — use for TWAS (Z weights · LD · GWAS z-scores).

  • twas_cv_result$performance — pick the method with highest adj_rsq before applying.

  • top_loci in univariate_bvsr.rds — lead variants and credible-set coverages.

  • susie_result_trimmed$sets$cs — indices of variants in each 95% credible set.

Common pitfalls#

  • Sample-ID drift. The phenotype BED’s header columns (5+), the bfile FID/IID, and the covariate TSV header must all be drawn from the same set. Sos does not warn on missing samples — it silently drops them and may report zero overlap.

  • Empty regions. If a region has no variants in its association window (e.g. the manifest points at a window with no bfile coverage) sos marks it completed with an empty output. Verify the expected file count in the INFO: susie_twas output: line.

  • --cis-window vs --customized-association-windows. Either/or. If both are passed the customized windows win. Pass --cis-window -1 to force the customized path explicitly.