Advanced regression models for association analysis with individual-level data#
This is a new-user-friendly minimal working example (MWE) of the mnm_regression pipeline. It runs end-to-end on the bundled chr22 toy dataset (60 samples) and shows how to compute SuSiE fine-mapping results and multi-method TWAS weights for a list of molecular-phenotype regions, then build the ensemble (multi-context) TWAS weights.
Prerequisites#
Requires processed genotype (PLINK/VCF) and phenotype files from data_preprocessing. For TWAS, also requires the LD reference generated by rss_ld_sketch.
Description#
This pipeline fits advanced statistical models for QTL association analysis using individual-level genotype and molecular-phenotype data. For each region in a phenotype list it runs univariate SuSiE fine-mapping and trains TWAS prediction weights with several methods, then combines them into an ensemble weight.
The primary example below uses gene expression as the molecular phenotype, running end-to-end on the same chr22 toy genotype:
Gene-expression input (
input/colocboost/, Mode A): the phenotype is a bulk RNA-seq matrix with Ensemblgene_ids; association windows are gene TSS +/- 500 kb. This is the main, fully verified example.Peak / chromatin input (
input/, Mode B): the same pipeline also accepts a list of chr22 peaks (or any other molecular phenotype) - the command is identical, you just swap in the phenotype BED and its manifest.
Workflows used here:
susie_twas– univariate SuSiE fine-mapping plus multi-method TWAS weights (one*.univariate_twas_weights.rdsper region). Run with--save-dataso the combined fine-mapping object is written for the next step.mnm– multi-context fine-mapping (mvSuSiE + mr.mash) that consumes thesusie_twas --save-dataoutput through a--fine-mapping-metatable and produces the ensemble weights (one*.multicontext_twas_weights.rdsper gene).
The notebook also documents the mnm_genes (multi-gene), mvfsusie/fsusie (functional), and mvsusie (multivariate) workflows that the pipeline supports, but the MWE focuses on the univariate (susie_twas) + multi-context (mnm) path that is runnable on the toy data.
Note on data: this example uses a small synthetic chr22 toy dataset (60 samples, 1,995 variants after subsetting). No access-controlled individual-level human data is used.
Input Files#
All paths below are relative to the toy_example/ root (where you launch the commands).
Shared genotype (both modes):
Genotype (
--genoFile): PLINK1 binary set. Pass the.bedfile; the.bim/.famare read automatically. Gene-expression mode (input/colocboost/) usesexample.chr22.{bed,bim,fam}; peak mode (input/) usesprotocol_example.genotype.chr22.{bed,bim,fam}(chr22, 1,995 variants).Covariates (
--covFile): a TSV with covariate rows and one column per sample (gene-expression mode:input/colocboost/example_covariates.tsv; peak mode:input/covariate/protocol_example.covariates.tsv).
Gene-expression mode (input/colocboost/) - primary example:
Phenotype list (
--phenoFile):input/colocboost/pheno_manifest_1gene.tsv(a single gene, fastest) orinput/colocboost/pheno_manifest.tsv(16 in-range chr22 genes). Thepathcolumn points to the bgzipped + tabixed RNA-seq matrixinput/colocboost/example/colocboost.bed.gzwhose 4th column is the Ensemblgene_id.Association windows (
--customized-association-windows):input/colocboost/association_windows.bed(or_1gene.bed) – gene TSS +/- 500 kb, clamped to the genotype variant range.
Peak mode (input/) - same format, alternative input:
Phenotype list (
--phenoFile):input/phenotype/protocol_example.pheno_manifest.tsv– 5 columns#chr start end ID path, one row per peak (10 peaks). Eachpathpoints to a bgzipped per-peak phenotype BED underinput/proteomics/protocol_example.peaks_split/(samples use the anonymized genotype IDs, e.g.S0561, matching the shared PLINK.fam).Association windows (
--customized-association-windows):input/finemapping/protocol_example.association_windows.bed– the cis window to analyze for each peakID.
Key requirement: the 4th-column ID of the association-windows file must match the 4th-column ID of the phenotype list, otherwise the window for that region is not found and the region is skipped.
Steps#
Run the two steps in order. Step 1 (susie_twas) produces the per-region univariate TWAS weights; Step 2 (mnm) reads those weights and adds the multi-context ensemble. The pipeline notebook is referenced as ../../pipeline/mnm_regression.ipynb.
Step 1. SuSiE fine-mapping + multi-method TWAS weights for every region in the phenotype list. Run with --save-data so the combined fine-mapping object is written for Step 2.
Step 2. Build the ensemble (multi-context) TWAS weight from the Step-1 output with mnm. The ensemble is computed by default on the toy data (the ensemble_r2_threshold is relaxed so no method is silently dropped).
Memory note: run multi-region loops with -j1. Running many regions in parallel (-j4) – or even all of them serially in one process – can exhaust memory on heavier regions and trigger ERROR: One of the local workers has been killed. If that happens, re-run the remaining heavy regions one at a time using a single-region phenotype list.
Mode A: gene-expression input (primary example)#
This is the main example: cis xQTL fine-mapping on gene expression. Two steps, pointed at the input/colocboost/ files. Use the 1-gene phenotype list for the fastest run; swap in pheno_manifest.tsv / association_windows.bed to analyze all 16 in-range genes.
Step 1 (gene expression): SuSiE + TWAS weights.
Timing (toy chr22 dataset, 60 samples):
susie_twas: ~3 min (chr22 toy, 1 gene)mnm: ~30 sec
sos run pipeline/mnm_regression.ipynb qtl_dataset_construct+susie_twas \
--name protocol_example --cwd output_uni \
--genoFile input/colocboost/example.chr22.bed \
--phenoFile input/colocboost/pheno_manifest_multicontext.tsv \
--covFile input/colocboost/example_covariates.tsv \
--customized-association-windows input/colocboost/association_windows.bed \
--region-name ENSG00000130538 --transpose-covariates --save-data -j1
Step 2 (gene expression): multi-context fine-mapping + ensemble TWAS weights (mnm).
sos run pipeline/mnm_regression.ipynb mnm \
--name protocol_example --cwd output_mnm \
--genoFile input/colocboost/example.chr22.bed \
--phenoFile input/colocboost/pheno_manifest_multicontext.tsv \
--covFile input/colocboost/example_covariates.tsv \
--customized-association-windows input/colocboost/association_windows.bed \
--fine-mapping-meta input/colocboost/fine_mapping_meta.tsv \
--transpose-covariates --save-data -j1
Mode B: peak / other molecular phenotypes (same pipeline, different input)#
The same pipeline also accepts peaks (chromatin/epigenomic) or any other molecular phenotype - just swap in the phenotype .bed.gz and its manifest; the command is otherwise identical to Mode A. Example below uses the 10 toy chr22 peaks.
Step 1 (peak): SuSiE + TWAS weights for the 10 peaks.
sos run pipeline/mnm_regression.ipynb susie_twas \
--name protocol_example \
--cwd output_10peaks \
--genoFile input/genotype/protocol_example.genotype.chr22.bed \
--phenoFile input/phenotype/protocol_example.pheno_manifest.tsv \
--covFile input/covariate/protocol_example.covariates.tsv \
--customized-association-windows input/finemapping/protocol_example.association_windows.bed \
-j1
Step 2 (peak): multi-context fine-mapping + ensemble TWAS weights (mnm).
# Step 2 (10 peaks): multi-context fine-mapping + ensemble TWAS weights (mnm).
# As above, mnm consumes the Step-1 univariate outputs via --fine-mapping-meta.
sos run pipeline/mnm_regression.ipynb mnm \
--name protocol_example \
--cwd output_10peaks \
--genoFile input/genotype/protocol_example.genotype.chr22.bed \
--phenoFile input/phenotype/protocol_example.pheno_manifest.tsv \
--covFile input/covariate/protocol_example.covariates.tsv \
--customized-association-windows input/finemapping/protocol_example.association_windows.bed \
--fine-mapping-meta output_10peaks/fine_mapping_meta.tsv \
-j1
Output Files#
Under --cwd (e.g. output_10peaks/ or output_uni/):
twas_weights/{name}.<region>.univariate_twas_weights.rds– Step 1 (susie_twas) output, one per region: the SuSiE fine-mapping result plus TWAS weights from each method (e.g. lasso, enet, mrash, susie), variant names, region info and CV performance.multivariate_twas_weights/{name}.<region>.multicontext_twas_weights.rds– Step 2 (mnm) output, one per gene: the multi-context model with the added ensemble. The ensemble is stored under anensembleentry alongsidemethod_coef, the ensemble weights, andmethod_performance.
Fine-mapping objects are also written under the fine_mapping/ subfolder, and per-region .stdout/.stderr logs accompany each run.
Anticipated Results#
For the 10-peak run you should get 10 *.univariate_twas_weights.rds (one per peak) and 10 *.multicontext_twas_weights.rds from the mnm step. For the 1-gene gene-expression run you get one file of each type.
You can sanity-check a multi-context ensemble weight in R:
# In R: inspect one multi-context (mnm) ensemble output
w <- readRDS('output_uni/multivariate_twas_weights/protocol_example.ENSG00000130538.multicontext_twas_weights.rds')
names(w) # available genes / contexts
str(w, max.level = 2) # weights + per-method performance
Command interface#
Show all workflows and options of the pipeline:
sos run pipeline/mnm_regression.ipynb -h
Other workflows#
The same pipeline notebook also provides:
mvsusie– multivariate SuSiE with mr.mash TWAS weights (for jointly analyzing multiple phenotypes that share a genotype).mnm_genes– multi-gene multivariate fine-mapping across genes sharing a window.fsusie– functional SuSiE for functional/high-resolution modalities (e.g. methylation), with a--fsusie-post-processingoption (TI,HMM, ornone;TIis recommended).
These follow the same --genoFile / --phenoFile / --covFile interface; they are documented here for completeness but the verified MWE above uses susie_twas + mnm.
The notebook also bundles the original protocol’s multivariate (mnm, mnm_genes) and functional (fsusie, mvfsusie) fine-mapping workflows, which operate on the same regional_data produced by get_analysis_regions. Of these, susie_twas (univariate), fsusie, mnm, and mnm_genes now run end-to-end on the toy data (mnm and mnm_genes use a synthetic second context, see below). mnm_genes consumes the univariate fine-mapping summary statistics (combined_data_sumstats) written by susie_twas --save-data together with the mnm outputs, wired together through a --fine_mapping_meta table; on the toy set it correctly returns NULL multigene results (no multi-gene/trans signal). mvfsusie remains still under development in the original protocol, so its command is given as a template for completeness.
Multivariate analysis: mvSuSiE and mr.mash (mnm)#
Fine-maps multiple molecular phenotypes jointly across a shared window using mvSuSiE with mr.mash priors. This step needs multiple contexts (e.g. the same genes measured under two or more conditions). The toy set ships a single context, so for this MWE we add a second, synthetic context (example/colocboost_ctx2.bed.gz, derived from context 1 with added noise — demo data only, not a real measurement) and a multi-context phenotype manifest. With that in place the mnm step now runs end-to-end on the toy data and produces multicontext_bvsr.rds, multicontext_twas_weights.rds, and multicontext_data.rds per gene.
# Runs end-to-end on the toy dataset using a synthetic 2-context manifest.
# Notes for new users:
# - mnm needs >=2 contexts; pheno_manifest_multicontext.tsv supplies context1 (real)
# and context2 (example/colocboost_ctx2.bed.gz, synthetic demo data).
# - Use the gene-expression genotype (input/colocboost/example.chr22.bed, sample IDs
# SAMPLE_001..) so genotype and phenotype/covariate samples match.
# - Requires the mr.mashr R package (provided by mr.mash.alpha in this env).
sos run pipeline/mnm_regression.ipynb mnm \
--name protocol_example --cwd output_mnm3 \
--genoFile input/colocboost/example.chr22.bed \
--phenoFile input/colocboost/pheno_manifest_multicontext.tsv \
--covFile input/colocboost/example_covariates.tsv \
--customized-association-windows input/colocboost/association_windows.bed \
--save-data -j1
xQTL fine-mapping across multiple genes (mnm_genes)#
Multivariate fine-mapping treating each gene as a separate condition across a shared window. Verified end-to-end on the toy dataset using the univariate (susie_twas --save-data) and mnm outputs wired together through a --fine_mapping_meta table. On the toy set every region returns a NULL multigene_bvsr (no multi-gene/trans signal), which is the correct, honest result for these independent single-gene regions.
# Multi-gene / trans fine-mapping (mnm_genes) - VERIFIED end-to-end on the toy data.
# Run from the toy_example root. Requires:
# --pheno_id_map_file : maps phenotype IDs across the two contexts
# --fine_mapping_meta : per-region table pointing to the univariate (susie_twas, run above
# with --save-data) and multivariate (mnm) fine-mapping outputs
# --keep-samples : sample IDs to retain (matches the gene-expression genotype)
# On this toy set all regions return a NULL multigene_bvsr (no multi-gene/trans signal),
# which is the correct, honest result for independent single-gene toy regions.
sos run pipeline/mnm_regression.ipynb mnm_genes \
--name protocol_example --cwd output_mnmgenes \
--genoFile input/colocboost/example.chr22.bed \
--phenoFile input/colocboost/pheno_manifest_multicontext.tsv \
--covFile input/colocboost/example_covariates.tsv \
--customized-association-windows input/colocboost/association_windows.bed \
--pheno_id_map_file input/colocboost/pheno_id_map.tsv \
--fine_mapping_meta input/colocboost/fine_mapping_meta.tsv \
--keep-samples input/colocboost/keep_samples.txt \
--save-data -j1
Functional regression fSuSiE for epigenomic QTL fine-mapping (fsusie)#
Functional SuSiE for fine-mapping a functional phenotype across genomic positions in a region. This workflow runs end-to-end on the toy peak data and produces a *.fsusie_mixture_normal_TI__top_pc_weights.rds weight file.
# Functional SuSiE (fsusie) - VERIFIED end-to-end on the toy peak data.
# Run from the toy_example root. Produces *.fsusie_mixture_normal_TI__top_pc_weights.rds.
sos run pipeline/mnm_regression.ipynb fsusie \
--name protocol_example --cwd output/fsusie \
--genoFile input/genotype/protocol_example.genotype.chr22.bed \
--phenoFile input/phenotype/protocol_example.pheno_manifest.tsv \
--covFile input/covariate/protocol_example.covariates.tsv \
--customized-association-windows input/finemapping/protocol_example.association_windows.bed \
--cis-window 0 --max-cv-variants 5000 --save-data -j1
Functional regression fSuSiE with other modality (mvfsusie)#
Multivariate functional SuSiE (mvfSuSiE). Still under development for the toy dataset.
# Still under development on the toy dataset (the mvfSuSiE R step errors out).
# Command template:
sos run pipeline/mnm_regression.ipynb mvfsusie \
--name protocol_example --cwd output/mvfsusie \
--genoFile input/genotype/protocol_example.genotype.chr22.bed \
--phenoFile input/phenotype/protocol_example.pheno_manifest.tsv \
--covFile input/covariate/protocol_example.covariates.tsv \
--customized-association-windows input/finemapping/protocol_example.association_windows.bed \
--save-data -j1
Troubleshooting#
Step |
Problem |
Possible Reason |
Solution |
|---|---|---|---|
susie_twas |
A region is skipped / no window found |
4th-column |
Make the |
susie_twas |
|
No region selected |
Pass |
susie_twas |
|
Out-of-memory: too many regions in one process ( |
Use |
mnm |
|
|
Chain it: |
mnm |
|
Single-context manifest used |
Use the multi-context manifest ( |
both |
|
Covariates are in QTLtools (transposed) layout |
Add |
both |
Sample IDs do not overlap between genotype/covariates and phenotype |
Genotype |
Rename one side so they match (e.g. the gene-expression mode renames genotype/covariate IDs to |
Workflow implementation#
The runnable SoS workflow sections below are carried over verbatim from the original mnm_regression.ipynb. Sections still under active development are marked as WIP placeholders and are not part of the supported MWE.
[global]
parameter: modular_script_dir = path('code/script') # override with --modular-script-dir
# It is required to input the name of the analysis
parameter: name = str
parameter: cwd = path("output")
# A list of file paths for genotype data, or the genotype data itself.
parameter: genoFile = path
# One or multiple lists of file paths for phenotype data.
parameter: phenoFile = paths
# One or multiple lists of file paths for phenotype ID mapping file. The first column should be the original ID, the 2nd column should be the ID to be mapped to.
parameter: phenoIDFile = paths()
# Covariate file path
parameter: covFile = paths
# Optional: if a region list is provide the analysis will be focused on provided region.
# The LAST column of this list will contain the ID of regions to focus on
# Otherwise, all regions with both genotype and phenotype files will be analyzed
parameter: region_list = path()
# Optional: if a region name is provided
# the analysis would be focused on the union of provides region list and region names
parameter: region_name = []
# Only focus on a subset of samples
parameter: keep_samples = path()
# Only focus on a subset of variants
parameter: keep_variants = path()
# An optional list documenting the custom association window for each region to analyze, with four column, chr, start, end, region ID (eg gene ID).
# If this list is not provided, the default `window` parameter (see below) will be used.
parameter: customized_association_windows = path()
# Specify the cis window for the up and downstream radius to analyze around the region of interest in units of bp
# When this is set to negative, we will rely on using customized_association_windows
parameter: cis_window = -1
# save data object or not
parameter: save_data = False
# Name of phenotypes
parameter: phenotype_names = [f'{x:bn}' for x in phenoFile]
# And indicator whether it is trans-analysis, ie, not using phenotypic coordinate information
parameter: trans_analysis = False
parameter: seed = 999
# association analysis paramters
# initial number of single effects for SuSiE
parameter: init_L = 8
# maximum number of single effects to use for SuSiE
parameter: L = 30
parameter: estimate_residual_variance = True
# remove a variant if it has more than imiss missing individual level data
parameter: imiss = 1.0
# MAF and variance of X cutoff
parameter: maf = 0.0025
parameter: xvar_cutoff = 0.0
# MAC cutoff, on top of MAF cutoff
parameter: mac = 5
# Remove indels if indel = False
parameter: indel = True
parameter: pip_cutoff = 0.025
parameter: coverage = [0.95, 0.7, 0.5]
# If this value is not 0, then an initial single effect analysis will be performed
# to determine if follow up analysis will be continued or to simply return NULL
# If this is negative we use a default way to determine this cutoff which is conservative but still useful
parameter: skip_analysis_pip_cutoff = []
# Skip fine-mapping
parameter: skip_fine_mapping = False
# Skip TWAS weights computation
parameter: skip_twas_weights = False
# Perform K folds valiation CV for TWAS
# Set it to zero if this is to be skipped
parameter: twas_cv_folds = 5
parameter: twas_cv_threads = twas_cv_folds
# maximum number of variants to consider for CV
# We will randomly pick a subset of it for CV purpose
# We can set it to eg 8000 to save computational burden althought may risk overfitting for methods comparison purpose
# When set to -1 we don't use this feature
parameter: max_cv_variants = -1
parameter: ld_reference_meta_file = path()
# Analysis environment settings
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
# For cluster jobs, number commands to run per job
parameter: job_size = 200
# Wall clock time expected
parameter: walltime = "1h"
# Memory expected
parameter: mem = "20G"
# Number of threads
parameter: numThreads = 1
if len(phenoFile) != len(covFile):
raise ValueError("Number of input phenotypes files must match that of covariates files")
if len(phenoFile) != len(phenotype_names):
raise ValueError("Number of input phenotypes files must match the number of phenotype names")
if len(phenoIDFile) > 0 and len(phenoFile) != len(phenoIDFile):
raise ValueError("Number of input phenotypes files must match the number of phenotype ID mapping files")
if len(skip_analysis_pip_cutoff) == 0:
skip_analysis_pip_cutoff = [0.0] * len(phenoFile)
if len(skip_analysis_pip_cutoff) == 1:
skip_analysis_pip_cutoff = skip_analysis_pip_cutoff * len(phenoFile)
if len(skip_analysis_pip_cutoff) != len(phenoFile):
raise ValueError(f"``skip_analysis_pip_cutoff`` should have either length 1 or length the same as phenotype files ({len(phenoFile)} in this case)")
# make it into an R List string
skip_analysis_pip_cutoff = [f"'{y}'={x}" for x,y in zip(skip_analysis_pip_cutoff, phenotype_names)]
def group_by_region(lst, partition):
# from itertools import accumulate
# partition = [len(x) for x in partition]
# Compute the cumulative sums once
# cumsum_vector = list(accumulate(partition))
# Use slicing based on the cumulative sums
# return [lst[(cumsum_vector[i-1] if i > 0 else 0):cumsum_vector[i]] for i in range(len(partition))]
return partition
import os
import pandas as pd
def adapt_file_path(file_path, reference_file):
"""
Adapt a single file path based on its existence and a reference file's path.
Args:
- file_path (str): The file path to adapt.
- reference_file (str): File path to use as a reference for adaptation.
Returns:
- str: Adapted file path.
Raises:
- FileNotFoundError: If no valid file path is found.
"""
reference_path = os.path.dirname(reference_file)
# Check if the file exists
if os.path.isfile(file_path):
return file_path
# Check file name without path
file_name = os.path.basename(file_path)
if os.path.isfile(file_name):
return file_name
# Check file name in reference file's directory
file_in_ref_dir = os.path.join(reference_path, file_name)
if os.path.isfile(file_in_ref_dir):
return file_in_ref_dir
# Check original file path prefixed with reference file's directory
file_prefixed = os.path.join(reference_path, file_path)
if os.path.isfile(file_prefixed):
return file_prefixed
# If all checks fail, raise an error
raise FileNotFoundError(f"No valid path found for file: {file_path}")
def adapt_file_path_all(df, column_name, reference_file):
return df[column_name].apply(lambda x: adapt_file_path(x, reference_file))
[qtl_dataset_construct]
# Build one pecotmr::QtlDataset from the phenotype manifest + shared genotype and
# serialize to RDS, replacing the legacy get_analysis_regions / regional_data
# loader. No fan-out: downstream per-gene fine-mapping / TWAS load this single
# RDS and select contexts/genes at analysis time.
#
# Manifest schema (qtl_dataset_construct.R): columns ID, path, cond and an
# optional cov_path; extra columns (e.g. #chr/start/end) are ignored. The toy
# MWE pheno_manifest_multicontext.tsv already matches -- a single-context
# analysis just points at a one-`cond` manifest (or selects the context
# downstream). QTLtools-format covariate TSVs (rows = covariates) need
# --transpose-covariates.
parameter: study = name
# Set for QTLtools-format covariate TSVs (rows = covariates, cols = samples);
# required for the toy MWE covariates.
parameter: transpose_covariates = False
# Optional genotype-derived covariates applied uniformly across contexts;
# per-context covariates come from the manifest cov_path column.
parameter: genotype_covariates = path('.')
# Construct-time variant/sample QC (stored on the QtlDataset; cutoff 0 = off).
parameter: maf_cutoff = 0.0
parameter: xvar_cutoff = 0.0
parameter: mac_cutoff = 0.0
parameter: imiss_cutoff = 0.0
# Drop indel variants (default keeps them, matching QtlDataset()).
parameter: drop_indel = False
# Optional whitespace-delimited ID files to restrict samples / variants ("." = none).
parameter: keep_samples = "."
parameter: keep_variants = "."
output: f"{cwd:a}/qtl_dataset/{study}.qtl_dataset.rds"
task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f"{step_name}_{_output:bn}"
bash: expand = '${ }', stderr = f"{_output}.stderr", stdout = f"{_output}.stdout", container = container
Rscript ${modular_script_dir}/pecotmr_integration/qtl_dataset_construct.R \
--study ${study} \
--genotype-prefix ${genoFile:n} \
--phenotype-manifest ${phenoFile[0]} \
--genotype-covariates ${genotype_covariates if genotype_covariates.is_file() else '""'} \
${'--transpose-covariates' if transpose_covariates else ''} \
--maf-cutoff ${maf_cutoff} \
--xvar-cutoff ${xvar_cutoff} \
--mac-cutoff ${mac_cutoff} \
--imiss-cutoff ${imiss_cutoff} \
${'--drop-indel' if drop_indel else ''} \
${('--keep-samples ' + str(keep_samples)) if keep_samples != '.' else ''} \
${('--keep-variants ' + str(keep_variants)) if keep_variants != '.' else ''} \
--output ${_output}
[susie_twas]
# Univariate SuSiE fine-mapping + TWAS weights over the pre-built QtlDataset
# (from the qtl_dataset_construct step), one gene per fan-out unit. The gene(s)
# to analyze are passed on the command line via --region-name; the QtlDataset
# RDS already holds the genotype + per-context phenotypes, so this step neither
# re-reads the phenotype manifest nor rebuilds regional_data. Fine-mapping runs
# first (pecotmr_integration/fine_mapping.R -> fineMappingPipeline); the TWAS
# weights pass then reuses its SuSiE fits via --fine-mapping-result
# (pecotmr_integration/twas_weights.R -> twasWeightsPipeline). Replaces the
# legacy regional_data + susie_twas.R path.
parameter: cis_window = 1000000
parameter: fine_mapping_methods = "susie"
parameter: twas_methods = "susie,susieInf,mrash,enet,lasso,mcp,scad,l0learn,bayes_r,bayes_c"
parameter: fine_mapping_coverage = 0.95
# Comma-separated context names to restrict both passes to; empty = all contexts.
parameter: contexts = ""
# Fine-mapping SER pre-screen (0 = off, <0 = adaptive 3/nVariants).
parameter: pip_cutoff_to_skip = 0.0
# TWAS weight-learning knobs.
parameter: min_twas_maf = 0.01
parameter: min_twas_xvar = 0.01
parameter: max_cv_variants = 5000
parameter: cv_folds = 5
parameter: cv_threads = 1
# Reproducibility: integer RNG seed; -1 = unset.
parameter: seed = -1
# Opt-in per-analysis overrides of the QtlDataset's construct-time filters,
# applied to both passes. Sentinels: a negative cutoff / drop_indel False /
# "." path = leave the dataset's stored value untouched.
parameter: maf_cutoff = -1.0
parameter: mac_cutoff = -1.0
parameter: xvar_cutoff = -1.0
parameter: imiss_cutoff = -1.0
parameter: drop_indel = False
parameter: keep_samples = "."
parameter: keep_variants = "."
# Optional JSON of per-method kwargs forwarded to the pipelines, e.g.
# '{"susie":{"L":10}}' for fine-mapping or '{"lasso":{"nfolds":10}}' for TWAS.
parameter: fine_mapping_method_args = ""
parameter: twas_method_args = ""
fail_if(len(region_name) == 0, "susie_twas: pass --region-name <gene-id> [<gene-id> ...] to choose which gene(s) to analyze.")
qtl_dataset = path(f"{cwd:a}/qtl_dataset/{name}.qtl_dataset.rds")
# Assemble the shared opt-in worker flags once (command construction only).
seed_arg = f"--seed {seed}" if seed >= 0 else ""
filter_overrides = " ".join(filter(None, [
f"--maf-cutoff {maf_cutoff}" if maf_cutoff >= 0 else "",
f"--mac-cutoff {mac_cutoff}" if mac_cutoff >= 0 else "",
f"--xvar-cutoff {xvar_cutoff}" if xvar_cutoff >= 0 else "",
f"--imiss-cutoff {imiss_cutoff}" if imiss_cutoff >= 0 else "",
"--drop-indel" if drop_indel else "",
f"--keep-samples {keep_samples}" if keep_samples != "." else "",
f"--keep-variants {keep_variants}" if keep_variants != "." else "",
]))
input: qtl_dataset, for_each = "region_name"
output: fine_mapping = f"{cwd:a}/fine_mapping/{name}.{_region_name}.univariate_bvsr.rds",
twas_weights = f"{cwd:a}/twas_weights/{name}.{_region_name}.univariate_twas_weights.rds"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f"{step_name}_{_output[0]:bnn}"
bash: expand = "${ }", stderr = f"{_output[0]:nn}.susie_twas.stderr", stdout = f"{_output[0]:nn}.susie_twas.stdout", container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/pecotmr_integration/fine_mapping.R \
--qtl-dataset ${_input} \
--gene-id ${_region_name} \
--cis-window ${cis_window} \
--methods ${fine_mapping_methods} \
--coverage ${fine_mapping_coverage} \
--pip-cutoff-to-skip ${pip_cutoff_to_skip} \
${seed_arg} ${filter_overrides} \
${("--contexts " + contexts) if contexts else ""} \
${("--method-args '" + fine_mapping_method_args + "'") if fine_mapping_method_args else ""} \
--output ${_output["fine_mapping"]}
Rscript ${modular_script_dir}/pecotmr_integration/twas_weights.R \
--qtl-dataset ${_input} \
--gene-id ${_region_name} \
--cis-window ${cis_window} \
--methods ${twas_methods} \
--min-twas-maf ${min_twas_maf} \
--min-twas-xvar ${min_twas_xvar} \
--max-cv-variants ${max_cv_variants} \
--cv-folds ${cv_folds} \
--cv-threads ${cv_threads} \
--fine-mapping-result ${_output["fine_mapping"]} \
${seed_arg} ${filter_overrides} \
${("--contexts " + contexts) if contexts else ""} \
${("--method-args '" + twas_method_args + "'") if twas_method_args else ""} \
--output ${_output["twas_weights"]}
[mnm]
# Prior model file generated from mashr.
# Default will be used if it does not exist.
parameter: mixture_prior = path()
parameter: mixture_prior_cv = path()
parameter: prior_weights_min = 5e-4
parameter: prior_canonical_matrices = False
parameter: sample_partition = path()
parameter: mvsusie_max_iter = 200
parameter: mrmash_max_iter = 5000
depends: sos_variable("regional_data")
# Check if both 'data' and 'meta_info' are empty lists
stop_if(len(regional_data['data']) == 0, f'Either genotype or phenotype data are not available for region {", ".join(region_name)}.')
meta_info = regional_data['meta_info']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta_info"
if skip_fine_mapping and skip_twas_weights:
save_data = True
output_files = [f'{cwd:a}/data/{name}.{_meta_info[0].split(":")[0]}_{_meta_info[2]}.multicontext_data.rds']
elif not skip_fine_mapping and skip_twas_weights:
output_files = [f'{cwd:a}/multivariate_fine_mapping/{name}.{_meta_info[0].split(":")[0]}_{_meta_info[2]}.multicontext_bvsr.rds']
elif skip_fine_mapping and not skip_twas_weights:
# Warning that fine-mapping is required for TWAS weights
if not ld_reference_meta_file.is_dir():
print("WARNING: Fine-mapping is required to generate TWAS weights. Setting skip_fine_mapping to False. "
"Fine-mapping will be performed and saved on variants listed in %s." % ld_reference_meta_file)
else:
print("WARNING: Fine-mapping is required to generate TWAS weights. Setting skip_fine_mapping to False.")
skip_fine_mapping = False
output_files = [f'{cwd:a}/multivariate_fine_mapping/{name}.{_meta_info[0].split(":")[0]}_{_meta_info[2]}.multicontext_bvsr.rds',
f'{cwd:a}/multivariate_twas_weights/{name}.{_meta_info[0].split(":")[0]}_{_meta_info[2]}.multicontext_twas_weights.rds']
else:
output_files = [f'{cwd:a}/multivariate_fine_mapping/{name}.{_meta_info[0].split(":")[0]}_{_meta_info[2]}.multicontext_bvsr.rds',
f'{cwd:a}/multivariate_twas_weights/{name}.{_meta_info[0].split(":")[0]}_{_meta_info[2]}.multicontext_twas_weights.rds']
output: output_files
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f"{_output[0]:nn}.mnm.stderr", stdout = f"{_output[0]:nn}.mnm.stdout", container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/mnm_analysis/mnm_methods/mnm.R \
--genotype ${_input[0]:a} \
--phenotype "${",".join([str(x.absolute()) for x in _input[1:len(_input)//2+1]])}" \
--covariate "${",".join([str(x.absolute()) for x in _input[len(_input)//2+1:]])}" \
--region "${_meta_info[0]}" \
--window "${_meta_info[1]}" \
--region-name "${_meta_info[2]}" \
--extract-region-names "${"|".join([x if isinstance(x,str) else ",".join(x) for x in _meta_info[3]])}" \
--conditions "${",".join(_meta_info[4:])}" \
--skip-analysis-pip-cutoff "${",".join(skip_analysis_pip_cutoff)}" \
--maf ${maf} \
--mac ${mac} \
--imiss ${imiss} \
${"--indel" if indel else ""} \
${"--keep-samples " + str(keep_samples) if keep_samples.is_file() else ""} \
${"--keep-variants " + str(keep_variants) if not keep_variants.is_dir() else ""} \
${"--save-data" if save_data else ""} \
--pip-cutoff ${pip_cutoff} \
--coverage "${",".join([str(x) for x in coverage])}" \
--seed ${seed} \
--cwd ${cwd:a} \
${"--skip-fine-mapping" if skip_fine_mapping else ""} \
${"--skip-twas-weights" if skip_twas_weights else ""} \
--xvar-cutoff ${xvar_cutoff} \
--mvsusie-max-iter ${mvsusie_max_iter} \
--mrmash-max-iter ${mrmash_max_iter} \
--max-cv-variants ${max_cv_variants} \
--twas-cv-folds ${twas_cv_folds} \
--twas-cv-threads ${twas_cv_threads} \
${"--ld-reference-meta-file " + str(ld_reference_meta_file) if not ld_reference_meta_file.is_dir() else ""} \
${"--mixture-prior " + str(mixture_prior) if mixture_prior.is_file() else ""} \
${"--mixture-prior-cv " + str(mixture_prior_cv) if mixture_prior_cv.is_file() else ""} \
--prior-weights-min ${prior_weights_min} \
${"--prior-canonical-matrices" if prior_canonical_matrices else ""} \
${"--sample-partition " + str(sample_partition) if sample_partition.is_file() else ""} \
--output-prefix ${_output[0]:nn}
[mnm_genes]
depends: sos_variable("regional_data")
# Check if both 'data' and 'meta_info' are empty lists
stop_if(len(regional_data['data']) == 0, f'Either genotype or phenotype data are not available for region {", ".join(region_name)}.')
meta_info = regional_data['meta_info']
genes = meta_info[0][3]
if isinstance(genes, tuple):
genes = genes[0] if isinstance(genes[0], tuple) else list(genes)
else:
genes = [genes]
if len(skip_analysis_pip_cutoff) == 0:
skip_analysis_pip_cutoff = [0.0] * len(genes)
if len(skip_analysis_pip_cutoff) == 1:
skip_analysis_pip_cutoff = skip_analysis_pip_cutoff * len(genes)
skip_analysis_pip_cutoff = ["'{}'={}".format(genes[i], skip_analysis_pip_cutoff[i].split('=')[1]) for i in range(len(genes))]
parameter: pheno_id_map_file = path
parameter: fine_mapping_meta = path
parameter: data_driven_prior = False
# Parameters below are relevant when we use data driven prior
# which is an experimental feature
parameter: n_random = 10
parameter: n_null = 10
parameter: independent_variant_list = path()
parameter: prior_weights_min = 5e-4
parameter: prior_canonical_matrices = False
# This is relevant to cross-validation
parameter: sample_partition = path()
parameter: mvsusie_max_iter = 200
parameter: mrmash_max_iter = 5000
if not data_driven_prior:
prior_canonical_matrices = True
input: regional_data["data"],group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta_info"
if skip_fine_mapping and skip_twas_weights:
save_data = True
output_files = [f'{cwd:a}/data/{name}.{_meta_info[0].split(":")[0]}_{_meta_info[2]}.multigene_data.rds']
elif not skip_fine_mapping and skip_twas_weights:
output_files = [f'{cwd:a}/multivariate_fine_mapping/{name}.{_meta_info[0].split(":")[0]}_{_meta_info[2]}.multigene_bvsr.rds']
elif skip_fine_mapping and not skip_twas_weights:
output_files = [f'{cwd:a}/multivariate_twas_weights/{name}.{_meta_info[0].split(":")[0]}_{_meta_info[2]}.multigene_twas_weights.rds']
else:
output_files = [f'{cwd:a}/multivariate_fine_mapping/{name}.{_meta_info[0].split(":")[0]}_{_meta_info[2]}.multigene_bvsr.rds',
f'{cwd:a}/multivariate_twas_weights/{name}.{_meta_info[0].split(":")[0]}_{_meta_info[2]}.multigene_twas_weights.rds']
output: output_files
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output[0]:nn}.mnm_genes.stdout", stderr = f"{_output[0]:nn}.mnm_genes.stderr", container = container, entrypoint = entrypoint
library(data.table)
library(dplyr)
library(pecotmr)
combine_result_list <- function(univariate_finemapping_metadata, condition_name, conditions = NULL, extracted_region){
regional_window_combined_susie_result <- list()
regional_window_combined_sumstats_result <- list()
extracted_univariate_finemapping_metadata <- univariate_finemapping_metadata%>%filter(region_id%in%extracted_region)
for(i in 1:dim(extracted_univariate_finemapping_metadata)[1]){
gene_id <- extracted_univariate_finemapping_metadata$region_id[i]
susie_result <- readRDS(extracted_univariate_finemapping_metadata$combined_data[i])
sumstats_result <- readRDS(extracted_univariate_finemapping_metadata$combined_data_sumstats[i])
if (!is.null(conditions)) {
if (any(names(susie_result[[gene_id]]) %in% conditions)) {
extracted_conditions = names(susie_result[[gene_id]])[names(susie_result[[gene_id]]) %in% conditions]
for (j in 1:length(extracted_conditions)){
regional_window_combined_susie_result[[condition_name]][[extracted_conditions[j]]] <- susie_result[[gene_id]][[extracted_conditions[j]]]
regional_window_combined_sumstats_result[[condition_name]][[extracted_conditions[j]]] <- sumstats_result[[gene_id]][[extracted_conditions[j]]]
}
}
} else {
regional_window_combined_susie_result[[condition_name]][[gene_id]] <- susie_result[[gene_id]][[condition_name]]
regional_window_combined_sumstats_result[[condition_name]][[gene_id]] <- sumstats_result[[gene_id]][[condition_name]]
}
}
return(list(regional_window_combined_susie_result = regional_window_combined_susie_result, regional_window_combined_sumstats_result = regional_window_combined_sumstats_result))
}
extract_non_null_cs_list <- function(combined_list, condition_name, extracted_region) {
# List to store results for genes passing the first step
extracted_regional_window_combined_susie_result <- list()
extracted_regional_window_combined_sumstats_result <- list()
# Iterate over each gene's susie and sumstats results
filtered_region <- extracted_region[extracted_region%in%names(combined_list[["regional_window_combined_susie_result"]][[condition_name]])]
for (gene_id in filtered_region) {
susie_result <- get_nested_element(combined_list,c("regional_window_combined_susie_result",condition_name,gene_id))
# Check if there exits top_loci.
if (!is.null(susie_result[["top_loci"]])&&nrow(susie_result[["top_loci"]])[1]!=0) {
# If not all zeros, add this gene's susie and sumstats results to the lists
extracted_regional_window_combined_susie_result[[condition_name]][[gene_id]] <- get_nested_element(combined_list,c("regional_window_combined_susie_result",condition_name,gene_id))
extracted_regional_window_combined_sumstats_result[[condition_name]][[gene_id]] <- get_nested_element(combined_list,c("regional_window_combined_sumstats_result",condition_name,gene_id))
}
}
return(list(extracted_regional_window_combined_susie_result = extracted_regional_window_combined_susie_result, extracted_regional_window_combined_sumstats_result = extracted_regional_window_combined_sumstats_result))
}
process_and_compare_variants <- function(combined_list, condition_name) {
extracted_regional_window_combined_susie_result <- list()
extracted_regional_window_combined_sumstats_result <- list()
compare_variants <- function(variants1, variants2) {
return(any(variants1 %in% variants2) | any(variants2 %in% variants1))
}
subset_top_loci_variants <- function(df) {
find_non_zero_rows <- function(df) {
non_zero_indices <- which(rowSums(df!= 0) > 0)
return(non_zero_indices)
}
non_zero_indices <- find_non_zero_rows(df)
if (length(non_zero_indices) > 0) {
non_zero_index_variants <- df[non_zero_indices, , drop = FALSE]%>%select(variant_id)%>%pull(variant_id)
zero_index_filtered_variants <- df[-non_zero_indices, ,drop = FALSE]%>%filter(pip>=0.01)%>%select(variant_id)%>%pull(variant_id)
top_loci_variants <- c(non_zero_index_variants, zero_index_filtered_variants)
} else {
zero_index_filtered_variants <- df%>%filter(pip>=0.01)%>%select(variant_id)%>%pull(variant_id)
top_loci_variants <- zero_index_filtered_variants
}
return(top_loci_variants)
}
for (i in 1:length(combined_list[["extracted_regional_window_combined_susie_result"]][[condition_name]])) {
susie_result_1 <- combined_list[["extracted_regional_window_combined_susie_result"]][[condition_name]][[i]]
gene_id <- names(combined_list[["extracted_regional_window_combined_susie_result"]][[condition_name]])[i]
# Extract top_loci variants
variants1 <- subset_top_loci_variants(susie_result_1[["top_loci"]])
if (length(variants1) > 0) {
different_in_all_genes <- TRUE
for (j in 1:length(combined_list[["extracted_regional_window_combined_susie_result"]][[condition_name]])) {
if (i != j) {
susie_result_2 <- combined_list[["extracted_regional_window_combined_susie_result"]][[condition_name]][[j]]
variants2 <- subset_top_loci_variants(susie_result_2[["top_loci"]])
# Compare variants
if (length(variants2) > 0) {
if (compare_variants(variants1, variants2)) {
# If variants are not totally different across all genes, set the flag to FALSE
different_in_all_genes <- FALSE
break
}
}
}
}
# After the loop, check the flag
if (!different_in_all_genes) {
extracted_regional_window_combined_susie_result[[condition_name]][[gene_id]] <- get_nested_element(combined_list, c("extracted_regional_window_combined_susie_result", condition_name, gene_id))
extracted_regional_window_combined_sumstats_result[[condition_name]][[gene_id]] <- get_nested_element(combined_list, c("extracted_regional_window_combined_sumstats_result", condition_name, gene_id))
}
}
}
return(list(extracted_regional_window_combined_susie_result = extracted_regional_window_combined_susie_result, extracted_regional_window_combined_sumstats_result = extracted_regional_window_combined_sumstats_result))
}
fine_mapping_metadata = fread(${fine_mapping_meta:r})
phenotype_files = c(${",".join(['"%s"' % x.absolute() for x in _input[1:len(_input)//2+1]])})
covariate_files = c(${",".join(['"%s"' % x.absolute() for x in _input[len(_input)//2+1:]])})
condition_name = ${("'%s'" % _meta_info[-1])}
region_names = ${",".join([("c('"+x+"')") if isinstance(x, str) else ("c"+ str(x)) for x in _meta_info[3]])}
if (any(region_names %in% fine_mapping_metadata$region_id)) {
filtered_region <- fine_mapping_metadata%>%filter(region_id%in%region_names)%>%select(region_id)%>%pull(region_id)
filtered_event_names <- filtered_region
} else {
pheno_id_map <- fread(${pheno_id_map_file:r})
gene_ids <- pheno_id_map%>%filter(ID%in%region_names)%>%select(gene)%>%pull(gene)%>%{.[!duplicated(.)]}
filtered_region <- fine_mapping_metadata%>%filter(region_id%in%gene_ids)%>%select(region_id)%>%pull(region_id)
filtered_event_names <- pheno_id_map%>%filter(ID%in%region_names&gene%in%filtered_region)%>%select(ID)%>%pull(ID)
}
keep_samples = NULL
if (${"TRUE" if keep_samples.is_file() else "FALSE"}) {
keep_samples = unlist(strsplit(readLines(${keep_samples:ar}), "\\s+"))
message(paste(length(keep_samples), "samples are selected to be loaded for analysis"))
}
fdat = load_regional_multivariate_data(genotype = ${_input[0]:anr},
phenotype = phenotype_files,
covariate = covariate_files,
region = ${("'%s'" % _meta_info[1]) if int(_meta_info[1].split('-')[-1])>0 else 'NULL'},
maf_cutoff = ${maf},
mac_cutoff = ${mac},
imiss_cutoff = ${imiss},
conditions = NULL,
association_window = ${("'%s'" % _meta_info[1]) if int(_meta_info[1].split('-')[-1])>0 else 'NULL'},
extract_region_name = list(filtered_event_names),
region_name_col = ${"4" if int(_meta_info[1].split('-')[-1])>0 else "1"},
keep_indel = ${"TRUE" if indel else "FALSE"},
keep_samples = keep_samples,
keep_variants = ${'"%s"' % keep_variants if not keep_variants.is_dir() else "NULL"},
phenotype_header = ${"4" if int(_meta_info[1].split('-')[-1])>0 else "1"}, # skip first 4 rows of transposed phenotype for chr, start, end and ID
scale_residuals = FALSE)
if (${"TRUE" if save_data else "FALSE"}) {
saveRDS(fdat, "${_output[0]:ann}.multigene_data.rds", compress='xz')
}
if (${"FALSE" if skip_fine_mapping and skip_twas_weights else "TRUE"}) {
# Extract and consolidate univariate results to determine which genes to focus on
if (grepl("sQTL", condition_name)) {conditions <- paste(condition_name, filtered_event_names, sep = "_")} else {conditions <- NULL}
combined_regional_window_list <- combine_result_list(fine_mapping_metadata, condition_name, conditions, filtered_region)
if (grepl("sQTL",condition_name)) { filtered_region <- paste(condition_name, filtered_event_names, sep = "_") }
extracted_non_null_combined_susie_list <- extract_non_null_cs_list(combined_list = combined_regional_window_list, condition_name, filtered_region)
if (length(extracted_non_null_combined_susie_list[["extracted_regional_window_combined_susie_result"]])!=0) {
extracted_combined_susie_list <- process_and_compare_variants(combined_list = extracted_non_null_combined_susie_list, condition_name)
} else {
extracted_combined_susie_list <- NULL
}
if (!is.null(extracted_combined_susie_list)&length(extracted_combined_susie_list[["extracted_regional_window_combined_susie_result"]])!=0) {
if (grepl("sQTL", condition_name)) {
filtered_event_names <- sub(paste0("^", condition_name, "_"), "", names(get_nested_element(extracted_combined_susie_list,c("extracted_regional_window_combined_susie_result", condition_name))))
} else if (grepl("pQTL", condition_name)) {
filtered_event_names <- pheno_id_map %>%
filter(ID%in%region_names)%>%
filter(gene %in% names(get_nested_element(extracted_combined_susie_list, c("extracted_regional_window_combined_susie_result", condition_name)))) %>%
select(ID) %>% pull(ID)
} else {
filtered_event_names <- names(get_nested_element(extracted_combined_susie_list, c("extracted_regional_window_combined_susie_result", condition_name)))
}
} else {
filtered_event_names <- NULL
}
if ("${_meta_info[2]}" != "${_meta_info[3]}") {
region_name = c("${_meta_info[2]}", c(${",".join([("c('"+x+"')") if isinstance(x, str) else ("c"+ str(x)) for x in _meta_info[3]])}))
} else {
region_name = "${_meta_info[2]}"
}
region_info = list(region_coord=parse_region("${_meta_info[0]}"), grange=parse_region("${_meta_info[1]}"), region_name=region_name)
dd_prior <- NULL
if (${"TRUE" if data_driven_prior else "FALSE"}) {
# FIXME: please complete this function call
dd_prior <- multigene_udr(extracted_combined_susie_list, coverage = paste0("cs_coverage_",${coverage[0]},sep=""), independent_variant_list = ${independent_variant_list:r}, n_random = ${n_random}, n_null = ${n_null}, seed = ${seed})
}
if (length(filtered_event_names)>=2) {
pip_cutoff_to_skip = list(${",".join(skip_analysis_pip_cutoff)})
pip_cutoff_to_skip = unlist(pip_cutoff_to_skip[filtered_event_names])
dd_prior_cv <- dd_prior
set.seed(${seed})
result <- multivariate_analysis_pipeline(
X=fdat$X,
Y=fdat$residual_Y[,filtered_event_names],
maf=fdat$maf,
X_variance = fdat$X_variance,
other_quantities = list(dropped_samples = fdat$dropped_samples),
# filters
imiss_cutoff = ${imiss},
maf_cutoff = ${maf},
xvar_cutoff = ${xvar_cutoff},
ld_reference_meta_file = ${('"%s"' % ld_reference_meta_file) if not ld_reference_meta_file.is_dir() else "NULL"},
pip_cutoff_to_skip = pip_cutoff_to_skip,
# methods parameter configuration
max_L = -1,
data_driven_prior_matrices = dd_prior,
data_driven_prior_matrices_cv = dd_prior_cv,
data_driven_prior_weights_cutoff = ${prior_weights_min},
canonical_prior_matrices = ${"TRUE" if prior_canonical_matrices else "FALSE"},
mvsusie_max_iter = ${mvsusie_max_iter},
estimate_residual_variance = ${'TRUE' if estimate_residual_variance else 'FALSE'},
mrmash_max_iter = ${mrmash_max_iter},
# fine-mapping results summary
signal_cutoff = ${pip_cutoff},
coverage = c(${",".join([str(x) for x in coverage])}),
# TWAS weights and CV for TWAS weights
twas_weights = ${"TRUE" if not skip_twas_weights else "FALSE"},
sample_partition = ${"NULL" if sample_partition=='.' or sample_partition=='' else sample_partition},
max_cv_variants = ${max_cv_variants},
cv_folds = ${twas_cv_folds},
cv_threads = ${twas_cv_threads}
)
result$region_info <- region_info
if (!is.null(result$twas_weights_result)) {
for (r in names(result$twas_weights_result)) {
result$twas_weights_result[[r]]$region_info <- region_info
}
saveRDS(list("${_meta_info[2]}" = result$twas_weights_result), "${_output[0]:ann}.multigene_twas_weights.rds", compress='xz')
result$twas_weights_result <- NULL
}
saveRDS(list("${_meta_info[2]}" = result), "${_output[0]:ann}.multigene_bvsr.rds", compress='xz')
} else {
for (f in c(${_output:ar,})) {
saveRDS(NULL, f)
}
}
}
[fsusie]
# prior can be either of ["mixture_normal", "mixture_normal_per_scale"]
parameter: prior = "mixture_normal"
parameter: max_SNP_EM = 100
# Max scale is such that 2^max_scale being the number of phenotypes in the transformed space. Default to 2^10 = 1024. Don't change it unless you know what you are doing. Max_scale should be at least larger than 5.
parameter: max_scale = 10
# Purity and coverage used to call cs
parameter: min_purity = 0.5
# Epigenetics mark filter
parameter: epigenetics_mark_treshold = 16
# Run susie for top pc of the fsusie input
parameter: susie_top_pc = 10
# Run fsusie with this post-processing option. Available options are TI, HMM, and none. TI is recommended as default. HMM performs particularly well when analyzing data with, say, few sampling points (i.e., ncol(Y)<30) or when the data are particularly noisy (low signal-to-noise ratio).
parameter: post_processing = "TI"
# Run fsusie with small sample correction
parameter: small_sample_correction = False
depends: sos_variable("regional_data")
# Check if both 'data' and 'meta_info' are empty lists
stop_if(len(regional_data['data']) == 0, f'Either genotype or phenotype data are not available for region {", ".join(region_name)}.')
meta_info = regional_data['meta_info']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta_info"
output: f'{cwd:a}/{step_name[:-2]}/{name}.{_meta_info[0].replace(":", "_").replace("-", "_")}.fsusie_{prior}_{post_processing}_{"_top_pc_weights" if not skip_twas_weights else ""}.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout", container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/mnm_analysis/mnm_methods/fsusie.R \
--genotype ${_input[0]:a} \
--phenotype "${",".join([str(x.absolute()) for x in _input[1:len(_input)//2+1]])}" \
--covariate "${",".join([str(x.absolute()) for x in _input[len(_input)//2+1:]])}" \
--region "${_meta_info[0]}" \
--window "${_meta_info[1]}" \
--region-name "${_meta_info[2]}" \
--conditions "${",".join(_meta_info[4:])}" \
--maf ${maf} \
--mac ${mac} \
--imiss ${imiss} \
${"--indel" if indel else ""} \
${"--keep-samples " + str(keep_samples) if keep_samples.is_file() else ""} \
${"--keep-variants " + str(keep_variants) if not keep_variants.is_dir() else ""} \
--prior "${prior}" \
--max-snp-em ${max_SNP_EM} \
--max-scale ${max_scale} \
--min-purity ${min_purity} \
--epigenetics-mark-threshold ${epigenetics_mark_treshold} \
--susie-top-pc ${susie_top_pc} \
--post-processing "${post_processing}" \
${"--small-sample-correction" if small_sample_correction else ""} \
--pip-cutoff ${pip_cutoff} \
--coverage "${",".join([str(x) for x in coverage])}" \
--L-greedy ${init_L} \
--L ${L} \
${"--skip-twas-weights" if skip_twas_weights else ""} \
--max-cv-variants ${max_cv_variants} \
--twas-cv-folds ${twas_cv_folds} \
--twas-cv-threads ${twas_cv_threads} \
${"--save-data" if save_data else ""} \
--output-prefix ${_output:n} \
--cwd ${cwd:a}
mvfsusie (WIP placeholder)#
This section is still under development in the original pipeline and is included here as a placeholder only. It is not part of the runnable MWE.
[mvfsusie]
# prior can be either of ["mixture_normal", "mixture_normal_per_scale"]
parameter: prior = "mixture_normal_per_scale"
parameter: max_SNP_EM = 1000
depends: sos_variable("regional_data")
# Check if both 'data' and 'meta_info' are empty lists
stop_if(len(regional_data['data']) == 0, f'Either genotype or phenotype data are not available for region {", ".join(region_name)}.')
meta_info = regional_data['meta_info']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta_info"
output: f'{cwd:a}/{step_name[:-2]}/{name}.{_meta_info[0]}.mvfsusie_{prior}.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
# Load regional association data
fdat = load_regional_association_data(genotype = ${_input[0]:anr},
phenotype = c(${",".join(['"%s"' % x.absolute() for x in _input[1::2]])}),
covariate = c(${",".join(['"%s"' % x.absolute() for x in _input[2::2]])}),
region = ${'"%s:%s-%s"' % (_meta_info[1], _meta_info[2], _meta_info[3])},
maf_cutoff = ${maf},
mac_cutoff = ${mac},
imiss_cutoff = ${imiss})
# Fine-mapping with mvfSuSiE
library("mvf.susie.alpha")
Y = map(fdat$residual_Y, ~left_join(fdat$X[,1]%>%as.data.frame%>%rownames_to_column("rowname"), .x%>%t%>%as.data.frame%>%rownames_to_column("rowname") , by = "rowname")%>%select(-2)%>%column_to_rownames("rowname")%>%as.matrix )
fitted <- multfsusie(Y_f = list(Y[[1]],Y[[3]]),
Y_u = Reduce(cbind, Y[[2]]),
pos = list(pos1 =fdat$phenotype_coordiates[[1]], pos2 = fdat$phenotype_coordiates[[3]]),
X=X,
L=${max_L},
data.format="list_df")
saveRDS(fitted, ${_output:ar})