MASH analysis pipeline with data-driven prior matrices#
In this notebook, we utilize the MASH prior, referred to as the mixture_prior, from a previous step. Our objective is to conduct a multivariate analysis under the MASH model. After fitting the model, we subsequently compute the posteriors for our variables of interest.
Prerequisites#
Requires protocol_example.mash_input.rds from mash_preprocessing. See that notebook to generate this file.
Description#
Multivariate adaptive shrinkage (MASH) analysis of eQTL data#
Since we published Urbut 2019, we have improved implementation of MASH algorithm and made a new R package, mashr. Major improvements compared to Urbut 2019 are:
Faster computation of likelihood and posterior quantities via matrix algebra tricks and a C++ implementation.
Faster computation of MASH mixture via convex optimization.
New ways to estimate prior in place of the
SFAapproach, seemixture_prior.ipynbfor details.Improve estimate of residual variance \(\hat{V}\).
At this point, the input data have already been converted from the original association summary statistics to a format convenient for analysis in MASH.
Input#
--data: input summary statistics data (MASH-format RDS) produced bymash_preprocessing.--vhat-data: estimated residual variance \(\hat{V}\) (RDS).--prior-data: the data-driven prior matrices RDS generated bymixture_prior.
MWE data are available on synapse.org.
Output#
A fitted MASH model file
*.mash_model.rdscontaining the mixture weights.(Optional) a posterior file
*.posterior.rdswith posterior quantities (e.g.lfsr) for the “strong” set of gene-SNP pairs.
Steps#
Using the MASH prior previously generated, fit the MASH model and compute posteriors.
Step 1. Fit the MASH mixture model and compute posteriors for the strong set:
Timing (toy chr22 dataset):
mash_fit: ~1-2 min
sos run pipeline/mash_fit.ipynb mash \
--output-prefix protocol_example_mash \
--data input/mash/protocol_example.EE.mash.rds \
--vhat-data input/mash/protocol_example.EE.V_simple.rds \
--prior-data input/mash/protocol_example.EE.prior.rds \
--effect-model EE \
--compute-posterior \
--cwd output/mash_fit
Command interface#
sos run pipeline/mash_fit.ipynb -h
Workflow implementation#
The mash workflow has two steps. mash_1 fits the mashr mixture model; mash_2 (optional, on by default) computes posteriors for the “strong” set as in Urbut et al 2017. Use --no-compute-posterior to skip the posterior step.
[global]
parameter: cwd = path('./mashr_workflow_output')
# Input summary statistics data
parameter: data = path
parameter: prior_data = path
parameter: vhat_data = path
# Prefix of output files. If not specified, it will derive it from data.
# If it is specified, for example, `--output-prefix AnalysisResults`
# It will save output files as `{cwd}/AnalysisResults*`.
parameter: output_prefix = ''
parameter: output_suffix = 'all'
# Exchangable effect (EE) or exchangable z-scores (EZ)
parameter: effect_model = 'EE'
parameter: container = ""
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 1
data = data.absolute()
cwd = cwd.absolute()
prior_data = prior_data.absolute()
vhat_data = vhat_data.absolute()
if len(output_prefix) == 0:
output_prefix = f"{data:bn}"
mash_model = file_target(f"{cwd:a}/{output_prefix}.{effect_model}.mash_model.rds")
def sort_uniq(seq):
seen = set()
return [x for x in seq if not (x in seen or seen.add(x))]
mashr mixture model fitting#
Main reference are our mashr vignettes this for mashr eQTL outline and this for using FLASH prior.
The outcome of this workflow should be found under ./mashr_workflow_output folder (can be configured). File names have pattern *.mash_model_*.rds. They can be used to computer posterior for input list of gene-SNP pairs (see next section).
# Fit MASH mixture model (time estimate: <15min for 70K by 49 matrix)
[mash_1]
parameter: output_level = 4
input: data, vhat_data, prior_data
output: mash_model
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout", container = container
library(mashr)
dat = readRDS(${_input[0]:r})
vhat = readRDS(${_input[1]:r})
U = readRDS(${_input[2]:r})$U
mash_data = mash_set_data(dat$random.b, Shat=dat$random.s, alpha=${1 if effect_model == 'EZ' else 0}, V=vhat, zero_Bhat_Shat_reset = 1E3)
saveRDS(list(mash_model = mash(mash_data, Ulist = U, outputlevel = ${output_level}), vhat_file=${_input[1]:r}, prior_file=${_input[2]:r}), ${_output:r})
Optional posterior computations#
Additionally provide posterior for the “strong” set in MASH input data.
Anticipated Results#
Produces protocol_example.mash_model.rds with fitted MASH covariance matrices across 16 conditions. Expect ~10 canonical + data-driven covariance components.
# Compute posterior for the "strong" set of data as in Urbut et al 2017.
# This is optional because most of the time we want to apply the
# MASH model learned on much larger data-set.
[mash_2]
# default to True; use --no-compute-posterior to disable this
parameter: compute_posterior = True
# input Vhat file for the batch of posterior data
skip_if(not compute_posterior)
input: data, vhat_data, mash_model
output: f"{cwd:a}/{output_prefix}.{effect_model}.posterior.rds"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout", container = container
library(mashr)
dat = readRDS(${_input[0]:r})
vhat = readRDS(${_input[1]:r})
mash_data = mash_set_data(dat$strong.b, Shat=dat$strong.s, alpha=${1 if effect_model == 'EZ' else 0}, V=vhat, zero_Bhat_Shat_reset = 1E3)
mash_model = readRDS(${_input[2]:ar})$mash_model
saveRDS(mash_compute_posterior_matrices(mash_model, mash_data), ${_output:r})