MASH analysis pipeline with data-driven prior matrices

MASH analysis pipeline with data-driven prior matrices#

In this notebook, we utilize the MASH prior, referred to as the mixture_prior, from a previous step. Our objective is to conduct a multivariate analysis under the MASH model. After fitting the model, we subsequently compute the posteriors for our variables of interest.

Prerequisites#

Requires protocol_example.mash_input.rds from mash_preprocessing. See that notebook to generate this file.

Description#

Multivariate adaptive shrinkage (MASH) analysis of eQTL data#

Since we published Urbut 2019, we have improved implementation of MASH algorithm and made a new R package, mashr. Major improvements compared to Urbut 2019 are:

Faster computation of likelihood and posterior quantities via matrix algebra tricks and a C++ implementation.
Faster computation of MASH mixture via convex optimization.
New ways to estimate prior in place of the SFA approach, see mixture_prior.ipynb for details.
Improve estimate of residual variance \(\hat{V}\).

At this point, the input data have already been converted from the original association summary statistics to a format convenient for analysis in MASH.

Input#

--data: input summary statistics data (MASH-format RDS) produced by mash_preprocessing.
--vhat-data: estimated residual variance \(\hat{V}\) (RDS).
--prior-data: the data-driven prior matrices RDS generated by mixture_prior.

MWE data are available on synapse.org.

Output#

A fitted MASH model file *.mash_model.rds containing the mixture weights.
(Optional) a posterior file *.posterior.rds with posterior quantities (e.g. lfsr) for the “strong” set of gene-SNP pairs.

Steps#

Using the MASH prior previously generated, fit the MASH model and compute posteriors.

Step 1. Fit the MASH mixture model and compute posteriors for the strong set:

Timing (toy chr22 dataset):

mash_fit: ~1-2 min

sos run pipeline/mash_fit.ipynb mash \
    --output-prefix protocol_example_mash \
    --data input/mash/protocol_example.EE.mash.rds \
    --vhat-data input/mash/protocol_example.EE.V_simple.rds \
    --prior-data input/mash/protocol_example.EE.prior.rds \
    --effect-model EE \
    --compute-posterior \
    --cwd output/mash_fit

Command interface#

sos run pipeline/mash_fit.ipynb -h

Workflow implementation#

The mash workflow has two steps. mash_1 fits the mashr mixture model; mash_2 (optional, on by default) computes posteriors for the “strong” set as in Urbut et al 2017. Use --no-compute-posterior to skip the posterior step.

[global]
parameter: cwd = path('./mashr_workflow_output')
# Input summary statistics data
parameter: data = path
parameter: prior_data = path
parameter: vhat_data = path
# Prefix of output files. If not specified, it will derive it from data.
# If it is specified, for example, `--output-prefix AnalysisResults`
# It will save output files as `{cwd}/AnalysisResults*`.
parameter: output_prefix = ''
parameter: output_suffix = 'all'
# Exchangable effect (EE) or exchangable z-scores (EZ)
parameter: effect_model = 'EE'
parameter: container = ""
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 1
data = data.absolute()
cwd = cwd.absolute()
prior_data = prior_data.absolute()
vhat_data = vhat_data.absolute()
if len(output_prefix) == 0:
    output_prefix = f"{data:bn}"

mash_model = file_target(f"{cwd:a}/{output_prefix}.{effect_model}.mash_model.rds")

def sort_uniq(seq):
    seen = set()
    return [x for x in seq if not (x in seen or seen.add(x))]

`mashr` mixture model fitting#

Main reference are our mashr vignettes this for mashr eQTL outline and this for using FLASH prior.

The outcome of this workflow should be found under ./mashr_workflow_output folder (can be configured). File names have pattern *.mash_model_*.rds. They can be used to computer posterior for input list of gene-SNP pairs (see next section).

# Fit MASH mixture model (time estimate: <15min for 70K by 49 matrix)
[mash_1]
parameter: output_level = 4
input: data, vhat_data, prior_data
output: mash_model

task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout", container = container
    library(mashr)
    dat = readRDS(${_input[0]:r})
    vhat = readRDS(${_input[1]:r})
    U = readRDS(${_input[2]:r})$U
    mash_data = mash_set_data(dat$random.b, Shat=dat$random.s, alpha=${1 if effect_model == 'EZ' else 0}, V=vhat, zero_Bhat_Shat_reset = 1E3)
    saveRDS(list(mash_model = mash(mash_data, Ulist = U, outputlevel = ${output_level}), vhat_file=${_input[1]:r}, prior_file=${_input[2]:r}), ${_output:r})

Optional posterior computations#

Additionally provide posterior for the “strong” set in MASH input data.

Anticipated Results#

Produces protocol_example.mash_model.rds with fitted MASH covariance matrices across 16 conditions. Expect ~10 canonical + data-driven covariance components.

# Compute posterior for the "strong" set of data as in Urbut et al 2017.
# This is optional because most of the time we want to apply the 
# MASH model learned on much larger data-set.
[mash_2]
# default to True; use --no-compute-posterior to disable this
parameter: compute_posterior = True
# input Vhat file for the batch of posterior data
skip_if(not compute_posterior)

input: data, vhat_data, mash_model
output: f"{cwd:a}/{output_prefix}.{effect_model}.posterior.rds"

task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
R: expand = "${ }", workdir = cwd, stderr = f"{_output:n}.stderr", stdout = f"{_output:n}.stdout", container = container
    library(mashr)
    dat = readRDS(${_input[0]:r})
    vhat = readRDS(${_input[1]:r})
    mash_data = mash_set_data(dat$strong.b, Shat=dat$strong.s, alpha=${1 if effect_model == 'EZ' else 0}, V=vhat, zero_Bhat_Shat_reset = 1E3)
    mash_model = readRDS(${_input[2]:ar})$mash_model
    saveRDS(mash_compute_posterior_matrices(mash_model, mash_data), ${_output:r})