Mixture Multivariate Distribution Estimate

This miniprotocol estimates a mixture multivariate prior, fits the Multivariate Adaptive Shrinkage (MASH) model, and visualizes the estimated covariance components. It is a thin driver that chains two reusable workflow notebooks:

mixture_prior.ipynb: computes a prior that is independent of the specific analysis method chosen for the data. This foundational step supports several estimation techniques, including UDR, ED, TED, and initialization with FLASHier. The goal is to fit a mixture model that extracts meaningful signal from the data. An earlier version of the approach is outlined in Urbut et al. 2019; this workflow adds several improvements, including additional EBMF methods and the new udr (Ultimate Deconvolution in R) package for fitting the mixture model.
mash_fit.ipynb: once the prior is computed, fits the MASH model and computes posteriors for the variables of interest, with the objective of conducting a multivariate analysis under the MASH model. The MASH implementation has been improved since the Urbut et al. 2019 paper.

Synthetic data note. This user-friendly copy is wired to small protocol_example.* toy inputs under input/ (a 34-context mashr input RDS and the matching MASH input/prior/V files) so the whole chain runs end-to-end without access-controlled data. Swap in your own RDS files to run on real data.

Description

This vignette describes the conceptual background and statistical theory behind the mixture prior estimation used in multivariate analysis. Unlike the executable mini-protocol notebooks, this vignette is primarily a reference document — it explains why the mixture prior is structured this way and how it relates to MASH covariance estimation.

For the runnable minimal working example (MWE), see mash_fit and mash_posterior. For the mixture prior step itself, see mixture_prior.

Input

The mixture-prior step takes a single mashr input RDS (the --data argument). For the toy example this is input/mash/protocol_example.mashr_input.rds, a list with effect-size (*.b), z-score (*.z), and standard-error (*.s) matrices for the random, strong, and null SNP sets, each with rows = SNPs and columns = conditions/contexts (e.g. cell types), plus the XtX matrix.

The MASH-fit step additionally consumes the prior produced by the first step together with the matching MASH input and V matrices.

| File | Description |
| --- | --- |
| `input/mash/protocol_example.mashr_input.rds` | Mixture-prior input: list of effect/z/se matrices (`random`, `strong`, `null`) + `XtX`, used by `mixture_prior` |
| `input/mash/protocol_example.EE.mash.rds` | MASH input data (EE effect model) consumed by `mash_fit` |
| `input/mash/protocol_example.EE.V_simple.rds` | Estimated residual correlation matrix `V` for `mash_fit` |
| `input/mash/protocol_example.EE.prior.rds` | Mixture prior RDS (also produced by step 1) consumed by `mash_fit` |
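Before running anything, it can help to confirm that the staged inputs listed above resolve to real files. The following is a minimal POSIX shell sketch, not part of the pipeline: `check_input` is a hypothetical helper introduced here, and the paths are the toy inputs from the table. Because the toy inputs may be symlinks, it reports a dangling symlink separately from a file that is missing outright.

```shell
# check_input is a hypothetical helper, not provided by the pipeline.
# It prints one status line per path and returns non-zero on a problem.
check_input() {
    if [ -L "$1" ] && [ ! -e "$1" ]; then
        # The link itself exists, but its target does not.
        echo "dangling symlink: $1"
        return 1
    elif [ ! -e "$1" ]; then
        echo "missing: $1"
        return 1
    fi
    echo "ok: $1"
}

# Paths taken from the input table; adjust them when using your own data.
for f in \
    input/mash/protocol_example.mashr_input.rds \
    input/mash/protocol_example.EE.mash.rds \
    input/mash/protocol_example.EE.V_simple.rds \
    input/mash/protocol_example.EE.prior.rds
do
    check_input "$f" || true   # keep going so every problem is reported
done
```

Run it from the directory that contains `input/`. Any line not starting with `ok:` points at a file to stage before Step 1.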

Workflow options carried by the underlying notebooks:

--vhat: one of identity, simple, mle, vhat_corshrink_xcondition, or vhat_simple_specific.
--cwd: output path.
--vhat-data: for mash_fit, the vhat data RDS produced in the mixture_prior step.
--prior-data: for mash_fit, the prior data RDS produced in the mixture_prior step.
--compute-posterior: for mash_fit, whether posterior probabilities should be calculated.

Output

The mixture-prior step writes the estimated data-driven covariance components and the assembled mixture prior; the MASH-fit step writes the fitted MASH model and (optionally) per-unit posteriors; the plotting step writes a PDF visualizing the prior covariance matrices.

| File | Description |
| --- | --- |
| `output/mixture_prior/protocol_example.EE.prior.rds` | Assembled mixture prior (data-driven + canonical covariance components) from `ed_bovy` |
| `output/mash_fit/protocol_example_mash.*.rds` | Fitted MASH model and posterior summaries (loglik, fitted_g, posterior means/SDs) |
| `output/mixture_prior/protocol_example_plots.EE.prior.pdf` | Heatmap visualization of the estimated prior covariance matrices |

Steps

The miniprotocol runs in three steps. Each step calls a reusable workflow notebook through the pipeline/ symlinks. Outputs of step 1 feed into steps 2 and 3.

Step 1. Compute the MASH mixture prior with the ed_bovy workflow of mixture_prior (estimates data-driven covariance components and assembles the prior).

Step 2. Fit the MASH model with the mash workflow of mash_fit, supplying the prior from step 1 together with the matching MASH input and V matrix (staged under input/mash/ in the toy example), and compute posteriors.

Step 3. Visualize the estimated prior covariance matrices with the plot_U workflow of mixture_prior.

# Step 1: estimate the mixture prior
sos run pipeline/mixture_prior.ipynb ed_bovy \
    --output-prefix protocol_example \
    --data input/mash/protocol_example.mashr_input.rds \
    --cwd output/mixture_prior

# Step 2: fit the MASH model and compute posteriors
sos run pipeline/mash_fit.ipynb mash \
    --output-prefix protocol_example_mash \
    --data input/mash/protocol_example.EE.mash.rds \
    --vhat-data input/mash/protocol_example.EE.V_simple.rds \
    --prior-data output/mixture_prior/protocol_example.EE.prior.rds \
    --effect-model EE \
    --compute-posterior \
    --cwd output/mash_fit

# Step 3: plot the estimated prior covariance matrices
sos run pipeline/mixture_prior.ipynb plot_U \
    --output-prefix protocol_example_plots \
    --data output/mixture_prior/protocol_example.EE.prior.rds \
    --cwd output/mixture_prior
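Because steps 2 and 3 read the prior written by step 1, it is convenient to chain the three commands so that a failure stops the run. The sketch below is one possible wrapper, not something shipped with the pipeline: `run_step`, `run_pipeline`, and the `DRY_RUN` variable are hypothetical names introduced here, while every `sos run` argument is copied from the commands above. It defaults to a dry run that only prints the commands.

```shell
# Hypothetical wrapper: DRY_RUN, run_step and run_pipeline are names
# introduced for this sketch. With DRY_RUN=1 (the default here) commands
# are only printed; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run_step() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@" || { echo "step failed: $*" >&2; return 1; }
    fi
}

run_pipeline() {
    # '&&' ensures a later step runs only if the previous one succeeded.
    run_step sos run pipeline/mixture_prior.ipynb ed_bovy \
        --output-prefix protocol_example \
        --data input/mash/protocol_example.mashr_input.rds \
        --cwd output/mixture_prior &&
    run_step sos run pipeline/mash_fit.ipynb mash \
        --output-prefix protocol_example_mash \
        --data input/mash/protocol_example.EE.mash.rds \
        --vhat-data input/mash/protocol_example.EE.V_simple.rds \
        --prior-data output/mixture_prior/protocol_example.EE.prior.rds \
        --effect-model EE \
        --compute-posterior \
        --cwd output/mash_fit &&
    run_step sos run pipeline/mixture_prior.ipynb plot_U \
        --output-prefix protocol_example_plots \
        --data output/mixture_prior/protocol_example.EE.prior.rds \
        --cwd output/mixture_prior
}

run_pipeline
```

Saved as a script (say `run_all.sh`, a hypothetical file name), `DRY_RUN=0 sh run_all.sh` would execute the chain. The default dry run is a quick way to review the exact commands first.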

Command interface

Each step is driven by one of two reusable workflow notebooks. List their full option sets with -h:

sos run pipeline/mixture_prior.ipynb -h
sos run pipeline/mash_fit.ipynb -h

Workflow implementation

This notebook is a driver/miniprotocol: it does not define its own SoS sections but orchestrates two reusable workflow notebooks via the pipeline/ symlinks:

pipeline/mixture_prior.ipynb — provides the ed_bovy (mixture-prior estimation) and plot_U (prior visualization) workflows, among other covariance-estimation methods (flash, pca, canonical, ud, vhat_*, …).
pipeline/mash_fit.ipynb — provides the mash workflow (MASH model fit + posterior computation).

Refer to those notebooks for the underlying R implementation of each step.

Troubleshooting

| Issue | Cause | Solution |
| --- | --- | --- |
| `cannot open compressed file ... No such file or directory` | `--data` RDS path is wrong or the toy input is not staged | Confirm `input/mash/protocol_example.mashr_input.rds` exists (it is a symlink into `data/`) |
| Step 2 fails on `--prior-data` / `--vhat-data` | Step 1 has not been run yet, so the prior RDS does not exist, or the V RDS passed as `--vhat-data` is not staged | Run Step 1 first; it writes the prior into `output/mixture_prior/`. Also confirm the `--vhat-data` path exists |
| `Incorrect workflow name` | The workflow name was omitted from `sos run` | Always pass the explicit workflow (`ed_bovy`, `mash`, or `plot_U`) before the `--` parameters |
| Empty or tiny plot PDF | `plot_U` was pointed at a prior RDS that has no data-driven components | Use the `*.EE.prior.rds` produced by Step 1 as `--data` |

Anticipated Results

Each step writes its files to the subdirectory of output/ given by its --cwd argument (output/mixture_prior/ for steps 1 and 3, output/mash_fit/ for step 2). Verify success by checking that the output files exist and are non-empty. See the Output section above for the expected file names and formats.
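The existence and non-empty check can be scripted. The following is a minimal sketch under stated assumptions: `verify_output` is a hypothetical helper introduced here, the patterns are the file names from the Output section, and paths are assumed to contain no whitespace because the pattern is deliberately expanded unquoted. It confirms that files are present and non-empty, but it does not validate their contents.

```shell
# verify_output is a hypothetical helper: it succeeds if at least one file
# matches the glob pattern in $1 and every match is non-empty.
verify_output() {
    found=0
    for f in $1; do            # unquoted on purpose so the glob expands
        [ -e "$f" ] || continue    # an unmatched pattern stays literal; skip it
        found=1
        [ -s "$f" ] || { echo "empty: $f"; return 1; }
    done
    [ "$found" -eq 1 ] || { echo "missing: $1"; return 1; }
    echo "ok: $1"
}

# Expected outputs, as listed in the Output section.
for pattern in \
    'output/mixture_prior/protocol_example.EE.prior.rds' \
    'output/mash_fit/protocol_example_mash.*.rds' \
    'output/mixture_prior/protocol_example_plots.EE.prior.pdf'
do
    verify_output "$pattern" || true   # report every pattern, not just the first failure
done
```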