Mixture Multivariate Distribution Estimate#
This miniprotocol estimates a mixture multivariate prior, fits the Multivariate Adaptive Shrinkage (MASH) model, and visualizes the estimated covariance components. It is a thin driver that chains two reusable workflow notebooks:
- mixture_prior.ipynb: computes a prior independent of the specific analysis method chosen for the data. This foundational step enables the application of various techniques such as UDR, ED, TED, and initialization with FLASHier, among others. The goal is to establish a mixture model to extract meaningful signals from the data. An earlier version of the approach is outlined in Urbut et al 2019; this workflow implements a few improvements, including additional EBMF methods as well as the new udr (Ultimate Deconvolution in R) package to fit the mixture model.
- mash_fit.ipynb: once priors are calculated, the MASH model is fit and posteriors are calculated for variables of interest, with the objective of conducting a multivariate analysis under the MASH model. The MASH implementation has been improved since the Urbut et al 2019 paper.
Synthetic data note. This user-friendly copy is wired to small protocol_example.* toy inputs under input/ (a 34-context mashr input RDS and the matching MASH input/prior/V files), so the whole chain runs end-to-end without access-controlled data. Swap in your own RDS files to run on real data.
Description#
This vignette describes the conceptual background and statistical theory behind the mixture prior estimation used in multivariate analysis. Unlike the executable mini-protocol notebooks, this vignette is primarily a reference document — it explains why the mixture prior is structured this way and how it relates to MASH covariance estimation.
For the runnable MWE, see mash_fit and mash_posterior. For the mixture prior step itself, see mixture_prior.
Input#
The mixture-prior step takes a single mashr input RDS (the --data argument). For the toy example this is input/mash/protocol_example.mashr_input.rds, a list with effect-size (*.b), z-score (*.z), and standard-error (*.s) matrices for the random, strong, and null SNP sets, plus the XtX matrix. Each matrix has rows = SNPs and columns = the conditions/contexts (e.g. cell types).
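To see what an input list actually contains before running step 1, a small shell helper can print each element's name and dimensions. This is only a sketch: the function name rds_summary is hypothetical, and it assumes Rscript is available on PATH. It makes no assumption about the element names and prints whatever the list holds.

```shell
# Hypothetical helper: print each element name of an RDS list with its dimensions.
# Assumes Rscript is on PATH; returns 127 if it is not.
rds_summary() {
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript not found on PATH" >&2
        return 127
    fi
    # The file path is passed to R as a trailing command-line argument.
    Rscript -e 'x <- readRDS(commandArgs(trailingOnly = TRUE)[1]); for (n in names(x)) cat(n, ":", paste(dim(x[[n]]), collapse = " x "), "\n")' "$1"
}

# Usage with the toy input named in this section:
# rds_summary input/mash/protocol_example.mashr_input.rds
```

Each matrix should report the same number of columns, one per condition/context.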
The MASH-fit step additionally consumes the prior produced by the first step together with the matching MASH input and V matrices.
| File | Description |
|---|---|
| | Mixture-prior input: list of effect/z/se matrices ( |
| | MASH input data (EE effect model) consumed by |
| | Estimated residual correlation matrix |
| | Mixture prior RDS (also produced by step 1) consumed by |
Workflow options carried by the underlying notebooks:
- --vhat: one of identity, simple, mle, vhat_corshrink_xcondition, or vhat_simple_specific.
- --cwd: output path.
- --vhat-data: for mash_fit, the vhat data RDS produced in the mixture_prior step.
- --prior-data: for mash_fit, the prior data RDS produced in the mixture_prior step.
- --compute-posterior: for mash_fit, whether posterior probabilities should be calculated.
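Because --vhat accepts only a fixed set of values, a wrapper script may want to reject a typo before launching a workflow. The sketch below is one way to do that; is_valid_vhat is a hypothetical helper, not part of the notebooks, and it encodes only the values listed above.

```shell
# Hypothetical helper: succeed only for the --vhat values listed in this document.
is_valid_vhat() {
    case "$1" in
        identity|simple|mle|vhat_corshrink_xcondition|vhat_simple_specific)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Usage:
# is_valid_vhat "$vhat" || { echo "unsupported --vhat value: $vhat" >&2; exit 1; }
```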
Output#
The mixture-prior step writes the estimated data-driven covariance components and the assembled mixture prior; the MASH-fit step writes the fitted MASH model and (optionally) per-unit posteriors; the plotting step writes a PDF visualizing the prior covariance matrices.
| File | Description |
|---|---|
| | Assembled mixture prior (data-driven + canonical covariance components) from |
| | Fitted MASH model and posterior summaries (loglik, fitted_g, posterior means/SDs) |
| | Heatmap visualization of the estimated prior covariance matrices |
Steps#
The miniprotocol runs in three steps. Each step calls a reusable workflow notebook through the pipeline/ symlinks. Outputs of step 1 feed into steps 2 and 3.
Step 1. Compute the MASH mixture prior with the ed_bovy workflow of mixture_prior (estimates data-driven covariance components and assembles the prior).
Step 2. Fit the MASH model with the mash workflow of mash_fit, supplying the prior from step 1 together with the matching MASH input and vhat files, and compute posteriors.
Step 3. Visualize the estimated prior covariance matrices with the plot_U workflow of mixture_prior.
sos run pipeline/mixture_prior.ipynb ed_bovy \
    --output-prefix protocol_example \
    --data input/mash/protocol_example.mashr_input.rds \
    --cwd output/mixture_prior

sos run pipeline/mash_fit.ipynb mash \
    --output-prefix protocol_example_mash \
    --data input/mash/protocol_example.EE.mash.rds \
    --vhat-data input/mash/protocol_example.EE.V_simple.rds \
    --prior-data output/mixture_prior/protocol_example.EE.prior.rds \
    --effect-model EE \
    --compute-posterior \
    --cwd output/mash_fit

sos run pipeline/mixture_prior.ipynb plot_U \
    --output-prefix protocol_example_plots \
    --data output/mixture_prior/protocol_example.EE.prior.rds \
    --cwd output/mixture_prior
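The three commands above can be chained so that a failure in one step stops the later ones, which matters because steps 2 and 3 read the prior written by step 1. The following is a minimal sketch rather than part of the protocol: run_protocol and the SOS variable are hypothetical names, and the arguments are copied from the commands above. Setting SOS=echo prints the commands instead of running them.

```shell
# Hypothetical wrapper around the three protocol commands.
# SOS defaults to "sos"; set SOS=echo for a dry run that only prints the commands.
run_protocol() {
    rp_sos=${SOS:-sos}
    # Prior written by step 1 and read by steps 2 and 3.
    rp_prior=output/mixture_prior/protocol_example.EE.prior.rds

    "$rp_sos" run pipeline/mixture_prior.ipynb ed_bovy \
        --output-prefix protocol_example \
        --data input/mash/protocol_example.mashr_input.rds \
        --cwd output/mixture_prior &&
    "$rp_sos" run pipeline/mash_fit.ipynb mash \
        --output-prefix protocol_example_mash \
        --data input/mash/protocol_example.EE.mash.rds \
        --vhat-data input/mash/protocol_example.EE.V_simple.rds \
        --prior-data "$rp_prior" \
        --effect-model EE \
        --compute-posterior \
        --cwd output/mash_fit &&
    "$rp_sos" run pipeline/mixture_prior.ipynb plot_U \
        --output-prefix protocol_example_plots \
        --data "$rp_prior" \
        --cwd output/mixture_prior
}

# Usage:
# run_protocol              # run all three steps
# SOS=echo run_protocol     # dry run
```

The `&&` chaining means step 2 never starts if step 1 exits with an error, so a missing prior is reported at its source rather than as a downstream failure.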
Command interface#
Each step is driven by one of two reusable workflow notebooks. List their full option sets with -h:
sos run pipeline/mixture_prior.ipynb -h
sos run pipeline/mash_fit.ipynb -h
Workflow implementation#
This notebook is a driver/miniprotocol: it does not define its own SoS sections but orchestrates two reusable workflow notebooks via the pipeline/ symlinks:
- pipeline/mixture_prior.ipynb: provides the ed_bovy (mixture-prior estimation) and plot_U (prior visualization) workflows, among other covariance-estimation methods (flash, pca, canonical, ud, vhat_*, …).
- pipeline/mash_fit.ipynb: provides the mash workflow (MASH model fit + posterior computation).
Refer to those notebooks for the underlying R implementation of each step.
Troubleshooting#
| Issue | Cause | Solution |
|---|---|---|
| | | Confirm |
| Step 2 fails on | Step 1 has not been run yet, so the prior/vhat RDS do not exist | Run Step 1 first; it writes the prior into |
| | The workflow name was omitted from | Always pass the explicit workflow ( |
| Empty or tiny plot PDF | | Use the |
Anticipated Results#
Each step writes its files under output/, in the subdirectory given by --cwd (output/mixture_prior for steps 1 and 3, output/mash_fit for step 2). Verify success by checking that the output files exist and are non-empty. See the Output section above for the expected file names and formats.
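The existence and non-empty check can be scripted. The helper below is a sketch under the assumption that any zero-byte file in a step's output directory indicates a failed run; verify_outputs is a hypothetical name, and it does not check file contents.

```shell
# Hypothetical helper: fail if a step's output directory is absent,
# holds no files, or contains any zero-byte file.
verify_outputs() {
    vo_dir=$1
    if [ ! -d "$vo_dir" ]; then
        echo "no such directory: $vo_dir"
        return 1
    fi
    vo_first=$(find "$vo_dir" -type f | head -n 1)
    if [ -z "$vo_first" ]; then
        echo "no files in: $vo_dir"
        return 1
    fi
    vo_empty=$(find "$vo_dir" -type f -size 0)
    if [ -n "$vo_empty" ]; then
        echo "empty files:"
        echo "$vo_empty"
        return 1
    fi
    echo "ok: $vo_dir"
}

# Usage with the --cwd directories from the Steps section:
# verify_outputs output/mixture_prior && verify_outputs output/mash_fit
```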