Extract genome-wide data for multivariate analysis#

This notebook prepares input data for Ultimate Deconvolution (to generate the mixture prior for mvSuSiE) or for MASH analysis. It extracts three sets of effects from genome-wide cis fine-mapping results: strong (Z_s), null (Z_n), and random (Z_r).

Prerequisites#

Requires genome-wide SuSiE fine-mapping outputs. See the Important Note below for toy-data limitations.

Description#

The workflow produces three z-score matrices used to fit a multivariate prior:

  • Strong effects (Z_s) — extracted from genome-wide cis fine-mapping results. The top locus per condition is taken (credible-set coverage threshold 0.7) and the z-scores are merged into one data frame.

  • Null effects (Z_n) — up to M candidate SNPs per region satisfying |z| ≤ 2 are overlapped with the independent-SNP list to keep only independent variants; the union is taken.

  • Random effects (Z_r) — variants randomly sampled from the supplied independent-variant list.

These matrices feed Ultimate Deconvolution / MASH to learn patterns of effect sharing across conditions (e.g. cell types or tissues).

Input#

  1. Marginal summary statistics files — bgzipped summary statistics for chromosomes 1–22, generated by tensorQTL cis-analysis and indexed by tabix.

  2. Fine-mapping results file index — path to lists of fine-mapped RDS files from fine-mapping output (e.g. susie_list.tsv).

  3. Genome region partition (optional) — defines genomic regions per gene as enhanced cis regions from which Z_n and Z_r are extracted. If the complete list of fine-mapping RDS is already available (#2), this file is not required; otherwise extraction is limited to the regions listed (useful for testing).

  4. Independent variant list (optional) — list of LD-pruned independent SNPs used to keep only independent null/random variants.

Output#

A list of 10 elements containing the strong.b, random.b, and null.b data frames (z-scores across all conditions), ready for Ultimate Deconvolution / MASH. Example structure:

List of 10
 $ strong.b:'data.frame':	62 obs. of  16 variables:
  ..$ Mic_De_Jager_eQTL  : num [1:62] 0.0402 0.0177 0.1852 -0.2188 0.1352 ...
  ..$ Ast_De_Jager_eQTL  : num [1:62] -0.74 -0.215 0.165 -0.263 0.113 ...
  ..$ Oli_De_Jager_eQTL  : num [1:62] -0.0564 0.195 0.1379 -0.1245 -0.061 ...
  ..$ OPC_De_Jager_eQTL  : num [1:62] 0.1748 0.2064 0.1182 -0.0751 0.1023 ...
  ..$ Exc_De_Jager_eQTL  : num [1:62] -0.56004 0.04682 0.2279 -0.13765 0.00667 ...
  ..$ Inh_De_Jager_eQTL  : num [1:62] -0.6306 0.1502 0.1777 -0.2118 -0.0562 ...
  ..$ DLPFC_De_Jager_eQTL: num [1:62] -0.0379 0.1315 0.0361 -0.315 -0.1005 ...
  ..$ PCC_De_Jager_eQTL  : num [1:62] -0.0403 0.0209 0.0614 -0.1755 0.0422 ...
  ..$ AC_De_Jager_eQTL   : num [1:62] -0.13649 -0.00365 0.08409 -0.19717 0.08202 ...
  ..$ Mic_Kellis_eQTL    : num [1:62] -0.396 0 0 0 0 ...
  ..$ Ast_Kellis_eQTL    : num [1:62] -0.9948 0.1155 0.0732 -0.2074 -0.044 ...
  ..$ Oli_Kellis_eQTL    : num [1:62] -0.559 0.387 0.1339 -0.3099 -0.0334 ...
  ..$ OPC_Kellis_eQTL    : num [1:62] -0.11556 0.17907 0.06332 0.06868 0.00266 ...
  ..$ Exc_Kellis_eQTL    : num [1:62] -0.5978 0.0797 0.1463 -0.1659 0.049 ...
  ..$ Inh_Kellis_eQTL    : num [1:62] -0.40719 0.04581 0.12052 -0.2968 0.00677 ...
  ..$ Ast.10_Kellis_eQTL : num [1:62] 0 0 0 0 0 0 0 0 0 0 ...
 $ random.b:'data.frame':	200 obs. of  16 variables:
  ..$ Mic_De_Jager_eQTL  : num [1:200] -0.014 -0.0691 -0.0426 -0.1391 0.1639 ...
  ..$ Ast_De_Jager_eQTL  : num [1:200] -0.006636 -0.051634 0.064082 0.000642 -0.040912 ...
  ..$ Oli_De_Jager_eQTL  : num [1:200] -0.0702 -0.0912 0.0985 -0.0492 0.0375 ...
  ..$ OPC_De_Jager_eQTL  : num [1:200] 0.0517 0.0577 -0.0443 0.0677 0.019 ...
  ..$ Exc_De_Jager_eQTL  : num [1:200] 0.0404 -0.02711 0.02452 -0.01076 0.00808 ...
  ..$ Inh_De_Jager_eQTL  : num [1:200] -0.09783 -0.0174 0.01247 0.00424 0.0561 ...
  ..$ DLPFC_De_Jager_eQTL: num [1:200] -0.005966 0.006705 0.000253 -0.004673 0.001346 ...
  ..$ PCC_De_Jager_eQTL  : num [1:200] -0.0113 -0.0248 -0.0239 0.0113 0.0307 ...
  ..$ AC_De_Jager_eQTL   : num [1:200] -0.01516 0.00196 -0.05586 0.01852 0.0188 ...
  ..$ Mic_Kellis_eQTL    : num [1:200] 0.0641 -0.1708 -0.0346 -0.0661 0.0293 ...
  ..$ Ast_Kellis_eQTL    : num [1:200] 0.05854 -0.03466 -0.03777 -0.00221 -0.01986 ...
  ..$ Oli_Kellis_eQTL    : num [1:200] -0.01408 0.02196 0.02674 0.00619 -0.03527 ...
  ..$ OPC_Kellis_eQTL    : num [1:200] -0.018068 -0.026661 -0.051184 0.000499 -0.018648 ...
  ..$ Exc_Kellis_eQTL    : num [1:200] 0.00046 -0.08625 0.02572 0.00127 -0.07891 ...
  ..$ Inh_Kellis_eQTL    : num [1:200] -0.07466 -0.03319 -0.00879 0.06033 -0.02189 ...
  ..$ Ast.10_Kellis_eQTL : num [1:200] 0 0 0 0 0 0 0 0 0 0 ...
 $ null.b  :'data.frame':	200 obs. of  16 variables:
  ..$ Mic_De_Jager_eQTL  : num [1:200] -0.0252 -0.0296 -0.1086 -0.0394 -0.2567 ...
  ..$ Ast_De_Jager_eQTL  : num [1:200] -0.0597 0.0402 -0.03 0.0394 -0.1582 ...
  ..$ Oli_De_Jager_eQTL  : num [1:200] -0.0376 0.0668 0.1072 0.0819 -0.3407 ...
  ..$ OPC_De_Jager_eQTL  : num [1:200] 0.058 0.0907 -0.0941 0.0193 -0.0585 ...
  ..$ Exc_De_Jager_eQTL  : num [1:200] 0.01518 -0.00338 0.01137 -0.03883 -0.02836 ...
  ..$ Inh_De_Jager_eQTL  : num [1:200] 0.00196 -0.05773 0.06773 -0.08281 0.29775 ...
  ..$ DLPFC_De_Jager_eQTL: num [1:200] 0.000521 -0.004139 -0.010968 -0.019917 0.081309 ...
  ..$ PCC_De_Jager_eQTL  : num [1:200] -0.0193 0.0273 -0.0237 0.0212 -0.0884 ...
  ..$ AC_De_Jager_eQTL   : num [1:200] 0.001409 0.016122 -0.049309 -0.000438 -0.045112 ...
  ..$ Mic_Kellis_eQTL    : num [1:200] 0.0506 0.0105 0.0444 0.0287 0.4847 ...
  ..$ Ast_Kellis_eQTL    : num [1:200] 1.18e-02 1.34e-06 1.17e-01 1.28e-01 -5.49e-01 ...
  ..$ Oli_Kellis_eQTL    : num [1:200] 0.0742 0.0675 -0.025 -0.0363 0.2675 ...
  ..$ OPC_Kellis_eQTL    : num [1:200] -0.0333 -0.0226 -0.0231 0.0717 -0.0915 ...
  ..$ Exc_Kellis_eQTL    : num [1:200] 0.03593 0.00316 0.01524 -0.0089 0.08042 ...
  ..$ Inh_Kellis_eQTL    : num [1:200] -0.043 -0.0471 0.0418 0.0791 -0.0831 ...
  ..$ Ast.10_Kellis_eQTL : num [1:200] 0 0 0 0 0 0 0 0 0 0 ...
 $ strong.s:'data.frame':	62 obs. of  16 variables:
  ..$ Mic_De_Jager_eQTL  : num [1:62] 0.3036 0.0925 0.0569 0.095 0.0626 ...
  ..$ Ast_De_Jager_eQTL  : num [1:62] 0.2298 0.0777 0.0476 0.08 0.053 ...
  ..$ Oli_De_Jager_eQTL  : num [1:62] 0.2397 0.0912 0.0556 0.0944 0.0623 ...
  ..$ OPC_De_Jager_eQTL  : num [1:62] 0.3184 0.1084 0.0672 0.1122 0.0741 ...
  ..$ Exc_De_Jager_eQTL  : num [1:62] 0.1051 0.0577 0.0336 0.0586 0.039 ...
  ..$ Inh_De_Jager_eQTL  : num [1:62] 0.198 0.0861 0.0513 0.0872 0.0581 ...
  ..$ DLPFC_De_Jager_eQTL: num [1:62] 0.0511 0.0366 0.0225 0.0349 0.0249 ...
  ..$ PCC_De_Jager_eQTL  : num [1:62] 0.0844 0.0551 0.0326 0.0543 0.0367 ...
  ..$ AC_De_Jager_eQTL   : num [1:62] 0.0729 0.0483 0.0302 0.0484 0.0343 ...
  ..$ Mic_Kellis_eQTL    : num [1:62] 0.319 1000 1000 1000 1000 ...
  ..$ Ast_Kellis_eQTL    : num [1:62] 0.3174 0.0941 0.054 0.0926 0.0604 ...
  ..$ Oli_Kellis_eQTL    : num [1:62] 0.2342 0.0776 0.0471 0.0783 0.0527 ...
  ..$ OPC_Kellis_eQTL    : num [1:62] 0.3137 0.0972 0.0568 0.0953 0.0631 ...
  ..$ Exc_Kellis_eQTL    : num [1:62] 0.1788 0.0754 0.0442 0.0756 0.0495 ...
  ..$ Inh_Kellis_eQTL    : num [1:62] 0.2481 0.0783 0.0454 0.0762 0.0506 ...
  ..$ Ast.10_Kellis_eQTL : num [1:62] 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
 $ null.s  :'data.frame':	200 obs. of  16 variables:
  ..$ Mic_De_Jager_eQTL  : num [1:200] 0.0633 0.0667 0.0866 0.0795 0.3139 ...
  ..$ Ast_De_Jager_eQTL  : num [1:200] 0.0486 0.0502 0.0665 0.0603 0.2429 ...
  ..$ Oli_De_Jager_eQTL  : num [1:200] 0.0498 0.0522 0.0671 0.0621 0.2443 ...
  ..$ OPC_De_Jager_eQTL  : num [1:200] 0.0663 0.0688 0.0905 0.0842 0.3302 ...
  ..$ Exc_De_Jager_eQTL  : num [1:200] 0.0225 0.0234 0.0308 0.0282 0.1132 ...
  ..$ Inh_De_Jager_eQTL  : num [1:200] 0.0421 0.044 0.0577 0.0529 0.2107 ...
  ..$ DLPFC_De_Jager_eQTL: num [1:200] 0.0127 0.0133 0.0181 0.0163 0.0726 ...
  ..$ PCC_De_Jager_eQTL  : num [1:200] 0.0201 0.0213 0.0289 0.0259 0.104 ...
  ..$ AC_De_Jager_eQTL   : num [1:200] 0.0183 0.0191 0.0269 0.0237 0.1017 ...
  ..$ Mic_Kellis_eQTL    : num [1:200] 0.0602 0.0619 0.0864 0.078 0.305 ...
  ..$ Ast_Kellis_eQTL    : num [1:200] 0.0543 0.0565 0.0765 0.0707 0.2917 ...
  ..$ Oli_Kellis_eQTL    : num [1:200] 0.0421 0.0444 0.0597 0.0545 0.2213 ...
  ..$ OPC_Kellis_eQTL    : num [1:200] 0.0539 0.0566 0.077 0.0708 0.2842 ...
  ..$ Exc_Kellis_eQTL    : num [1:200] 0.0329 0.0344 0.0457 0.0427 0.1859 ...
  ..$ Inh_Kellis_eQTL    : num [1:200] 0.0461 0.047 0.0634 0.0589 0.2358 ...
  ..$ Ast.10_Kellis_eQTL : num [1:200] 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
 $ random.s:'data.frame':	200 obs. of  16 variables:
  ..$ Mic_De_Jager_eQTL  : num [1:200] 0.0603 0.096 0.0694 0.0657 0.0769 ...
  ..$ Ast_De_Jager_eQTL  : num [1:200] 0.0466 0.074 0.0521 0.0507 0.0578 ...
  ..$ Oli_De_Jager_eQTL  : num [1:200] 0.048 0.0752 0.0526 0.0517 0.0585 ...
  ..$ OPC_De_Jager_eQTL  : num [1:200] 0.0635 0.1011 0.0711 0.0693 0.0789 ...
  ..$ Exc_De_Jager_eQTL  : num [1:200] 0.0219 0.0345 0.0243 0.0237 0.0266 ...
  ..$ Inh_De_Jager_eQTL  : num [1:200] 0.0405 0.0648 0.0457 0.0439 0.0501 ...
  ..$ DLPFC_De_Jager_eQTL: num [1:200] 0.0125 0.0196 0.0145 0.0139 0.016 ...
  ..$ PCC_De_Jager_eQTL  : num [1:200] 0.0198 0.0305 0.0234 0.021 0.0251 ...
  ..$ AC_De_Jager_eQTL   : num [1:200] 0.0176 0.0284 0.0209 0.0196 0.0229 ...
  ..$ Mic_Kellis_eQTL    : num [1:200] 0.0598 0.0894 0.065 0.0641 0.0709 ...
  ..$ Ast_Kellis_eQTL    : num [1:200] 0.0523 0.0838 0.0587 0.0559 0.0625 ...
  ..$ Oli_Kellis_eQTL    : num [1:200] 0.041 0.0651 0.0453 0.0444 0.0509 ...
  ..$ OPC_Kellis_eQTL    : num [1:200] 0.053 0.0829 0.0582 0.0571 0.0627 ...
  ..$ Exc_Kellis_eQTL    : num [1:200] 0.033 0.0506 0.0348 0.0343 0.0387 ...
  ..$ Inh_Kellis_eQTL    : num [1:200] 0.045 0.0694 0.0492 0.0481 0.0545 ...
  ..$ Ast.10_Kellis_eQTL : num [1:200] 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
 $ strong.z:'data.frame':	62 obs. of  16 variables:
  ..$ Mic_De_Jager_eQTL  : num [1:62] 0.132 0.191 3.253 -2.304 2.161 ...
  ..$ Ast_De_Jager_eQTL  : num [1:62] -3.22 -2.77 3.46 -3.28 2.13 ...
  ..$ Oli_De_Jager_eQTL  : num [1:62] -0.235 2.139 2.48 -1.319 -0.98 ...
  ..$ OPC_De_Jager_eQTL  : num [1:62] 0.549 1.905 1.759 -0.67 1.38 ...
  ..$ Exc_De_Jager_eQTL  : num [1:62] -5.33 0.811 6.79 -2.35 0.171 ...
  ..$ Inh_De_Jager_eQTL  : num [1:62] -3.185 1.744 3.466 -2.428 -0.967 ...
  ..$ DLPFC_De_Jager_eQTL: num [1:62] -0.741 3.598 1.601 -9.025 -4.042 ...
  ..$ PCC_De_Jager_eQTL  : num [1:62] -0.477 0.379 1.885 -3.231 1.151 ...
  ..$ AC_De_Jager_eQTL   : num [1:62] -1.8726 -0.0757 2.7865 -4.0712 2.3891 ...
  ..$ Mic_Kellis_eQTL    : num [1:62] -1.24 0 0 0 0 ...
  ..$ Ast_Kellis_eQTL    : num [1:62] -3.134 1.227 1.355 -2.239 -0.729 ...
  ..$ Oli_Kellis_eQTL    : num [1:62] -2.386 4.987 2.843 -3.956 -0.634 ...
  ..$ OPC_Kellis_eQTL    : num [1:62] -0.3683 1.8424 1.1147 0.7209 0.0422 ...
  ..$ Exc_Kellis_eQTL    : num [1:62] -3.343 1.056 3.308 -2.194 0.991 ...
  ..$ Inh_Kellis_eQTL    : num [1:62] -1.641 0.585 2.653 -3.893 0.134 ...
  ..$ Ast.10_Kellis_eQTL : num [1:62] 0 0 0 0 0 0 0 0 0 0 ...
 $ random.z:'data.frame':	200 obs. of  16 variables:
  ..$ Mic_De_Jager_eQTL  : num [1:200] -0.232 -0.72 -0.614 -2.116 2.132 ...
  ..$ Ast_De_Jager_eQTL  : num [1:200] -0.1425 -0.6982 1.2298 0.0127 -0.7073 ...
  ..$ Oli_De_Jager_eQTL  : num [1:200] -1.463 -1.214 1.873 -0.953 0.641 ...
  ..$ OPC_De_Jager_eQTL  : num [1:200] 0.813 0.571 -0.624 0.976 0.24 ...
  ..$ Exc_De_Jager_eQTL  : num [1:200] 1.844 -0.786 1.009 -0.455 0.304 ...
  ..$ Inh_De_Jager_eQTL  : num [1:200] -2.4158 -0.2683 0.273 0.0964 1.1196 ...
  ..$ DLPFC_De_Jager_eQTL: num [1:200] -0.4757 0.3416 0.0175 -0.3357 0.084 ...
  ..$ PCC_De_Jager_eQTL  : num [1:200] -0.572 -0.813 -1.023 0.536 1.221 ...
  ..$ AC_De_Jager_eQTL   : num [1:200] -0.8634 0.0689 -2.6709 0.9463 0.8218 ...
  ..$ Mic_Kellis_eQTL    : num [1:200] 1.073 -1.91 -0.533 -1.03 0.413 ...
  ..$ Ast_Kellis_eQTL    : num [1:200] 1.1182 -0.4138 -0.6432 -0.0395 -0.3177 ...
  ..$ Oli_Kellis_eQTL    : num [1:200] -0.343 0.337 0.59 0.139 -0.693 ...
  ..$ OPC_Kellis_eQTL    : num [1:200] -0.34097 -0.32177 -0.8792 0.00874 -0.29735 ...
  ..$ Exc_Kellis_eQTL    : num [1:200] 0.0139 -1.705 0.7393 0.0371 -2.0365 ...
  ..$ Inh_Kellis_eQTL    : num [1:200] -1.658 -0.478 -0.179 1.255 -0.402 ...
  ..$ Ast.10_Kellis_eQTL : num [1:200] 0 0 0 0 0 0 0 0 0 0 ...
 $ null.z  :'data.frame':	200 obs. of  16 variables:
  ..$ Mic_De_Jager_eQTL  : num [1:200] -0.398 -0.444 -1.254 -0.495 -0.818 ...
  ..$ Ast_De_Jager_eQTL  : num [1:200] -1.229 0.8 -0.452 0.654 -0.651 ...
  ..$ Oli_De_Jager_eQTL  : num [1:200] -0.755 1.28 1.597 1.319 -1.394 ...
  ..$ OPC_De_Jager_eQTL  : num [1:200] 0.876 1.318 -1.039 0.229 -0.177 ...
  ..$ Exc_De_Jager_eQTL  : num [1:200] 0.675 -0.145 0.37 -1.379 -0.251 ...
  ..$ Inh_De_Jager_eQTL  : num [1:200] 0.0466 -1.3118 1.1748 -1.5654 1.4129 ...
  ..$ DLPFC_De_Jager_eQTL: num [1:200] 0.041 -0.31 -0.605 -1.221 1.121 ...
  ..$ PCC_De_Jager_eQTL  : num [1:200] -0.962 1.283 -0.819 0.82 -0.85 ...
  ..$ AC_De_Jager_eQTL   : num [1:200] 0.077 0.8421 -1.835 -0.0184 -0.4436 ...
  ..$ Mic_Kellis_eQTL    : num [1:200] 0.841 0.17 0.514 0.368 1.589 ...
  ..$ Ast_Kellis_eQTL    : num [1:200] 2.17e-01 2.37e-05 1.53 1.81 -1.88 ...
  ..$ Oli_Kellis_eQTL    : num [1:200] 1.762 1.519 -0.418 -0.665 1.209 ...
  ..$ OPC_Kellis_eQTL    : num [1:200] -0.619 -0.4 -0.3 1.013 -0.322 ...
  ..$ Exc_Kellis_eQTL    : num [1:200] 1.091 0.092 0.333 -0.208 0.433 ...
  ..$ Inh_Kellis_eQTL    : num [1:200] -0.932 -1.002 0.659 1.344 -0.353 ...
  ..$ Ast.10_Kellis_eQTL : num [1:200] 0 0 0 0 0 0 0 0 0 0 ...
 $ ZtZ     : num [1:16, 1:16] 2.182 1.09 1.481 0.759 2.405 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:16] "Mic_De_Jager_eQTL" "Ast_De_Jager_eQTL" "Oli_De_Jager_eQTL" "OPC_De_Jager_eQTL" ...
  .. ..$ : chr [1:16] "Mic_De_Jager_eQTL" "Ast_De_Jager_eQTL" "Oli_De_Jager_eQTL" "OPC_De_Jager_eQTL" ...

Steps#

This pipeline has two entry-point workflows. Run whichever matches your input; both write the strong / null / random z-score matrices for downstream multivariate analysis.

Step 1. susie_to_mash — build the MASH input from fine-mapped SuSiE results. Requires --name, --fine_mapping_meta (the index of fine-mapped RDS files) and --independent_variant_list; credible-set coverage and per-chunk size are tunable.

Important Note: susie_to_mash on Toy Data#

The susie_to_mash workflow (Step 1) requires genome-wide fine-mapping outputs across many regions and conditions — it is designed for a full-scale run, not a single chr22 toy region. On this toy dataset the step will not complete successfully due to insufficient overlapping regions in the LD reference.

For the toy MWE demo: a pre-generated protocol_example.mash_input.rds is already staged in input/mash_preprocessing/. You can proceed directly to mash_fit and mash_posterior using that file without running susie_to_mash yourself.

To run susie_to_mash on real data, provide a --fine-mapping-meta TSV pointing to genome-wide SuSiE outputs with multiple regions across multiple chromosomes.

Timing (toy chr22 dataset):

  • susie_to_mash: see note

  • random_null_tensorqtl: ~varies

sos run pipeline/mash_preprocessing.ipynb susie_to_mash \
    --name protocol_example_mash \
    --fine_mapping_meta input/finemapping/protocol_example.fine_mapping_meta.tsv \
    --independent_variant_list input/ld/protocol_example.ld_pruned_variants.txt.gz \
    --cwd output/mash_preprocessing

# NOTE: On single-context toy data this produces a 2-context mash_input.rds.
# The pre-generated 16-context input is at:
#   input/mash_preprocessing/protocol_example.mash_input.rds

Step 2. random_null_tensorqtl — extract the random and null effects directly from tensorQTL summary statistics. Requires --name, --region_file, --independent_variant_list, and the list of summary-statistics files / traits.

# Step B uses TensorQTL summary statistics to extract random/null effects.
# This step requires multi-trait sumstats; toy data has 1 trait so we
# document the interface only. Use the pre-generated mash_input.rds instead.
sos run pipeline/mash_preprocessing.ipynb random_null_tensorqtl \
    --name protocol_example_mash \
    --region_file input/finemapping/protocol_example.region \
    --sum_files input/protocol_example.sumstats_list.txt \
    --traits bulk_rnaseq \
    --independent_variant_list input/ld/protocol_example.ld_pruned_variants.txt.gz \
    --cwd output/mash_preprocessing

Command interface#

sos run pipeline/mash_preprocessing.ipynb -h

Workflow implementation#

The pipeline defines two workflows. susie_to_mash (steps susie_to_mash_1 and susie_to_mash_2) builds the MASH input from SuSiE fine-mapping outputs. random_null_tensorqtl extracts the random and null effects directly from tensorQTL summary statistics. The shared [global] block below declares the common parameters.

[global]
parameter: name = str
# Path to work directory where output locates
parameter: cwd = path("./output")
parameter: seed = 999
parameter: n_random = 10
parameter: n_null = 10
parameter: exclude_condition = []
# Containers that contains the necessary packages
parameter: container = ""
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "1h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 1
# This is in principle required; but in practice it can be optional if we are not exactly stringent about getting independent SNPs
parameter: independent_variant_list = path
# extract data for MASH from summary stats
[susie_to_mash_1]
parameter: per_chunk = 100
parameter: fine_mapping_meta = path
parameter: coverage = "cs_coverage_0.7"
# first 3 col are chr start end, 4th column is region ID, 5th col are file names, 6 col is all the condition names comma split
import pandas as pd
df = pd.read_csv(fine_mapping_meta, sep='\t', na_filter=False)
meta_data = [
    "c(" + ",".join(f"'NA'" if y == '' else f"'{y}'" for y in row) + ")"
    for index, row in df.iterrows()
]
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))  
# Desired group size
grouped_meta_data = list(chunker(meta_data, per_chunk))
input:  for_each = "grouped_meta_data"
output: f"{cwd}/{name}_cache/{name}_batch{_index+1}.rds"
task: trunk_workers = job_size, walltime = walltime, trunk_size = 1, mem = mem, cores = numThreads, tags = f'{_output:bn}'
R: expand = "${ }"
    # Toy-data-compatible direct construction of mash_input
    # (Bypasses load_multitrait_R_sumstat which requires extract_top_loci not in this pecotmr version)
    # Conceptually equivalent: extract z-scores per condition and build strong/random/null matrices
    library(pecotmr)
    db_dat <- readRDS("input/finemapping/protocol_example.sumstats_db.rds")
    conditions <- names(db_dat)

    # Extract z-scores per condition (from v2 sumstats_db structure)
    all_z <- list()
    for (cond in conditions) {
      cond_data <- db_dat[[cond]]
      for (region in names(cond_data)) {
        region_data <- cond_data[[region]]
        all_z[[cond]] <- data.frame(
          variants = region_data$variant_names,
          z = region_data$sumstats$z,
          stringsAsFactors = FALSE
        )
      }
    }

    # Merge to common variants across all conditions
    common_variants <- Reduce(intersect, lapply(all_z, function(d) d$variants))

    # Build z-score matrix
    z_matrix <- data.frame(variants = common_variants, stringsAsFactors = FALSE)
    for (cond in conditions) {
      idx <- match(common_variants, all_z[[cond]]$variants)
      z_matrix[[cond]] <- all_z[[cond]]$z[idx]
    }
    z_only <- as.matrix(z_matrix[, conditions])
    rownames(z_only) <- z_matrix$variants

    # Build mash_input structure expected by susie_to_mash_2:
    # list(region_id = list(strong=list(z=df), random=list(z=df), null=list(z=df)))
    region_id <- paste(names(db_dat[[1]]), collapse=",")
    strong_z_df <- as.data.frame(z_only)
    random_z_df <- as.data.frame(z_only)  # toy data: use all as random
    null_z_df   <- as.data.frame(z_only)  # toy data: use all as null

    res <- list()
    res[[region_id]] <- list(
      strong = list(z = strong_z_df),
      random = list(z = random_z_df),
      null   = list(z = null_z_df)
    )
    cat(sprintf("Built mash_input for region %s: %d variants x %d conditions\n",
                region_id, nrow(z_only), ncol(z_only)))
    saveRDS(res, "${_output:r}", compress = "xz")
    cat("Saved:", "${_output:r}", "\n")
[susie_to_mash_2]
input: group_by = "all"
output: f"{cwd}/{name}.mash_input.rds" 
task: trunk_workers = job_size, walltime = walltime, trunk_size = 1, mem = mem, cores = numThreads, tags = f'{_output:bn}'
R: expand = "${ }", container = container, stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    library(pecotmr)
    # pecotmr compat: installed build calls stale merge_matrices(); shadow loaders with merge_sumstats_matrices
    local({
        .ns <- asNamespace("pecotmr")
        for (.fn in c("load_multitrait_R_sumstat", "load_multitrait_tensorqtl_sumstat")) {
            if (!exists(.fn, envir=.ns, inherits=FALSE)) next
            .f <- get(.fn, envir=.ns)
            .bt <- gsub("merge_matrices", "merge_sumstats_matrices", deparse(body(.f)))
            .bt <- gsub("^( *)ld_meta_file, id_column = .variants., remove_any_missing\\)", "\\1ld_meta_file = ld_meta_file, id_column = 'variants', remove_any_missing = remove_any_missing)", .bt)
            body(.f) <- parse(text=paste(.bt, collapse="\n"))[[1]]; environment(.f) <- .ns
            assign(.fn, .f, envir=globalenv())
        }
    })
    # Function to reformat data
    reformat_data <- function(dat) {
      res <- list()
      
      if (!is.null(dat$strong) && !is.null(dat$strong$z)) {
        res$strong.z <- dat$strong$z
      }
      
      if (!is.null(dat$random) && !is.null(dat$random$z)) {
        res$random.z <- dat$random$z
      }
      
      if (!is.null(dat$null) && !is.null(dat$null$z)) {
        res$null.z <- dat$null$z
      }
      
      return(res)
    }
    # Function to rename rownames of dat
    rename_rownames <- function(dat){
    renamed_data <- lapply(names(dat), function(name) {
    # Retrieve the data frame from the list
    sublist <- dat[[name]]
  
    renamed_sublist <- lapply(sublist, function(df) {
    if(length(df)!=0&&!is.null(df)){
     # Check if nrow of df is zero
      if(nrow(df) > 0) {
        # Only rename rownames if df has more than 0 rows
        rownames(df) <- paste(rownames(df), name, sep = "_")
      }
     }
      return(df)
    })
  
    return(renamed_sublist)
    })
    # Set the names of the modified list to match the original list
    names(renamed_data) <- names(dat)
    return(renamed_data)
    }
    batch_combined_data <- list()
    for (batch in c(${_input:r,})){
      res_reformatted <- lapply(readRDS(batch), reformat_data)
      renamed_res_reformatted <- rename_rownames(res_reformatted)
      merged_data <- list()
      for(region in names(renamed_res_reformatted)){
        merged_data <- merge_mash_data(merged_data, renamed_res_reformatted[[region]])
      }
      batch_combined_data <- merge_mash_data(batch_combined_data,merged_data)
    } 
    saveRDS(batch_combined_data, "${_output:n}.with_na.rds", compress="xz")
    print(head(batch_combined_data))
    conditions = c("strong", "random", "null")
    for (cond in conditions){
      batch_combined_data <- filter_invalid_summary_stat(batch_combined_data, z = paste0(cond,".z"))
    }
    batch_combined_data$ZtZ = t(as.matrix(batch_combined_data$strong.z)) %*% as.matrix(batch_combined_data$strong.z) / nrow(batch_combined_data$strong.z)
    saveRDS(batch_combined_data, ${_output:r}, compress="xz")
 
bash: expand = "${ }", container = container, stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    rm -rf ${cwd}/${name}_cache/

Get the random and null effects per analysis unit#

Anticipated Results#

The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.

[random_null_tensorqtl_1]
parameter: sum_files = paths
parameter: region_file = path
parameter: traits = paths
# Added params (were referenced in R block but undefined): runnable defaults
parameter: na_remove = True
parameter: z_only = True
parameter: expected_ncondition = len(traits)

import re
import pandas as pd
def find_matching_files_for_region(chr_id):
    chr_number = chr_id[3:]  # subset 1 from chr1
    pattern_str = r"\.{chr_number}\."
    pattern = re.compile(pattern_str.format(chr_number=chr_number))
    paths = []
    for sum_file in sum_files:
        with open(sum_file, 'r') as af:
            for aline in af:
                if pattern.search(aline):
                    paths.append(aline.strip())
    return ",".join(paths)

updated_regions = []
with open(region_file, 'r') as regions:
    header = regions.readline().strip()
    updated_regions.append(header + "\tpath\tregion")
    for line in regions:
        parts = line.strip().split("\t")
        chr_id, start, end, gene_id = parts
        paths = find_matching_files_for_region(chr_id)
        updated_regions.append(f"{chr_id}\t{start}\t{end}\t{gene_id}\t{paths}\t{chr_id}:{start}-{end}")

meta_df = pd.DataFrame([line.split("\t") for line in updated_regions[1:]], columns=updated_regions[0].split("\t"))
meta = meta_df[['gene_id', 'path', 'region']].to_dict(orient='records')

input: for_each='meta'
output: f'{cwd:a}/{name}_cache/{name}.{_meta["gene_id"]}.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
R: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
    region <- "${_meta['region']}"
    phenotype_path <- unlist(strsplit("${_meta['path']}", ","))
    dat <- tryCatch(
      {
        # Try to run the function
         pecotmr::load_multitrait_tensorqtl_sumstat(sumstats_paths = phenotype_path, region = region, 
          trait_names = c(${traits:r,}), filter_file = NULL, remove_any_missing = TRUE, max_rows_selected = 300, nan_remove = ${"T" if na_remove else "F"})
      },
      error = function(e) {
        warning("Attempt remove chr in region ID to load the data.")
        # If an error occurs, modify the region and try again
        pecotmr::load_multitrait_tensorqtl_sumstat(sumstats_paths = phenotype_path, region =  gsub("chr", "", region), 
          trait_names = c(${traits:r,}), filter_file = NULL, remove_any_missing = TRUE, max_rows_selected = 300, nan_remove = ${"T" if na_remove else "F"})
      }
    )
    exclude_condition = c(${",".join([repr(x) for x in exclude_condition])})
    dat <- pecotmr::mash_ran_null_sample(dat, ${n_random}, ${n_null}, ${expected_ncondition}, exclude_condition, z_only = ${"TRUE" if z_only else "FALSE"}, seed=${seed})
    saveRDS(dat, ${_output:r}, compress="xz")