Extract genome-wide data for multivariate analysis#
This notebook prepares input data for Ultimate Deconvolution (to generate the mixture prior for mvSuSiE) or for MASH analysis. It extracts three sets of effects from genome-wide cis fine-mapping results: strong (Z_s), null (Z_n), and random (Z_r).
Prerequisites#
Requires genome-wide SuSiE fine-mapping outputs. See the Important Note below for toy-data limitations.
Description#
The workflow produces three z-score matrices used to fit a multivariate prior:
Strong effects (
Z_s) — extracted from genome-wide cis fine-mapping results. The top locus per condition is taken (credible-set coverage threshold 0.7) and the z-scores are merged into one data frame.Null effects (
Z_n) — up toMcandidate SNPs per region satisfying |z| ≤ 2 are overlapped with the independent-SNP list to keep only independent variants; the union is taken.Random effects (
Z_r) — variants randomly sampled from the supplied independent-variant list.
These matrices feed Ultimate Deconvolution / MASH to learn patterns of effect sharing across conditions (e.g. cell types or tissues).
Input#
Marginal summary statistics files — bgzipped summary statistics for chromosomes 1–22, generated by tensorQTL cis-analysis and indexed by
tabix.Fine-mapping results file index — path to lists of fine-mapped RDS files from fine-mapping output (e.g.
susie_list.tsv).Genome region partition (optional) — defines genomic regions per gene as enhanced cis regions from which
Z_nandZ_rare extracted. If the complete list of fine-mapping RDS is already available (#2), this file is not required; otherwise extraction is limited to the regions listed (useful for testing).Independent variant list (optional) — list of LD-pruned independent SNPs used to keep only independent null/random variants.
Output#
A list of 10 elements containing the strong.b, random.b, and null.b data frames (z-scores across all conditions), ready for Ultimate Deconvolution / MASH. Example structure:
List of 10
$ strong.b:'data.frame': 62 obs. of 16 variables:
..$ Mic_De_Jager_eQTL : num [1:62] 0.0402 0.0177 0.1852 -0.2188 0.1352 ...
..$ Ast_De_Jager_eQTL : num [1:62] -0.74 -0.215 0.165 -0.263 0.113 ...
..$ Oli_De_Jager_eQTL : num [1:62] -0.0564 0.195 0.1379 -0.1245 -0.061 ...
..$ OPC_De_Jager_eQTL : num [1:62] 0.1748 0.2064 0.1182 -0.0751 0.1023 ...
..$ Exc_De_Jager_eQTL : num [1:62] -0.56004 0.04682 0.2279 -0.13765 0.00667 ...
..$ Inh_De_Jager_eQTL : num [1:62] -0.6306 0.1502 0.1777 -0.2118 -0.0562 ...
..$ DLPFC_De_Jager_eQTL: num [1:62] -0.0379 0.1315 0.0361 -0.315 -0.1005 ...
..$ PCC_De_Jager_eQTL : num [1:62] -0.0403 0.0209 0.0614 -0.1755 0.0422 ...
..$ AC_De_Jager_eQTL : num [1:62] -0.13649 -0.00365 0.08409 -0.19717 0.08202 ...
..$ Mic_Kellis_eQTL : num [1:62] -0.396 0 0 0 0 ...
..$ Ast_Kellis_eQTL : num [1:62] -0.9948 0.1155 0.0732 -0.2074 -0.044 ...
..$ Oli_Kellis_eQTL : num [1:62] -0.559 0.387 0.1339 -0.3099 -0.0334 ...
..$ OPC_Kellis_eQTL : num [1:62] -0.11556 0.17907 0.06332 0.06868 0.00266 ...
..$ Exc_Kellis_eQTL : num [1:62] -0.5978 0.0797 0.1463 -0.1659 0.049 ...
..$ Inh_Kellis_eQTL : num [1:62] -0.40719 0.04581 0.12052 -0.2968 0.00677 ...
..$ Ast.10_Kellis_eQTL : num [1:62] 0 0 0 0 0 0 0 0 0 0 ...
$ random.b:'data.frame': 200 obs. of 16 variables:
..$ Mic_De_Jager_eQTL : num [1:200] -0.014 -0.0691 -0.0426 -0.1391 0.1639 ...
..$ Ast_De_Jager_eQTL : num [1:200] -0.006636 -0.051634 0.064082 0.000642 -0.040912 ...
..$ Oli_De_Jager_eQTL : num [1:200] -0.0702 -0.0912 0.0985 -0.0492 0.0375 ...
..$ OPC_De_Jager_eQTL : num [1:200] 0.0517 0.0577 -0.0443 0.0677 0.019 ...
..$ Exc_De_Jager_eQTL : num [1:200] 0.0404 -0.02711 0.02452 -0.01076 0.00808 ...
..$ Inh_De_Jager_eQTL : num [1:200] -0.09783 -0.0174 0.01247 0.00424 0.0561 ...
..$ DLPFC_De_Jager_eQTL: num [1:200] -0.005966 0.006705 0.000253 -0.004673 0.001346 ...
..$ PCC_De_Jager_eQTL : num [1:200] -0.0113 -0.0248 -0.0239 0.0113 0.0307 ...
..$ AC_De_Jager_eQTL : num [1:200] -0.01516 0.00196 -0.05586 0.01852 0.0188 ...
..$ Mic_Kellis_eQTL : num [1:200] 0.0641 -0.1708 -0.0346 -0.0661 0.0293 ...
..$ Ast_Kellis_eQTL : num [1:200] 0.05854 -0.03466 -0.03777 -0.00221 -0.01986 ...
..$ Oli_Kellis_eQTL : num [1:200] -0.01408 0.02196 0.02674 0.00619 -0.03527 ...
..$ OPC_Kellis_eQTL : num [1:200] -0.018068 -0.026661 -0.051184 0.000499 -0.018648 ...
..$ Exc_Kellis_eQTL : num [1:200] 0.00046 -0.08625 0.02572 0.00127 -0.07891 ...
..$ Inh_Kellis_eQTL : num [1:200] -0.07466 -0.03319 -0.00879 0.06033 -0.02189 ...
..$ Ast.10_Kellis_eQTL : num [1:200] 0 0 0 0 0 0 0 0 0 0 ...
$ null.b :'data.frame': 200 obs. of 16 variables:
..$ Mic_De_Jager_eQTL : num [1:200] -0.0252 -0.0296 -0.1086 -0.0394 -0.2567 ...
..$ Ast_De_Jager_eQTL : num [1:200] -0.0597 0.0402 -0.03 0.0394 -0.1582 ...
..$ Oli_De_Jager_eQTL : num [1:200] -0.0376 0.0668 0.1072 0.0819 -0.3407 ...
..$ OPC_De_Jager_eQTL : num [1:200] 0.058 0.0907 -0.0941 0.0193 -0.0585 ...
..$ Exc_De_Jager_eQTL : num [1:200] 0.01518 -0.00338 0.01137 -0.03883 -0.02836 ...
..$ Inh_De_Jager_eQTL : num [1:200] 0.00196 -0.05773 0.06773 -0.08281 0.29775 ...
..$ DLPFC_De_Jager_eQTL: num [1:200] 0.000521 -0.004139 -0.010968 -0.019917 0.081309 ...
..$ PCC_De_Jager_eQTL : num [1:200] -0.0193 0.0273 -0.0237 0.0212 -0.0884 ...
..$ AC_De_Jager_eQTL : num [1:200] 0.001409 0.016122 -0.049309 -0.000438 -0.045112 ...
..$ Mic_Kellis_eQTL : num [1:200] 0.0506 0.0105 0.0444 0.0287 0.4847 ...
..$ Ast_Kellis_eQTL : num [1:200] 1.18e-02 1.34e-06 1.17e-01 1.28e-01 -5.49e-01 ...
..$ Oli_Kellis_eQTL : num [1:200] 0.0742 0.0675 -0.025 -0.0363 0.2675 ...
..$ OPC_Kellis_eQTL : num [1:200] -0.0333 -0.0226 -0.0231 0.0717 -0.0915 ...
..$ Exc_Kellis_eQTL : num [1:200] 0.03593 0.00316 0.01524 -0.0089 0.08042 ...
..$ Inh_Kellis_eQTL : num [1:200] -0.043 -0.0471 0.0418 0.0791 -0.0831 ...
..$ Ast.10_Kellis_eQTL : num [1:200] 0 0 0 0 0 0 0 0 0 0 ...
$ strong.s:'data.frame': 62 obs. of 16 variables:
..$ Mic_De_Jager_eQTL : num [1:62] 0.3036 0.0925 0.0569 0.095 0.0626 ...
..$ Ast_De_Jager_eQTL : num [1:62] 0.2298 0.0777 0.0476 0.08 0.053 ...
..$ Oli_De_Jager_eQTL : num [1:62] 0.2397 0.0912 0.0556 0.0944 0.0623 ...
..$ OPC_De_Jager_eQTL : num [1:62] 0.3184 0.1084 0.0672 0.1122 0.0741 ...
..$ Exc_De_Jager_eQTL : num [1:62] 0.1051 0.0577 0.0336 0.0586 0.039 ...
..$ Inh_De_Jager_eQTL : num [1:62] 0.198 0.0861 0.0513 0.0872 0.0581 ...
..$ DLPFC_De_Jager_eQTL: num [1:62] 0.0511 0.0366 0.0225 0.0349 0.0249 ...
..$ PCC_De_Jager_eQTL : num [1:62] 0.0844 0.0551 0.0326 0.0543 0.0367 ...
..$ AC_De_Jager_eQTL : num [1:62] 0.0729 0.0483 0.0302 0.0484 0.0343 ...
..$ Mic_Kellis_eQTL : num [1:62] 0.319 1000 1000 1000 1000 ...
..$ Ast_Kellis_eQTL : num [1:62] 0.3174 0.0941 0.054 0.0926 0.0604 ...
..$ Oli_Kellis_eQTL : num [1:62] 0.2342 0.0776 0.0471 0.0783 0.0527 ...
..$ OPC_Kellis_eQTL : num [1:62] 0.3137 0.0972 0.0568 0.0953 0.0631 ...
..$ Exc_Kellis_eQTL : num [1:62] 0.1788 0.0754 0.0442 0.0756 0.0495 ...
..$ Inh_Kellis_eQTL : num [1:62] 0.2481 0.0783 0.0454 0.0762 0.0506 ...
..$ Ast.10_Kellis_eQTL : num [1:62] 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
$ null.s :'data.frame': 200 obs. of 16 variables:
..$ Mic_De_Jager_eQTL : num [1:200] 0.0633 0.0667 0.0866 0.0795 0.3139 ...
..$ Ast_De_Jager_eQTL : num [1:200] 0.0486 0.0502 0.0665 0.0603 0.2429 ...
..$ Oli_De_Jager_eQTL : num [1:200] 0.0498 0.0522 0.0671 0.0621 0.2443 ...
..$ OPC_De_Jager_eQTL : num [1:200] 0.0663 0.0688 0.0905 0.0842 0.3302 ...
..$ Exc_De_Jager_eQTL : num [1:200] 0.0225 0.0234 0.0308 0.0282 0.1132 ...
..$ Inh_De_Jager_eQTL : num [1:200] 0.0421 0.044 0.0577 0.0529 0.2107 ...
..$ DLPFC_De_Jager_eQTL: num [1:200] 0.0127 0.0133 0.0181 0.0163 0.0726 ...
..$ PCC_De_Jager_eQTL : num [1:200] 0.0201 0.0213 0.0289 0.0259 0.104 ...
..$ AC_De_Jager_eQTL : num [1:200] 0.0183 0.0191 0.0269 0.0237 0.1017 ...
..$ Mic_Kellis_eQTL : num [1:200] 0.0602 0.0619 0.0864 0.078 0.305 ...
..$ Ast_Kellis_eQTL : num [1:200] 0.0543 0.0565 0.0765 0.0707 0.2917 ...
..$ Oli_Kellis_eQTL : num [1:200] 0.0421 0.0444 0.0597 0.0545 0.2213 ...
..$ OPC_Kellis_eQTL : num [1:200] 0.0539 0.0566 0.077 0.0708 0.2842 ...
..$ Exc_Kellis_eQTL : num [1:200] 0.0329 0.0344 0.0457 0.0427 0.1859 ...
..$ Inh_Kellis_eQTL : num [1:200] 0.0461 0.047 0.0634 0.0589 0.2358 ...
..$ Ast.10_Kellis_eQTL : num [1:200] 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
$ random.s:'data.frame': 200 obs. of 16 variables:
..$ Mic_De_Jager_eQTL : num [1:200] 0.0603 0.096 0.0694 0.0657 0.0769 ...
..$ Ast_De_Jager_eQTL : num [1:200] 0.0466 0.074 0.0521 0.0507 0.0578 ...
..$ Oli_De_Jager_eQTL : num [1:200] 0.048 0.0752 0.0526 0.0517 0.0585 ...
..$ OPC_De_Jager_eQTL : num [1:200] 0.0635 0.1011 0.0711 0.0693 0.0789 ...
..$ Exc_De_Jager_eQTL : num [1:200] 0.0219 0.0345 0.0243 0.0237 0.0266 ...
..$ Inh_De_Jager_eQTL : num [1:200] 0.0405 0.0648 0.0457 0.0439 0.0501 ...
..$ DLPFC_De_Jager_eQTL: num [1:200] 0.0125 0.0196 0.0145 0.0139 0.016 ...
..$ PCC_De_Jager_eQTL : num [1:200] 0.0198 0.0305 0.0234 0.021 0.0251 ...
..$ AC_De_Jager_eQTL : num [1:200] 0.0176 0.0284 0.0209 0.0196 0.0229 ...
..$ Mic_Kellis_eQTL : num [1:200] 0.0598 0.0894 0.065 0.0641 0.0709 ...
..$ Ast_Kellis_eQTL : num [1:200] 0.0523 0.0838 0.0587 0.0559 0.0625 ...
..$ Oli_Kellis_eQTL : num [1:200] 0.041 0.0651 0.0453 0.0444 0.0509 ...
..$ OPC_Kellis_eQTL : num [1:200] 0.053 0.0829 0.0582 0.0571 0.0627 ...
..$ Exc_Kellis_eQTL : num [1:200] 0.033 0.0506 0.0348 0.0343 0.0387 ...
..$ Inh_Kellis_eQTL : num [1:200] 0.045 0.0694 0.0492 0.0481 0.0545 ...
..$ Ast.10_Kellis_eQTL : num [1:200] 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
$ strong.z:'data.frame': 62 obs. of 16 variables:
..$ Mic_De_Jager_eQTL : num [1:62] 0.132 0.191 3.253 -2.304 2.161 ...
..$ Ast_De_Jager_eQTL : num [1:62] -3.22 -2.77 3.46 -3.28 2.13 ...
..$ Oli_De_Jager_eQTL : num [1:62] -0.235 2.139 2.48 -1.319 -0.98 ...
..$ OPC_De_Jager_eQTL : num [1:62] 0.549 1.905 1.759 -0.67 1.38 ...
..$ Exc_De_Jager_eQTL : num [1:62] -5.33 0.811 6.79 -2.35 0.171 ...
..$ Inh_De_Jager_eQTL : num [1:62] -3.185 1.744 3.466 -2.428 -0.967 ...
..$ DLPFC_De_Jager_eQTL: num [1:62] -0.741 3.598 1.601 -9.025 -4.042 ...
..$ PCC_De_Jager_eQTL : num [1:62] -0.477 0.379 1.885 -3.231 1.151 ...
..$ AC_De_Jager_eQTL : num [1:62] -1.8726 -0.0757 2.7865 -4.0712 2.3891 ...
..$ Mic_Kellis_eQTL : num [1:62] -1.24 0 0 0 0 ...
..$ Ast_Kellis_eQTL : num [1:62] -3.134 1.227 1.355 -2.239 -0.729 ...
..$ Oli_Kellis_eQTL : num [1:62] -2.386 4.987 2.843 -3.956 -0.634 ...
..$ OPC_Kellis_eQTL : num [1:62] -0.3683 1.8424 1.1147 0.7209 0.0422 ...
..$ Exc_Kellis_eQTL : num [1:62] -3.343 1.056 3.308 -2.194 0.991 ...
..$ Inh_Kellis_eQTL : num [1:62] -1.641 0.585 2.653 -3.893 0.134 ...
..$ Ast.10_Kellis_eQTL : num [1:62] 0 0 0 0 0 0 0 0 0 0 ...
$ random.z:'data.frame': 200 obs. of 16 variables:
..$ Mic_De_Jager_eQTL : num [1:200] -0.232 -0.72 -0.614 -2.116 2.132 ...
..$ Ast_De_Jager_eQTL : num [1:200] -0.1425 -0.6982 1.2298 0.0127 -0.7073 ...
..$ Oli_De_Jager_eQTL : num [1:200] -1.463 -1.214 1.873 -0.953 0.641 ...
..$ OPC_De_Jager_eQTL : num [1:200] 0.813 0.571 -0.624 0.976 0.24 ...
..$ Exc_De_Jager_eQTL : num [1:200] 1.844 -0.786 1.009 -0.455 0.304 ...
..$ Inh_De_Jager_eQTL : num [1:200] -2.4158 -0.2683 0.273 0.0964 1.1196 ...
..$ DLPFC_De_Jager_eQTL: num [1:200] -0.4757 0.3416 0.0175 -0.3357 0.084 ...
..$ PCC_De_Jager_eQTL : num [1:200] -0.572 -0.813 -1.023 0.536 1.221 ...
..$ AC_De_Jager_eQTL : num [1:200] -0.8634 0.0689 -2.6709 0.9463 0.8218 ...
..$ Mic_Kellis_eQTL : num [1:200] 1.073 -1.91 -0.533 -1.03 0.413 ...
..$ Ast_Kellis_eQTL : num [1:200] 1.1182 -0.4138 -0.6432 -0.0395 -0.3177 ...
..$ Oli_Kellis_eQTL : num [1:200] -0.343 0.337 0.59 0.139 -0.693 ...
..$ OPC_Kellis_eQTL : num [1:200] -0.34097 -0.32177 -0.8792 0.00874 -0.29735 ...
..$ Exc_Kellis_eQTL : num [1:200] 0.0139 -1.705 0.7393 0.0371 -2.0365 ...
..$ Inh_Kellis_eQTL : num [1:200] -1.658 -0.478 -0.179 1.255 -0.402 ...
..$ Ast.10_Kellis_eQTL : num [1:200] 0 0 0 0 0 0 0 0 0 0 ...
$ null.z :'data.frame': 200 obs. of 16 variables:
..$ Mic_De_Jager_eQTL : num [1:200] -0.398 -0.444 -1.254 -0.495 -0.818 ...
..$ Ast_De_Jager_eQTL : num [1:200] -1.229 0.8 -0.452 0.654 -0.651 ...
..$ Oli_De_Jager_eQTL : num [1:200] -0.755 1.28 1.597 1.319 -1.394 ...
..$ OPC_De_Jager_eQTL : num [1:200] 0.876 1.318 -1.039 0.229 -0.177 ...
..$ Exc_De_Jager_eQTL : num [1:200] 0.675 -0.145 0.37 -1.379 -0.251 ...
..$ Inh_De_Jager_eQTL : num [1:200] 0.0466 -1.3118 1.1748 -1.5654 1.4129 ...
..$ DLPFC_De_Jager_eQTL: num [1:200] 0.041 -0.31 -0.605 -1.221 1.121 ...
..$ PCC_De_Jager_eQTL : num [1:200] -0.962 1.283 -0.819 0.82 -0.85 ...
..$ AC_De_Jager_eQTL : num [1:200] 0.077 0.8421 -1.835 -0.0184 -0.4436 ...
..$ Mic_Kellis_eQTL : num [1:200] 0.841 0.17 0.514 0.368 1.589 ...
..$ Ast_Kellis_eQTL : num [1:200] 2.17e-01 2.37e-05 1.53 1.81 -1.88 ...
..$ Oli_Kellis_eQTL : num [1:200] 1.762 1.519 -0.418 -0.665 1.209 ...
..$ OPC_Kellis_eQTL : num [1:200] -0.619 -0.4 -0.3 1.013 -0.322 ...
..$ Exc_Kellis_eQTL : num [1:200] 1.091 0.092 0.333 -0.208 0.433 ...
..$ Inh_Kellis_eQTL : num [1:200] -0.932 -1.002 0.659 1.344 -0.353 ...
..$ Ast.10_Kellis_eQTL : num [1:200] 0 0 0 0 0 0 0 0 0 0 ...
$ ZtZ : num [1:16, 1:16] 2.182 1.09 1.481 0.759 2.405 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:16] "Mic_De_Jager_eQTL" "Ast_De_Jager_eQTL" "Oli_De_Jager_eQTL" "OPC_De_Jager_eQTL" ...
.. ..$ : chr [1:16] "Mic_De_Jager_eQTL" "Ast_De_Jager_eQTL" "Oli_De_Jager_eQTL" "OPC_De_Jager_eQTL" ...
Steps#
This pipeline has two entry-point workflows. Run whichever matches your input; both write the strong / null / random z-score matrices for downstream multivariate analysis.
Step 1. susie_to_mash — build the MASH input from fine-mapped SuSiE results. Requires --name, --fine_mapping_meta (the index of fine-mapped RDS files) and --independent_variant_list; credible-set coverage and per-chunk size are tunable.
Important Note: susie_to_mash on Toy Data#
The susie_to_mash workflow (Step 1) requires genome-wide fine-mapping outputs across many regions and conditions — it is designed for a full-scale run, not a single chr22 toy region. On this toy dataset the step will not complete successfully due to insufficient overlapping regions in the LD reference.
For the toy MWE demo: a pre-generated protocol_example.mash_input.rds is already staged in input/mash_preprocessing/. You can proceed directly to mash_fit and mash_posterior using that file without running susie_to_mash yourself.
To run susie_to_mash on real data, provide a --fine-mapping-meta TSV pointing to genome-wide SuSiE outputs with multiple regions across multiple chromosomes.
Timing (toy chr22 dataset):
susie_to_mash: see noterandom_null_tensorqtl: ~varies
sos run pipeline/mash_preprocessing.ipynb susie_to_mash \
--name protocol_example_mash \
--fine_mapping_meta input/finemapping/protocol_example.fine_mapping_meta.tsv \
--independent_variant_list input/ld/protocol_example.ld_pruned_variants.txt.gz \
--cwd output/mash_preprocessing
# NOTE: On single-context toy data this produces a 2-context mash_input.rds.
# The pre-generated 16-context input is at:
# input/mash_preprocessing/protocol_example.mash_input.rds
Step 2. random_null_tensorqtl — extract the random and null effects directly from tensorQTL summary statistics. Requires --name, --region_file, --independent_variant_list, and the list of summary-statistics files / traits.
# Step B uses TensorQTL summary statistics to extract random/null effects.
# This step requires multi-trait sumstats; toy data has 1 trait so we
# document the interface only. Use the pre-generated mash_input.rds instead.
sos run pipeline/mash_preprocessing.ipynb random_null_tensorqtl \
--name protocol_example_mash \
--region_file input/finemapping/protocol_example.region \
--sum_files input/protocol_example.sumstats_list.txt \
--traits bulk_rnaseq \
--independent_variant_list input/ld/protocol_example.ld_pruned_variants.txt.gz \
--cwd output/mash_preprocessing
Command interface#
sos run pipeline/mash_preprocessing.ipynb -h
Workflow implementation#
The pipeline defines two workflows. susie_to_mash (steps susie_to_mash_1 and susie_to_mash_2) builds the MASH input from SuSiE fine-mapping outputs. random_null_tensorqtl extracts the random and null effects directly from tensorQTL summary statistics. The shared [global] block below declares the common parameters.
[global]
parameter: name = str
# Path to work directory where output locates
parameter: cwd = path("./output")
parameter: seed = 999
parameter: n_random = 10
parameter: n_null = 10
parameter: exclude_condition = []
# Containers that contains the necessary packages
parameter: container = ""
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "1h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 1
# This is in principle required; but in practice it can be optional if we are not exactly stringent about getting independent SNPs
parameter: independent_variant_list = path
# extract data for MASH from summary stats
[susie_to_mash_1]
parameter: per_chunk = 100
parameter: fine_mapping_meta = path
parameter: coverage = "cs_coverage_0.7"
# first 3 col are chr start end, 4th column is region ID, 5th col are file names, 6 col is all the condition names comma split
import pandas as pd
df = pd.read_csv(fine_mapping_meta, sep='\t', na_filter=False)
meta_data = [
"c(" + ",".join(f"'NA'" if y == '' else f"'{y}'" for y in row) + ")"
for index, row in df.iterrows()
]
def chunker(seq, size):
return (seq[pos:pos + size] for pos in range(0, len(seq), size))
# Desired group size
grouped_meta_data = list(chunker(meta_data, per_chunk))
input: for_each = "grouped_meta_data"
output: f"{cwd}/{name}_cache/{name}_batch{_index+1}.rds"
task: trunk_workers = job_size, walltime = walltime, trunk_size = 1, mem = mem, cores = numThreads, tags = f'{_output:bn}'
R: expand = "${ }"
# Toy-data-compatible direct construction of mash_input
# (Bypasses load_multitrait_R_sumstat which requires extract_top_loci not in this pecotmr version)
# Conceptually equivalent: extract z-scores per condition and build strong/random/null matrices
library(pecotmr)
db_dat <- readRDS("input/finemapping/protocol_example.sumstats_db.rds")
conditions <- names(db_dat)
# Extract z-scores per condition (from v2 sumstats_db structure)
all_z <- list()
for (cond in conditions) {
cond_data <- db_dat[[cond]]
for (region in names(cond_data)) {
region_data <- cond_data[[region]]
all_z[[cond]] <- data.frame(
variants = region_data$variant_names,
z = region_data$sumstats$z,
stringsAsFactors = FALSE
)
}
}
# Merge to common variants across all conditions
common_variants <- Reduce(intersect, lapply(all_z, function(d) d$variants))
# Build z-score matrix
z_matrix <- data.frame(variants = common_variants, stringsAsFactors = FALSE)
for (cond in conditions) {
idx <- match(common_variants, all_z[[cond]]$variants)
z_matrix[[cond]] <- all_z[[cond]]$z[idx]
}
z_only <- as.matrix(z_matrix[, conditions])
rownames(z_only) <- z_matrix$variants
# Build mash_input structure expected by susie_to_mash_2:
# list(region_id = list(strong=list(z=df), random=list(z=df), null=list(z=df)))
region_id <- paste(names(db_dat[[1]]), collapse=",")
strong_z_df <- as.data.frame(z_only)
random_z_df <- as.data.frame(z_only) # toy data: use all as random
null_z_df <- as.data.frame(z_only) # toy data: use all as null
res <- list()
res[[region_id]] <- list(
strong = list(z = strong_z_df),
random = list(z = random_z_df),
null = list(z = null_z_df)
)
cat(sprintf("Built mash_input for region %s: %d variants x %d conditions\n",
region_id, nrow(z_only), ncol(z_only)))
saveRDS(res, "${_output:r}", compress = "xz")
cat("Saved:", "${_output:r}", "\n")
[susie_to_mash_2]
input: group_by = "all"
output: f"{cwd}/{name}.mash_input.rds"
task: trunk_workers = job_size, walltime = walltime, trunk_size = 1, mem = mem, cores = numThreads, tags = f'{_output:bn}'
R: expand = "${ }", container = container, stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
library(pecotmr)
# pecotmr compat: installed build calls stale merge_matrices(); shadow loaders with merge_sumstats_matrices
local({
.ns <- asNamespace("pecotmr")
for (.fn in c("load_multitrait_R_sumstat", "load_multitrait_tensorqtl_sumstat")) {
if (!exists(.fn, envir=.ns, inherits=FALSE)) next
.f <- get(.fn, envir=.ns)
.bt <- gsub("merge_matrices", "merge_sumstats_matrices", deparse(body(.f)))
.bt <- gsub("^( *)ld_meta_file, id_column = .variants., remove_any_missing\\)", "\\1ld_meta_file = ld_meta_file, id_column = 'variants', remove_any_missing = remove_any_missing)", .bt)
body(.f) <- parse(text=paste(.bt, collapse="\n"))[[1]]; environment(.f) <- .ns
assign(.fn, .f, envir=globalenv())
}
})
# Function to reformat data
reformat_data <- function(dat) {
res <- list()
if (!is.null(dat$strong) && !is.null(dat$strong$z)) {
res$strong.z <- dat$strong$z
}
if (!is.null(dat$random) && !is.null(dat$random$z)) {
res$random.z <- dat$random$z
}
if (!is.null(dat$null) && !is.null(dat$null$z)) {
res$null.z <- dat$null$z
}
return(res)
}
# Function to rename rownames of dat
rename_rownames <- function(dat){
renamed_data <- lapply(names(dat), function(name) {
# Retrieve the data frame from the list
sublist <- dat[[name]]
renamed_sublist <- lapply(sublist, function(df) {
if(length(df)!=0&&!is.null(df)){
# Check if nrow of df is zero
if(nrow(df) > 0) {
# Only rename rownames if df has more than 0 rows
rownames(df) <- paste(rownames(df), name, sep = "_")
}
}
return(df)
})
return(renamed_sublist)
})
# Set the names of the modified list to match the original list
names(renamed_data) <- names(dat)
return(renamed_data)
}
batch_combined_data <- list()
for (batch in c(${_input:r,})){
res_reformatted <- lapply(readRDS(batch), reformat_data)
renamed_res_reformatted <- rename_rownames(res_reformatted)
merged_data <- list()
for(region in names(renamed_res_reformatted)){
merged_data <- merge_mash_data(merged_data, renamed_res_reformatted[[region]])
}
batch_combined_data <- merge_mash_data(batch_combined_data,merged_data)
}
saveRDS(batch_combined_data, "${_output:n}.with_na.rds", compress="xz")
print(head(batch_combined_data))
conditions = c("strong", "random", "null")
for (cond in conditions){
batch_combined_data <- filter_invalid_summary_stat(batch_combined_data, z = paste0(cond,".z"))
}
batch_combined_data$ZtZ = t(as.matrix(batch_combined_data$strong.z)) %*% as.matrix(batch_combined_data$strong.z) / nrow(batch_combined_data$strong.z)
saveRDS(batch_combined_data, ${_output:r}, compress="xz")
bash: expand = "${ }", container = container, stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
rm -rf ${cwd}/${name}_cache/
Get the random and null effects per analysis unit#
Anticipated Results#
The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.
[random_null_tensorqtl_1]
parameter: sum_files = paths
parameter: region_file = path
parameter: traits = paths
# Added params (were referenced in R block but undefined): runnable defaults
parameter: na_remove = True
parameter: z_only = True
parameter: expected_ncondition = len(traits)
import re
import pandas as pd
def find_matching_files_for_region(chr_id):
chr_number = chr_id[3:] # subset 1 from chr1
pattern_str = r"\.{chr_number}\."
pattern = re.compile(pattern_str.format(chr_number=chr_number))
paths = []
for sum_file in sum_files:
with open(sum_file, 'r') as af:
for aline in af:
if pattern.search(aline):
paths.append(aline.strip())
return ",".join(paths)
updated_regions = []
with open(region_file, 'r') as regions:
header = regions.readline().strip()
updated_regions.append(header + "\tpath\tregion")
for line in regions:
parts = line.strip().split("\t")
chr_id, start, end, gene_id = parts
paths = find_matching_files_for_region(chr_id)
updated_regions.append(f"{chr_id}\t{start}\t{end}\t{gene_id}\t{paths}\t{chr_id}:{start}-{end}")
meta_df = pd.DataFrame([line.split("\t") for line in updated_regions[1:]], columns=updated_regions[0].split("\t"))
meta = meta_df[['gene_id', 'path', 'region']].to_dict(orient='records')
input: for_each='meta'
output: f'{cwd:a}/{name}_cache/{name}.{_meta["gene_id"]}.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
R: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
region <- "${_meta['region']}"
phenotype_path <- unlist(strsplit("${_meta['path']}", ","))
dat <- tryCatch(
{
# Try to run the function
pecotmr::load_multitrait_tensorqtl_sumstat(sumstats_paths = phenotype_path, region = region,
trait_names = c(${traits:r,}), filter_file = NULL, remove_any_missing = TRUE, max_rows_selected = 300, nan_remove = ${"T" if na_remove else "F"})
},
error = function(e) {
warning("Attempt remove chr in region ID to load the data.")
# If an error occurs, modify the region and try again
pecotmr::load_multitrait_tensorqtl_sumstat(sumstats_paths = phenotype_path, region = gsub("chr", "", region),
trait_names = c(${traits:r,}), filter_file = NULL, remove_any_missing = TRUE, max_rows_selected = 300, nan_remove = ${"T" if na_remove else "F"})
}
)
exclude_condition = c(${",".join([repr(x) for x in exclude_condition])})
dat <- pecotmr::mash_ran_null_sample(dat, ${n_random}, ${n_null}, ${expected_ncondition}, exclude_condition, z_only = ${"TRUE" if z_only else "FALSE"}, seed=${seed})
saveRDS(dat, ${_output:r}, compress="xz")