Stratified LD Score Regression (S-LDSC) Enrichment#
Minimal working-example driver for the S-LDSC functional-enrichment pipeline. The Steps section below gives one ready-to-run sos run command per workflow, using the toy inputs symlinked under input/.
Environment note. Steps 1–2 (
make_annotation_files_ldscore,get_heritability) wrap the external polyfun toolkit (compute_ldscores.py,ldsc.py,munge_polyfun_sumstats.py) and require pre-computed reference-panel files (baseline-LD scores, LD weights,.frq, and PLINK.bed/.bim/.fam). polyfun is not installed in this environment and the reference panel is not shipped with the toy example, so those two steps cannot be executed here; their commands are provided for use on a system where polyfun and a matching panel are available. Steps 3–4 (postprocess,meta_subset) usepecotmr::sldsc_postprocessing_pipeline(available here) and read the.results/.logfiles produced by Step 2.
Description#
This notebook implements the pipeline of S-LDSC for LD score and functional enrichment analysis.
Important: the S-LDSC implementation comes from the polyfun package, not the original LDSC from bulik/ldsc GitHub repo.
Uses GWAS summary statistics together with annotation and LD reference-panel data to compute per-SNP heritability enrichment for each annotation. It supports single-annotation (individual contribution) and joint multi-annotation (independent contribution) analysis.
Background. LD Score Regression (Bulik-Sullivan et al. 2015) distinguishes confounding (e.g. population stratification) from true polygenic signal by regressing GWAS chi-square statistics on LD scores: SNPs tagging more variation (high LD score) show higher chi-square under true polygenicity, whereas confounding inflates statistics independently of LD. S-LDSC (Finucane et al. 2015) partitions heritability across overlapping annotation categories; standardized tau accounts for negative selection (Gazal et al. 2017). The model details and the tau*/EnrichStat definitions follow below.
Methods - Workflow Overview#
The pipeline runs in three stages: (1) annotation preparation and the S-LDSC regression (polyfun), (2) post-processing into standardized \(\tau^*\) and meta-analysis (the pecotmr package), and (3) optional re-meta on user-defined trait subsets. The concrete commands for stages 1-2 are in the Steps section below.
Stage 1 - polyfun. Three SoS workflows wrap polyfun: make_annotation_files_ldscore converts target annotations into polyfun .annot.gz and runs compute_ldscores.py (toggles compute_single and compute_joint, both default True; the joint dir is only emitted when \(N \geq 2\)); munge_sumstats_polyfun preprocesses each GWAS into LDSC format; get_heritability runs polyfun’s ldsc.py once per --target-anno-dir, enforcing the MAF cutoff via --frqfile-chr (maf_cutoff accepts only 0 or 0.05).
Stage 2 - pecotmr post-processing. A single pecotmr::sldsc_postprocessing_pipeline call consumes all polyfun outputs: it extracts \(\tau\), \(E\), \(h^2_g\), EnrichStat p-value and per-block jackknife \(\tau\) values; computes \(sd_C\) and \(M_{\mathrm{ref}}\) over the regression’s MAF-cutoff SNP set; standardizes \(\tau \to \tau^*\) for single and joint modes; auto-detects binary vs continuous annotations; and runs a DerSimonian-Laird random-effects meta-analysis across traits, producing three meta tables (\(\tau^*\) cross-type comparable, \(E\) within-binary, EnrichStat within-binary). Output is an R list with per_trait and meta entries.
Stage 3 - subset meta-analysis. pecotmr::meta_sldsc_random re-runs the meta on a trait subset without re-running the regression (lightweight, interactive):
res <- readRDS("sldsc_results.rds")
neuro <- c("AD_GWAX", "PD_meta", "ALS_meta")
meta_neuro_taustar <- pecotmr::meta_sldsc_random(
res$per_trait[neuro], category = "my_target_anno", quantity = "tau_star"
)
Theory#
The statistical model behind the pipeline is summarized below. Because the same framework underlies several of the workflow steps, the model, its stratified extension, and the tau-estimation / enrichment definitions are described together here rather than repeated per step.
LDSC model#
Under a polygenic assumption, in which effect sizes for variants are drawn independently from distributions with variance proportional to \(1/(p(1-p))\) where \(p\) is the minor allele frequency (MAF), the expected \(\chi^2\) statistic of variant \(j\) is:
where \(N\) is the sample size; \(M\) is the number of SNPs, so that \(h^2/M\) is the average heritability per SNP; \(a\) measures the contribution of confounding biases such as cryptic relatedness and population stratification; and \(\ell_j = \sum_k r^2_{jk}\) is the LD Score of variant \(j\), which measures the amount of genetic variation tagged by \(j\). A full derivation is given in the Supplementary Note of Bulik-Sullivan et al. (2015); an alternative derivation appears in the Supplementary Note of Zhu and Stephens (2017) AoAS.
Equation (1) shows that LD Score regression can compute SNP-based heritability for a phenotype from GWAS summary statistics alone, without requiring individual-level genotype data as REML and related methods do.
Stratified LDSC#
Heritability is the proportion of phenotypic variation that is due to variation in genetic values, and it can also be partitioned over disjoint or overlapping categories of SNPs.
Stratified LD Score Regression (S-LDSC) partitions heritability by leveraging both LD-score information and SNPs that have not reached genome-wide significance. S-LDSC exploits the fact that the \(\chi^2\) statistic for a given SNP reflects the cumulative effects of all SNPs tagged by it: in regions of high LD, the focal SNP captures the contribution of a group of nearby SNPs.
S-LDSC declares an annotation enriched for heritability if SNPs with high LD to that annotation have higher \(\chi^2\) statistics than SNPs with low LD to it.
Let \(a_{jC}\) denote the value of annotation \(C\) at SNP \(j\):
Binary annotation (e.g. an indicator for “in enhancer”, “in exon”, “in cell-type-specific peak”): \(a_{jC} \in \{0, 1\}\).
Continuous annotation (e.g. gene-specificity score, conservation score, continuous epigenomic signal): \(a_{jC} \in \mathbb{R}\).
Under a polygenic model the per-SNP heritability for SNP \(j\) is
and the expected \(\chi^2\) statistic of SNP \(j\) is
where \(\ell(j, C) = \sum_k a_{kC}\, r^2_{jk}\) is the partitioned LD score of SNP \(j\) with respect to annotation \(C\), and \(a\) measures confounding bias. Equation (2) allows joint estimation of all \(\tau_C\) via a (computationally simple) multiple regression of \(\chi^2_j\) against \(\ell(j, C)\).
Interpretation of \(\tau_C\):
Binary \(C\): \(\tau_C\) is the additive increase in per-SNP heritability for SNPs in category \(C\), on top of the contributions from any other annotations they belong to.
Continuous \(C\): \(\tau_C\) is the additive change in per-SNP heritability per unit increase in the value of annotation \(C\).
For application to real data and comparisons to other methods, see the three papers cited at the top of this notebook.
Tau Estimation and Enrichment Analysis#
Goal: quantify the contribution of functional annotations to trait heritability and assess statistical significance, accounting for LD structure and (for continuous annotations) annotation scale.
The pipeline has two computational layers:
Regression layer — the S-LDSC regression itself, performed by the polyfun engine. We do not re-implement this.
Post-processing layer — standardization, differential per-SNP heritability, binary/continuous detection, and random-effects meta-analysis across traits. Implemented in the
pecotmrR package (R/sldsc_wrapper.R).
The notation below tags each modeling quantity as (polyfun) or (pecotmr).
Notation#
For each annotation \(C\) we use:
\(\pi^{h^2}_C\) = proportion of trait heritability \(h^2_g\) assigned to annotation \(C\).
\(\pi^{M}_C\) = proportion of (effective) SNPs in annotation \(C\). For binary annotations this is \(M_C / M_{\mathrm{ref}}\); for continuous annotations it is the share of total annotation weight in \(C\).
Reference panel and MAF cutoff#
All LD-derived quantities — partitioned LD scores for the 97 baseline annotations and for our \(K\) target annotations, the LD-score-regression weights, allele frequencies, and the SNP set — are computed against our own LD reference panel. We do not mix in pre-computed quantities from external panels (e.g. 1000G); \(M_{\mathrm{ref}}\) throughout this notebook denotes the number of common SNPs in our panel.
By default we restrict to MAF \(> 5\%\) per the sLDSC recommendation: rare-variant LD is unstable and HapMap3-style regression weights are common-variant by construction. The cutoff is exposed as the SoS parameter maf_cutoff (default \(0.05\)); the regression, the standardized \(sd_C\), and \(M_{\mathrm{ref}}\) are all evaluated on the same MAF \(>\) cutoff SNP set. If allele-frequency files are not available the pipeline fails; the user must explicitly set maf_cutoff = 0 to opt out (not recommended).
Quantities from the regression layer (polyfun)#
Solving Equation (2) jointly across annotations, with 200-block genomic jackknife for inference, is performed by polyfun’s ldsc.py. From each polyfun run we obtain, per annotation:
\(\tau_C\) and its standard error — (polyfun).
\(\pi^{h^2}_C\) and \(\pi^{M}_C\) — (polyfun).
\(E_C = \pi^{h^2}_C / \pi^{M}_C\) and its standard error — (polyfun).
The p-value of the differential per-SNP heritability test (defined below) — (polyfun), computed internally with the full coefficient covariance matrix.
We also obtain, per run:
The total trait heritability \(h^2_g\) — (polyfun).
The 200-block jackknife delete-values of \(\tau_C\) — (polyfun).
Quantities from the post-processing layer (pecotmr)#
From the polyfun outputs above plus our reference panel, the post-processing layer computes:
\(sd_C\) — per-annotation standard deviation over MAF \(>\) cutoff SNPs — (pecotmr:
compute_sldsc_annot_sd).\(M_{\mathrm{ref}}\) — reference SNP count at the MAF cutoff — (pecotmr:
compute_sldsc_M_ref).Whether each annotation is binary or continuous — (pecotmr:
is_binary_sldsc_annot).\(\tau^*_C\) point estimate and per-block \(\tau^*_C\) — (pecotmr:
standardize_sldsc_trait).EnrichStat point estimate and its standard error (formula below) — (pecotmr:
standardize_sldsc_trait).DerSimonian-Laird random-effects meta-analysis of \(\tau^*_C\), \(E_C\), or EnrichStat across traits — (pecotmr:
meta_sldsc_random).
The top-level entry point pecotmr::sldsc_postprocessing_pipeline orchestrates all of the above.
Standardized tau (\(\tau^*\)) — (pecotmr)#
\(\tau_C\) has units that depend on the scale of the annotation and on the total heritability of the trait, so raw \(\tau\) is not directly comparable across annotations or across traits. We compute the standardized version (Gazal et al. 2017)
interpreted as the additive change in per-SNP heritability associated with a 1 standard deviation increase in annotation \(C\), divided by the average per-SNP heritability across all SNPs. \(\tau^*_C\) is dimensionless and comparable across annotations and across traits. In a joint multi-annotation regression it is the independent contribution of annotation \(C\) after controlling for overlapping effects of the others.
Here \(sd_C\) is the standard deviation of annotation \(C\) across reference SNPs (MAF \(>\) cutoff), \(M_{\mathrm{ref}}\) is the count of those SNPs, and \(h^2_g\) is the trait heritability. Applying the same scaling to each of the 200 jackknife blocks yields per-block \(\tau^*_C\) values; their sample variance gives the jackknife standard error $\(SE^{\text{jackknife}}(\tau^*_C) \;=\; \sqrt{\,\tfrac{(B-1)^2}{B}\, \mathrm{Var}_b(\tau^*_{C,(b)})\,}\)\( with \)B = 200$, used as the per-trait input to cross-trait meta-analysis.
Differential per-SNP heritability (“EnrichStat”) — (polyfun + pecotmr)#
To test whether the per-SNP heritability inside annotation \(C\) differs from outside it (Finucane et al. 2015):
The point-estimate p-value of this test is computed by polyfun internally using the full coefficient covariance and reported as Enrichment_p. Its standard error is recovered from the reported p-value:
This per-trait point + SE is the input to cross-trait meta-analysis.
Reporting: binary vs. continuous annotations — (pecotmr)#
The estimation machinery applies to both annotation types, but the headline quantity to report within each type differs.
For a binary annotation (e.g. enhancer indicator, exon, in/out of a cell-type peak), \(\pi^{M}_C = M_C / M_{\mathrm{ref}}\) has a direct interpretation and \(E_C\) reads as “the category explains \(E_C\)-fold more heritability than its share of SNPs.” The within-type headline quantities are therefore \(E_C\) and the EnrichStat p-value; \(\tau^*_C\) is reported alongside.
For a continuous annotation (e.g. gene-specificity score, conservation score, continuous epigenomic signal), \(E_C\) depends on the scale of the annotation: rescaling the annotation by a constant changes \(E_C\) even though the underlying biology is unchanged. The within-type headline quantities are therefore \(\tau^*_C\) and its p-value; \(E_C\) is reported alongside but should not be interpreted for continuous annotations.
The pipeline determines whether an annotation is binary by inspecting whether its values lie in \(\{0, 1\}\) and selects the appropriate within-type headline statistic automatically (pecotmr).
From the official LDSC tutorial (Partitioned Heritability from Continuous Annotations):
“Enrichment is (Prop. heritability) / (Prop. SNPs). These outputs make sense only for binary annotations. Do not try to interpret them for continuous annotations. Using
--print-coefficientsoutputs the regression coefficients and corresponding standard errors and Z score for each annotation. These coefficients measure the additional contribution of one annotation to the model and are interpretable for both binary and continuous annotations.”The pipeline always passes
--print-coefficientsto polyfun for this reason.
Cross-type comparison: always use \(\tau^*_C\) — (pecotmr)#
For an apple-to-apple comparison across binary and continuous annotations — ranking annotations on a single axis, meta-analyzing a mixed set, or reporting a leaderboard that pools both types — use \(\tau^*_C\). The standardization in Gazal et al. (2017) was designed for exactly this purpose: \(sd_C = \sqrt{p(1-p)}\) for a binary annotation (where \(p\) is the proportion in the category) and \(sd_C = \) empirical standard deviation for a continuous annotation, so the resulting \(\tau^*_C\) is dimensionless and has the same interpretation in both cases — additive change in per-SNP heritability per 1 SD increase in the annotation, normalized by the average per-SNP heritability. \(E_C\) does not have this property and must not be compared across types.
The pipeline emits both \(E_C\) and \(\tau^*_C\) for every annotation, with the binary/continuous flag, so callers can pick the right column for the comparison they are making.
Joint analysis — (polyfun runs the regression; pecotmr standardizes both modes)#
For joint analysis (multiple annotations fit together), both \(\tau\) and \(E\) are conditional on the other annotations in the model. We report joint \(\tau^*_C\) as the independent contribution of annotation \(C\) after controlling for the others. The annotation-prep step exposes two independent toggles, compute_single and compute_joint (both default True), so the user can produce the \(N\) single-target outputs, the joint output, or both in one invocation. With both defaults the post-processing layer reads all \(N+1\) regression outputs per trait and presents single + joint side-by-side. When the joint subset is decided after looking at single-target results (exploratory \(\rightarrow\) conditional workflow), the user runs the annotation-prep step a second time with compute_single=False on the curated subset.
Meta-Analysis across Traits (Random Effects) — (pecotmr)#
DerSimonian-Laird random-effects meta-analysis of per-annotation estimates across traits, implemented in pecotmr::meta_sldsc_random (which delegates the numerics to rmeta::meta.summaries(..., method = "random")):
where \(\hat\theta_i\) is the per-trait estimate and \(SE_i\) its standard error:
For \(\tau^*_C\) meta: \(SE_i\) is the jackknife SE from the per-block \(\tau^*_C\) values.
For \(E_C\) meta: \(SE_i\) is the polyfun-reported
Enrichment_std_error.For EnrichStat meta: \(SE_i\) is the back-solved SE from polyfun’s
Enrichment_p.
For binary-annotation enrichment reporting we use a two-channel meta: the effect size and SE come from the meta on \(E_C\) (interpretable on the original enrichment-fold scale), while the p-value comes from the meta on EnrichStat (the appropriate hypothesis test). The pipeline produces a default meta over all supplied traits; users can re-run meta on any subset of traits without re-running the regression layer.
Minimal Working Example (MWE)#
The steps below run the four pipeline workflows end to end on the example data. Each step lists what it does, then the sos run command to execute it.
Step 1. make_annotation_files_ldscore#
Annotation preparation and S-LDSC regression (polyfun). This step accepts a single annotation file for a single-tau analysis (one annotation as input) or several annotation files for a joint-tau analysis (multiple annotations as input).
Inputs#
1. Target Annotation File#
Purpose: Specifies the user-provided (“target”) genome annotation files. The pipeline supports both binary and continuous annotations; the type is auto-detected per annotation column.
Formats:
Text file (
.txt) listing per-chromosome paths to annotation files. Annotation files can be.rds/.tsv/.txt.Alternatively, files for specific chromosomes can be provided directly.
Multiple target annotations are supported in one input file (one column per annotation, prefixed
path,path1,path2, …). Single-target and joint-target analyses are produced automatically in one pipeline pass.Format (the score column is optional; if absent, score is set to 1):
is_range = False:
chr pos score 1 10001 1 1 10002 1
is_range = True:
chr start end score 1 10001 20001 1 1 30001 40001 1
2. Reference Annotation File (baseline-LD)#
Purpose: Provides the baseline annotations (typically the 97-annotation baseline-LD model from Gazal et al. 2017) in
.annot.gzformat for each chromosome. The baseline conditions every regression.Formats:
Text file listing baseline annotation files for all chromosomes.
Alternatively, files for specific chromosomes can be provided directly.
3. Genome Reference File#
Purpose: PLINK-format
.bed/.bim/.famfiles for our LD reference panel, per chromosome. This is the panel against which all LD-derived quantities (target LD scores, baseline LD scores, regression weights, allele frequencies) must be computed. Do not mix files derived from different panels (e.g. 1000G vs ADSP).Formats:
Text file listing per-chromosome reference files, or files for specific chromosomes.
4. SNP List#
Purpose: Specifies the SNPs to include in LDSC analysis (typically a HapMap3-style list).
Format: A list of
rsids, one per line.
sos run pipeline/sldsc_enrichment.ipynb make_annotation_files_ldscore \
--annotation_file input/enrichment/sldsc/colocboost_test_annotation_path.txt \
--reference_anno_file input/enrichment/sldsc/reference_annotation0.txt \
--genome_ref_file input/enrichment/sldsc/genome_reference_bfile.txt \
--annotation_name protocol_example \
--plink_name reference. --baseline_name annotations. --weight_name weights. \
--python_exec python \
--polyfun_path polyfun \
--cwd output/sldsc_ldscore -j 4
/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
INFO: Running make_annotation_files_ldscore:
INFO: make_annotation_files_ldscore (index=1) is completed.
INFO: make_annotation_files_ldscore (index=3) is completed.
INFO: make_annotation_files_ldscore (index=2) is completed.
INFO: make_annotation_files_ldscore (index=0) is completed.
INFO: make_annotation_files_ldscore (index=5) is completed.
INFO: make_annotation_files_ldscore (index=6) is completed.
INFO: make_annotation_files_ldscore (index=4) is completed.
INFO: make_annotation_files_ldscore (index=7) is completed.
INFO: make_annotation_files_ldscore (index=9) is completed.
INFO: make_annotation_files_ldscore (index=10) is completed.
INFO: make_annotation_files_ldscore (index=8) is completed.
INFO: make_annotation_files_ldscore (index=11) is completed.
INFO: make_annotation_files_ldscore (index=14) is completed.
INFO: make_annotation_files_ldscore (index=13) is completed.
INFO: make_annotation_files_ldscore (index=12) is completed.
INFO: make_annotation_files_ldscore (index=15) is completed.
INFO: make_annotation_files_ldscore (index=18) is completed.
INFO: make_annotation_files_ldscore (index=16) is completed.
INFO: make_annotation_files_ldscore (index=17) is completed.
INFO: make_annotation_files_ldscore (index=19) is completed.
INFO: make_annotation_files_ldscore (index=21) is completed.
INFO: make_annotation_files_ldscore (index=20) is completed.
INFO: make_annotation_files_ldscore output: /restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_ldscore/protocol_example_single_1/protocol_example_single_1.1.annot.gz /restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_ldscore/protocol_example_single_1/protocol_example_single_1.1.l2.ldscore.parquet... (66 items in 22 groups)
INFO: Workflow make_annotation_files_ldscore (ID=weae0ca3fdf468fd8) is executed successfully with 1 completed step and 22 completed substeps.
Munge summary statistics (preprocessing, run before Step 2)#
Before estimating heritability, each raw GWAS summary-statistics file must be converted into the LDSC-compatible format consumed by get_heritability. Run munge_sumstats_polyfun once per trait; the munged files are then collected in the directory passed to get_heritability via --sumstat_dir.
# sos run pipeline/sldsc_enrichment.ipynb munge_sumstats_polyfun \
# --sumstats data/polyfun_new/example_data/trait_raw_sumstats.tsv \
# --n 0 \
# --min-info 0.6 \
# --min-maf 0.001 \
# --chi2-cutoff 30 \
# --polyfun_path data/github/polyfun \
# --cwd data/polyfun_new/example_data
Step 2. get_heritability#
Inputs
1. Allele Frequency Files (.frq, our panel)#
Purpose: PLINK
.frqfiles for the reference panel, used to enforce the MAF cutoff. Required whenmaf_cutoff > 0(default0.05); the pipeline fails if missing unlessmaf_cutoff = 0is explicitly set.
2. GWAS Summary Statistics#
Purpose: One munged sumstats file per trait, listed in a text file (
all_traits_file). The pipeline runs the regression once per trait per single/joint mode.Format:
CAD_META.filtered.sumstats.gz UKB.Lym.BOLT.sumstats.gz
sos run pipeline/sldsc_enrichment.ipynb get_heritability \
--target_anno_dirs output/sldsc_ldscore/protocol_example_single_1 \
--all_traits_file input/enrichment/sldsc/sumstats_test_all.txt \
--sumstat_dir input/enrichment/sldsc \
--baseline_ld_dir input/enrichment/sldsc \
--weights_dir input/enrichment/sldsc \
--plink_name reference. --baseline_name annotations. --weight_name weights. \
--annotation_name protocol_example --python_exec python \
--polyfun_path ../polyfun \
--maf_cutoff 0 --cwd output/sldsc_heritability -j 4
/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
INFO: Running get_heritability:
maf_cutoff = 0: skipping MAF filter (--not-M-5-50)
maf_cutoff = 0: skipping MAF filter (--not-M-5-50)
maf_cutoff = 0: skipping MAF filter (--not-M-5-50)
python: can't open file '/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/../polyfun/ldsc.py': [Errno 2] No such file or directory
python: can't open file '/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/../polyfun/ldsc.py': [Errno 2] No such file or directory
python: can't open file '/restricted/projectnb/xqtl/jaempawi/xqtl-protocol/../polyfun/ldsc.py': [Errno 2] No such file or directory
INFO: get_heritability (index=1) is completed.
INFO: get_heritability (index=0) is completed.
INFO: get_heritability (index=2) is completed.
INFO: get_heritability output: /restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_heritability/protocol_example_single_1/sumstats.parquet.log /restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_heritability/protocol_example_single_1/sumstats.parquet.results... (6 items in 3 groups)
INFO: Workflow get_heritability (ID=wa79eac1662f5dd2d) is executed successfully with 1 completed step and 3 completed substeps.
Step 3. Post-processing (pecotmr) and meta-analysis#
Post-Processing (pecotmr::sldsc_postprocessing_pipeline)
A single R function call consumes all polyfun outputs for the run and produces the final tables:
Reads each polyfun output and extracts \(\tau\), \(E\), \(h^2_g\), EnrichStat p-value, and per-block jackknife \(\tau\) values.
Computes annotation \(sd_C\) and \(M_{\mathrm{ref}}\) over the same MAF \(>\) cutoff SNP set as the regression.
Standardizes \(\tau \to \tau^*\) for both single-tau and joint-tau modes, including the per-block versions for jackknife SE.
Auto-detects whether each annotation is binary or continuous and tags every output row accordingly.
Reports the number and names of baseline annotations encountered (via
message()) for transparency.Runs the default DerSimonian-Laird random-effects meta-analysis across all supplied traits, producing three meta tables: \(\tau^*\) (cross-type comparable), \(E\) (within-binary), and EnrichStat (within-type).
Outputs are returned as an R list with two top-level entries: per_trait (one tidy data frame per trait, single + joint estimates side-by-side per target) and meta (three tables, one per quantity, with rows = target annotations and columns = single/joint mean/SE/p plus an is_binary flag).
The [postprocess] step reads all polyfun outputs under heritability_cwd
(which contains the \(N\) single-target subdirectories and optionally the
joint subdirectory) and calls pecotmr::sldsc_postprocessing_pipeline()
to produce per-trait standardized tables and the default random-effects
meta across all traits.
Use --target-categories-label (same order as --target-categories) to give the target annotations friendly names in the output — e.g. --target-categories ANNOT_1_0 ANNOT_2_0 --target-categories-label quantile_eQTL eQTL makes the target column read quantile_eQTL / eQTL instead of ANNOT_1_0 / ANNOT_2_0 (the original names are kept in params$target_categories_orig). Omit it to keep the polyfun .results names.
sos run pipeline/sldsc_enrichment.ipynb postprocess \
--traits_file input/enrichment/sldsc/sumstats_test_all.txt \
--heritability_cwd output/sldsc_heritability \
--target_categories ANNOT_0 --target_categories_label protocol_example_annotation \
--target_anno_dir output/sldsc_ldscore/protocol_example_single_1 \
--annotation_name protocol_example --python_exec python \
--polyfun_path ../polyfun \
--maf_cutoff 0 --cwd output/sldsc_postprocess -j 4
/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
INFO: Running postprocess:
INFO: postprocess is completed.
INFO: postprocess output: /restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_postprocess/protocol_example.sldsc_postprocess.rds
INFO: Workflow postprocess (ID=wb64dc2b84958960c) is executed successfully with 1 completed step.
Step 4. Subset Meta-Analysis (pecotmr::meta_sldsc_random) (optional)#
The default meta in Step 2 pools all traits the user supplied. To re-run the meta on a subset (e.g., neurodegenerative traits only, or autoimmune traits only) without re-running the regression layer:
res <- readRDS("sldsc_results.rds")
neuro <- c("AD_GWAX", "PD_meta", "ALS_meta")
meta_neuro_taustar <- pecotmr::meta_sldsc_random(
res$per_trait[neuro], category = "my_target_anno", quantity = "tau_star"
)
This step is light-weight and can be run interactively.
The default meta in step 3 pools all traits supplied to [postprocess]. Use [meta_subset] to re-run the meta on a user-defined trait subset (e.g., neurodegenerative traits only, autoimmune traits only) without re-running the regression or the per-trait standardization. The subset operates on the cached .sldsc_postprocess.rds output; it is light-weight and can be run interactively or in batch.
sos run pipeline/sldsc_enrichment.ipynb meta_subset \
--postprocess_rds output/sldsc_postprocess/protocol_example.sldsc_postprocess.rds \
--subset_traits_file input/enrichment/sldsc/sumstats_test_category1.txt \
--subset_name category1 --target_categories ANNOT_0 \
--annotation_name protocol_example --python_exec python \
--polyfun_path ../polyfun \
--maf_cutoff 0 --cwd output/sldsc_postprocess -j 4
/restricted/projectnb/xqtl/jaempawi/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
INFO: Running meta_subset:
INFO: meta_subset is completed.
INFO: meta_subset output: /restricted/projectnb/xqtl/jaempawi/xqtl-protocol/output/sldsc_postprocess/protocol_example.category1.meta.rds
INFO: Workflow meta_subset (ID=w09a2a0530119f1d2) is executed successfully with 1 completed step.
Output#
Output summary#
Stage |
Cached on disk |
Recomputable from |
Purpose |
|---|---|---|---|
Target LD scores |
per-annotation, once |
annotation + reference panel |
input to every regression |
polyfun |
yes |
regression run |
\(\tau\), \(E\), EnrichStat |
Per-trait standardized table |
yes (RDS) |
polyfun outputs + \(sd_C\) + \(M_{\mathrm{ref}}\) |
reporting + meta |
Default meta tables |
yes (RDS) |
per-trait standardized |
headline figures |
Subset meta |
re-run on demand |
per-trait standardized |
custom analyses |
Per-stage outputs#
Each workflow writes into its --cwd:
make_annotation_files_ldscore — polyfun
.annot.gzfiles plus per-annotation LD-score directories (.l2.ldscore.{gz,parquet},.l2.M,.l2.M_5_50). One single-target directory per annotation, plus (when more than one annotation) a joint directory.get_heritability — per trait and per target directory, the S-LDSC regression outputs
<trait>.{results,log,part_delete}. The.resultsCategorycolumn carries the annotation name with a_<ref-ld-index>suffix.postprocess — a single
<annotation_name>.sldsc_postprocess.rdscontaining per-trait tables (Gazal-style tau*, EnrichStat with back-solved jackknife SE) and three DerSimonian–Laird random-effects meta tables (tau*, E, EnrichStat).meta_subset — a re-meta of the cached
.sldsc_postprocess.rdsover a user-defined trait subset (lightweight; no regression re-run).
Anticipated Results#
Produces per-annotation enrichment statistics (tau, enrichment, p-value) from stratified LD score regression.
Command interface#
List all workflows and their options:
sos run pipeline/sldsc_enrichment.ipynb -h
Workflow implementation#
The cells below are the pipeline definition (preserved from the original notebook): the [global] parameter block and the workflow step bodies.
[global]
# Path to the work directory of the analysis.
parameter: cwd = path('output')
# Prefix for the analysis output
parameter: annotation_name = str
parameter: python_exec = "python" # e.g. "/home/you/.conda/envs/polyfun/bin/python"
parameter: polyfun_path = path # e.g. "/home/you/tools/polyfun"
# MAF cutoff for sLDSC. Default 0.05 per sLDSC recommendation (rare-variant LD is unstable
# and HapMap3-style regression weights are common-variant by construction).
# Set to 0 to opt out of MAF filtering (NOT recommended; only use if you understand the implications).
# Other values would require recomputing LD scores at that cutoff.
parameter: maf_cutoff = 0.05
# for make_annotation_files_ldscore workflow:
parameter: annotation_file = path()
parameter: reference_anno_file = path()
parameter: genome_ref_file = path() # with .bed
parameter: chromosome = []
parameter: snp_list = path()
parameter: ld_wind_kb = 0 # use kb if the value is provided
parameter: ld_wind_cm = 1.0 # default using ld_wind_cm
# for get_heritability workflow.
# Note: all LD-derived inputs (baseline LD scores, target LD scores, regression weights,
# allele frequencies) must be computed against the same reference panel as `genome_ref_file`.
# Do not mix files derived from different reference panels (e.g., 1000G vs ADSP).
parameter: all_traits_file = path() # txt file, each row contains all GWAS summary statistics name: e.g. CAD_META.filtered.sumstats.gz
parameter: sumstat_dir = path() # Directory containing GWAS summary statistics
parameter: target_anno_dir = path() # Directory containing target annotation files: output of ldscore
parameter: baseline_ld_dir = path() # Directory containing baseline LD score files (computed against our panel)
parameter: frqfile_dir = path() # Directory containing allele frequency files (.frq, our panel)
parameter: plink_name = "ADSP_chr"
parameter: weights_dir = path() # Directory containing LD weights (computed against our panel)
parameter: baseline_name = "baseline_chr" # Prefix of baseline annotation files
parameter: weight_name = "weights_chr" # Prefix of LD weights files
parameter: n_blocks = 200
# Number of threads
parameter: numThreads = 16
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
Make Annotation File#
[make_annotation_files_ldscore]
# Annotation preparation. Takes one annotation_file with N target annotations
# and produces, in one invocation, any combination of:
# - N single-target LD-score directories (when compute_single = TRUE, default)
# - 1 joint LD-score directory containing all N (when compute_joint = TRUE
# and N >= 2, default)
#
# Outputs per chromosome <chr>:
# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.annot.gz (i in 1..N, when compute_single)
# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.ldscore.{parquet|gz}
# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.M
# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.M_5_50 (when .frq present)
#
# <cwd>/<annotation_name>_joint/<annotation_name>_joint.<chr>.{...} (when compute_joint and N>=2)
#
# Workflows:
# - Workflow A ("all at once"): compute_single=TRUE, compute_joint=TRUE (defaults).
# Produces both, fits the case where you have already chosen the joint set.
# - Workflow B ("exploratory then conditional"):
# Step 1: compute_single=TRUE, compute_joint=FALSE.
# Run on N candidate annotations -> N single-target dirs.
# Inspect single-target results, identify K significant ones.
# Step 2: compute_single=FALSE, compute_joint=TRUE.
# Run on a NEW annotation_file with the K selected annotations
# -> 1 joint dir with the conditional model.
#
# --- snplist (--snp_list) vs no-snplist: which polyfun script, output format,
# column name, and the CM requirement ---
# --snp_list given -> ldsc.py --l2 --print-snps -> output .l2.ldscore.gz
# --snp_list absent -> compute_ldscores.py -> output .l2.ldscore.parquet
#
# LD-score column name (this is what becomes the .results "Category" in
# [get_heritability], with a "_<ref-ld-index>" suffix appended there):
# * compute_ldscores.py ALWAYS keeps the annot column name(s):
# single annot column "ANNOT" -> ldscore column "ANNOT"
# joint annot columns "ANNOT_1","ANNOT_2",... -> "ANNOT_1","ANNOT_2",...
# * ldsc.py --l2 has a quirk: with EXACTLY ONE annotation (n_annot == 1) it
# HARD-CODES the ldscore column name to "L2" and DROPS the annot's original
# column name. With >=2 annotations it uses "<annot_name>L2"
# ("ANNOT_1L2","ANNOT_2L2",...).
# => a single-target snplist run reports "L2_0" in .results, while a
# single-target no-snplist run reports "ANNOT_0". [postprocess] auto-
# detects either; only matters if you pass --target-categories explicitly.
#
# CM column requirement for snplist: ldsc.py --l2 --print-snps requires the
# target annot to (a) carry a "CM" (centimorgan) column and (b) line up with
# the plink .bim (same SNP set, same row order). This step handles both
# internally (normalize_for_ldsc: takes CM from the .bim 4th column, re-expands
# the annot onto the .bim rows, filling 0). Therefore the plink .bim files MUST
# carry genetic-map (cM) positions when using --ld-wind-cm (the default);
# if your .bim has 0 in the cM column, switch to --ld-wind-kb instead.
#
parameter: compute_single = True
parameter: compute_joint = True
parameter: score_column = 3
parameter: is_range = False
import pandas as pd
import os
if not (compute_single or compute_joint):
raise ValueError("[make_annotation_files_ldscore] at least one of compute_single or compute_joint must be TRUE")
def adapt_file_path(file_path, reference_file):
reference_path = os.path.dirname(reference_file)
if os.path.isfile(file_path):
return file_path
file_name = os.path.basename(file_path)
if os.path.isfile(file_name):
return file_name
file_in_ref_dir = os.path.join(reference_path, file_name)
if os.path.isfile(file_in_ref_dir):
return file_in_ref_dir
file_prefixed = os.path.join(reference_path, file_path)
if os.path.isfile(file_prefixed):
return file_prefixed
raise FileNotFoundError(f"No valid path found for file: {file_path}")
# ---- Parse inputs and determine N ----
if (str(annotation_file).endswith(('rds', 'tsv', 'txt', 'tsv.gz', 'txt.gz')) and
str(reference_anno_file).endswith('annot.gz')):
# Case 1: direct file paths (single-chromosome run). Multiple target files separated by ','.
target_files_direct = str(annotation_file).split(',')
N_targets = len(target_files_direct)
target_names = [f"target_{i+1}" for i in range(N_targets)]
input_files = [[*target_files_direct, str(reference_anno_file), str(genome_ref_file)]]
if len(chromosome) > 0:
input_chroms = [int(x) for x in chromosome]
else:
input_chroms = [0]
else:
# Case 2: txt list with #id and one or more 'path' columns
target_files_df = pd.read_csv(annotation_file, sep="\t")
reference_files = pd.read_csv(reference_anno_file, sep="\t")
genome_ref_files = pd.read_csv(genome_ref_file, sep="\t")
target_files_df["#id"] = [x.replace("chr", "") for x in target_files_df["#id"].astype(str)]
reference_files["#id"] = [x.replace("chr", "") for x in reference_files["#id"].astype(str)]
genome_ref_files["#id"] = [x.replace("chr", "") for x in genome_ref_files["#id"].astype(str)]
path_columns = [c for c in target_files_df.columns if c.startswith('path')]
N_targets = len(path_columns)
target_names = path_columns[:] # 'path', 'path1', 'path2', ...
for col in path_columns:
target_files_df[col] = target_files_df[col].apply(lambda x: adapt_file_path(x, str(annotation_file)))
reference_files["path"] = reference_files["path"].apply(lambda x: adapt_file_path(x, str(reference_anno_file)))
genome_ref_files["path"] = genome_ref_files["path"].apply(lambda x: adapt_file_path(x, str(genome_ref_file)))
merged = target_files_df.merge(reference_files, on="#id").merge(genome_ref_files, on="#id")
if len(chromosome) > 0:
merged = merged[merged["#id"].isin([str(c) for c in chromosome])]
rows = merged.values.tolist()
input_chroms = [r[0] for r in rows]
input_files = [[*r[1:N_targets+1], r[-2], r[-1]] for r in rows]
# ---- Determine output format ----
use_print_snps = snp_list.is_file()
ldscore_ext = "l2.ldscore.gz" if use_print_snps else "l2.ldscore.parquet"
if ld_wind_kb > 0:
use_kb_window = True
ld_window_param = ld_wind_kb
ld_window_flag = "--ld-wind-kb"
else:
use_kb_window = False
ld_window_param = ld_wind_cm
ld_window_flag = "--ld-wind-cm"
emit_single = compute_single
emit_joint = compute_joint and N_targets >= 2
# ---- Build per-chromosome output list ----
def chrom_outputs(chrom):
outs = []
if emit_single:
for i in range(N_targets):
name = f"{annotation_name}_single_{i+1}"
prefix = f"{cwd:a}/{name}/{name}.{chrom}"
outs += [f"{prefix}.annot.gz", f"{prefix}.{ldscore_ext}", f"{prefix}.l2.M"]
if emit_joint:
name = f"{annotation_name}_joint"
prefix = f"{cwd:a}/{name}/{name}.{chrom}"
outs += [f"{prefix}.annot.gz", f"{prefix}.{ldscore_ext}", f"{prefix}.l2.M"]
return outs
input: input_files, group_by = N_targets + 2, group_with = "input_chroms"
output: chrom_outputs(input_chroms[_index])
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'
# ----------------------------------------------------------------------------
# Step A: write the requested .annot files for this chromosome.
# ----------------------------------------------------------------------------
R: expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
library(data.table)
clean_chr <- function(x) as.numeric(gsub("^chr", "", x))
process_range_data <- function(data, chr_value) {
data$chr <- clean_chr(data$chr)
data <- data[data$chr == chr_value,]
if (nrow(data) == 0) return(NULL)
expanded <- lapply(seq_len(nrow(data)), function(j) {
row <- data[j,]
pos_seq <- seq(row$start, row$end - 1)
result <- data.frame(chr = rep(row$chr, length(pos_seq)), pos = pos_seq)
if (ncol(data) > 3) {
for (col in 4:ncol(data))
result[[names(data)[col]]] <- rep(row[[col]], length(pos_seq))
}
result
})
unique(rbindlist(expanded))
}
process_annotation <- function(target_anno, ref_anno, score_column_value) {
target_anno <- as.data.frame(target_anno)
ref_anno <- as.data.frame(ref_anno)
target_anno$chr <- clean_chr(target_anno$chr)
ref_anno$CHR <- clean_chr(ref_anno$CHR)
chr_value <- unique(ref_anno$CHR)
anno_scores <- rep(0, nrow(ref_anno))
match_pos <- match(target_anno$pos, ref_anno$BP)
valid_pos <- as.numeric(na.omit(match_pos))
if (score_column_value <= ncol(target_anno)) {
anno_scores[valid_pos] <- target_anno[[score_column_value]][!is.na(match_pos)]
} else {
anno_scores[valid_pos] <- 1
print("Warning: score column does not exist; setting scores to 1")
}
anno_scores
}
read_target_anno <- function(file_path, ref_anno) {
if (endsWith(file_path, "rds")) {
target_anno <- readRDS(file_path)
return(process_annotation(target_anno, ref_anno, ${score_column}))
}
target_anno <- fread(file_path)
if (${"TRUE" if is_range else "FALSE"}) {
names(target_anno)[1:3] <- c("chr", "start", "end")
target_anno <- process_range_data(target_anno, unique(ref_anno$CHR))
if (is.null(target_anno)) return(rep(0, nrow(ref_anno)))
} else {
names(target_anno)[1:2] <- c("chr", "pos")
}
process_annotation(target_anno, ref_anno, ${score_column})
}
# ---- Read reference annotation ----
ref_anno <- as.data.frame(fread(${_input[-2]:ar}))
if ("ANNOT" %in% colnames(ref_anno)) ref_anno <- ref_anno[, -which(colnames(ref_anno) == "ANNOT")]
# ---- Compute per-target annotation scores ----
target_files <- c(${",".join('"%s"' % str(p.absolute()) for p in _input[:-2])})
N_local <- length(target_files)
score_list <- lapply(target_files, read_target_anno, ref_anno = ref_anno)
emit_single_local <- ${"TRUE" if emit_single else "FALSE"}
emit_joint_local <- ${"TRUE" if emit_joint else "FALSE"}
use_print_snps_local <- ${"TRUE" if use_print_snps else "FALSE"}
bfile_prefix <- "${_input[-1]:na}"
# Reshape annot to match .bim panel for ldsc.py --l2 --print-snps
# (drop A1/A2/MAF, expand to .bim rows filling 0, take CM from .bim).
normalize_for_ldsc <- function(df) {
if (!use_print_snps_local) return(df)
df <- df[, !names(df) %in% c("A1", "A2", "MAF", "CM"), drop = FALSE]
annot_cols <- setdiff(names(df), c("CHR", "BP", "SNP"))
bim <- as.data.frame(fread(paste0(bfile_prefix, ".bim"), header = FALSE,
col.names = c("CHR", "SNP", "CM", "BP", "A1", "A2")))
bim$CHR <- as.character(bim$CHR); df$CHR <- as.character(df$CHR)
idx <- match(bim$SNP, df$SNP)
out <- data.frame(CHR = bim$CHR, BP = bim$BP, SNP = bim$SNP, CM = bim$CM,
stringsAsFactors = FALSE)
for (col in annot_cols) {
v <- rep(0, nrow(bim))
non_na <- !is.na(idx)
v[non_na] <- df[[col]][idx[non_na]]
out[[col]] <- v
}
out
}
# ---- Write N single-target .annot files (when requested) ----
if (emit_single_local) {
for (i in seq_len(N_local)) {
out_anno <- ref_anno
out_anno$ANNOT <- score_list[[i]]
out_anno <- normalize_for_ldsc(out_anno)
name <- paste0("${annotation_name}", "_single_", i)
out_path_gz <- file.path("${cwd:a}", name, paste0(name, ".${input_chroms[_index]}.annot.gz"))
out_path_tsv <- sub("\\.gz$", "", out_path_gz)
dir.create(dirname(out_path_gz), showWarnings = FALSE, recursive = TRUE)
fwrite(out_anno, out_path_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")
}
}
# ---- Optionally write joint .annot ----
if (emit_joint_local) {
joint_anno <- ref_anno
for (i in seq_len(N_local)) {
joint_anno[[paste0("ANNOT_", i)]] <- score_list[[i]]
}
joint_anno <- normalize_for_ldsc(joint_anno)
joint_name <- paste0("${annotation_name}", "_joint")
joint_out_gz <- file.path("${cwd:a}", joint_name, paste0(joint_name, ".${input_chroms[_index]}.annot.gz"))
joint_out_tsv <- sub("\\.gz$", "", joint_out_gz)
dir.create(dirname(joint_out_gz), showWarnings = FALSE, recursive = TRUE)
fwrite(joint_anno, joint_out_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")
}
# ----------------------------------------------------------------------------
# Step B: gzip all annot files. Uses expand="$[ ]" so bash ${var} survives.
# ----------------------------------------------------------------------------
bash: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
set -e
annots=()
if [ "$[str(emit_single)]" = "True" ]; then
for i in $(seq 1 $[N_targets]); do
annots+=("$[cwd:a]/$[annotation_name]_single_$i/$[annotation_name]_single_$i.$[input_chroms[_index]].annot")
done
fi
if [ "$[str(emit_joint)]" = "True" ]; then
annots+=("$[cwd:a]/$[annotation_name]_joint/$[annotation_name]_joint.$[input_chroms[_index]].annot")
fi
for a in "${annots[@]}"; do
gzip -f "$a"
done
# ----------------------------------------------------------------------------
# Step C: run polyfun's LD-score computation for each emitted annotation file.
# ----------------------------------------------------------------------------
bash: expand = "$[ ]", stderr = f'{_output[1]}.stderr', stdout = f'{_output[1]}.stdout'
set -e
chrom="$[input_chroms[_index]]"
run_polyfun() {
local annot="$1"
local out_prefix="$2"
if [ "$[str(use_print_snps)]" = "True" ]; then
$[python_exec] $[polyfun_path]/ldsc.py \
--print-snps $[snp_list] \
$[ld_window_flag] $[ld_window_param] \
--out "$out_prefix" \
--bfile $[_input[-1]:nar] \
--yes-really \
--annot "$annot" \
--l2
else
$[python_exec] $[polyfun_path]/compute_ldscores.py \
--annot "$annot" \
--bfile $[_input[-1]:nar] \
$[ld_window_flag] $[ld_window_param] \
--out "${out_prefix}.$[ldscore_ext]" \
--allow-missing
fi
}
if [ "$[str(emit_single)]" = "True" ]; then
for i in $(seq 1 $[N_targets]); do
name="$[annotation_name]_single_$i"
annot="$[cwd:a]/$name/$name.$chrom.annot.gz"
prefix="$[cwd:a]/$name/$name.$chrom"
run_polyfun "$annot" "$prefix"
done
fi
if [ "$[str(emit_joint)]" = "True" ]; then
name="$[annotation_name]_joint"
annot="$[cwd:a]/$name/$name.$chrom.annot.gz"
prefix="$[cwd:a]/$name/$name.$chrom"
run_polyfun "$annot" "$prefix"
fi
# ----------------------------------------------------------------------------
# Step D: write .l2.M and .l2.M_5_50 files for each emitted annotation directory.
# ----------------------------------------------------------------------------
R: expand = "${ }", stderr = f'{_output[2]}.stderr', stdout = f'{_output[2]}.stdout'
suppressPackageStartupMessages({ library(data.table); library(dplyr) })
use_print_snps <- ${str(use_print_snps).upper()}
chrom <- "${input_chroms[_index]}"
# Look up .frq file under frqfile_dir, using plink_name + chrom (matches cell 25).
frq_file <- file.path("${frqfile_dir}", paste0("${plink_name}", chrom, ".frq"))
has_frq <- file.exists(frq_file)
frq_dt <- if (has_frq) fread(frq_file)[, .(SNP, MAF)] else NULL
write_M_files <- function(annot_path, ldscore_path, m_path) {
if (use_print_snps && file.exists(m_path) && file.exists(paste0(m_path, "_5_50"))) {
cat("M files already exist for", m_path, "\n"); return(invisible())
}
ldscore_dt <- if (endsWith(ldscore_path, ".parquet")) {
suppressPackageStartupMessages(library(arrow)); arrow::read_parquet(ldscore_path)
} else fread(ldscore_path)
annot_dt <- fread(annot_path)
annot_filtered <- annot_dt[annot_dt$SNP %in% ldscore_dt$SNP, ]
merged <- if (has_frq) merge(annot_filtered, frq_dt, by = "SNP", all.x = TRUE) else annot_filtered
std_cols <- c("CHR", "SNP", "BP", "CM", "A1", "A2", if (has_frq) "MAF")
annot_cols <- setdiff(names(merged), std_cols)
if (length(annot_cols) == 0L) { merged[, ANNOT := 1L]; annot_cols <- "ANNOT" }
M <- merged[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]
writeLines(paste(as.numeric(M), collapse = " "), m_path)
if (has_frq) {
common <- merged[!is.na(MAF) & MAF > 0.05, ]
M5 <- common[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]
writeLines(paste(as.numeric(M5), collapse = " "), paste0(m_path, "_5_50"))
}
}
targets <- c()
if (${"TRUE" if emit_single else "FALSE"}) {
for (i in seq_len(${N_targets})) {
targets <- c(targets, paste0("${annotation_name}", "_single_", i))
}
}
if (${"TRUE" if emit_joint else "FALSE"}) {
targets <- c(targets, paste0("${annotation_name}", "_joint"))
}
for (name in targets) {
annot_path <- file.path("${cwd:a}", name, paste0(name, ".", chrom, ".annot.gz"))
ldscore_path <- file.path("${cwd:a}", name, paste0(name, ".", chrom, ".${ldscore_ext}"))
m_path <- file.path("${cwd:a}", name, paste0(name, ".", chrom, ".l2.M"))
write_M_files(annot_path, ldscore_path, m_path)
}
Calculate Functional Enrichment using Annotations#
[get_heritability]
# Per-trait sLDSC regression via polyfun. Fans out across target_anno_dirs:
# each (trait, target_dir) pair becomes one polyfun invocation. Outputs go to
# <cwd>/<basename(target_dir)>/<trait>.{results,log,part_delete}.
#
# `target_anno_dirs` is the list produced by [make_annotation_files_ldscore]:
# typically N _single_<i> directories plus optionally one _joint directory.
#
# --- about the ".results" Category column and the "_0 / _1" suffix ---
# Each (trait, target_dir) pair is ONE polyfun call; its `ldsc.py --ref-ld-chr`
# always gets exactly two LD-score sources, in this order:
# "<target_dir>/<target>." (index 0) , "<baseline_dir>/<baseline>" (index 1)
# With --overlap-annot, every annotation column in the .results "Category" is
# named <ldscore_column_name>_<ref-ld-index>:
# index 0 = the target file -> "ANNOT_0" (no-snplist; compute_ldscores.py keeps the annot col name)
# -> "L2_0" (snplist + single annot; ldsc.py hard-codes "L2", see below)
# -> "ANNOT_1_0","ANNOT_2_0" (no-snplist joint dir, N>=2 annot cols)
# -> "ANNOT_1L2_0","ANNOT_2L2_0" (snplist joint dir, N>=2 -> "<name>L2")
# index 1 = the baseline file -> "base_1","Coding_UCSC_1", ... (the 97 baseline annots)
# So in this pipeline the suffix is only ever 0 (target) or 1 (baseline); it would
# continue 0,1,2,... only if you handed `ldsc.py --ref-ld-chr` more than two sources.
# (Why ANNOT_0 vs L2_0: see the [make_annotation_files_ldscore] header — ldsc.py's
# "n_annot == 1 -> column name 'L2'" quirk vs compute_ldscores.py keeping the annot
# column name.) [postprocess] auto-detects the target Category; if you instead pass
# --target-categories, the names must match this column exactly.
#
parameter: target_anno_dirs = paths()
parameter: all_traits = []
import os
with open(all_traits_file, 'r') as f:
trait_paths = [os.path.join(sumstat_dir, line.strip()) for line in f if line.strip()]
# Build (trait, target_dir) Cartesian product as parallel flat lists.
input_list = []
target_meta = []
for td in target_anno_dirs:
for t in trait_paths:
input_list.append(t)
target_meta.append(str(td))
input: input_list, group_by = 1, group_with = "target_meta"
output: f"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.log", \
f"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.results"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "${ }"
target_dir="${target_meta[_index]}"
target_name="$(basename ${target_meta[_index]})"
trait="$(basename ${_input[0]})"
output_dir="${cwd:a}/$target_name"
mkdir -p "$output_dir"
# MAF cutoff handling. Only 0 (disabled) or 0.05 (sLDSC default) are supported;
# other values would require recomputing LD scores at that cutoff.
frq_file_check="${frqfile_dir}/${plink_name}22.frq"
if [ "${maf_cutoff}" = "0" ] || [ "${maf_cutoff}" = "0.0" ]; then
echo "maf_cutoff = 0: skipping MAF filter (--not-M-5-50)"
frq_option="--not-M-5-50"
elif [ "${maf_cutoff}" = "0.05" ]; then
if [ -f "$frq_file_check" ]; then
echo "maf_cutoff = 0.05: using --frqfile-chr (MAF > 5%)"
frq_option="--frqfile-chr ${frqfile_dir}/${plink_name}"
else
echo "ERROR: maf_cutoff=0.05 requires .frq files for the reference panel,"
echo " but none found at ${frqfile_dir}/${plink_name}*.frq."
echo " Provide .frq files in frqfile_dir, or set maf_cutoff=0 (NOT recommended)."
exit 1
fi
else
echo "ERROR: maf_cutoff=${maf_cutoff} is not supported. Only 0 (no filter) or"
echo " 0.05 (sLDSC default) are accepted. Other values would require"
echo " recomputing LD scores at that cutoff."
exit 1
fi
run_ldsc() {
local extra_args="$1"
${python_exec} ${polyfun_path}/ldsc.py \
--h2 ${sumstat_dir}/$trait \
--ref-ld-chr "$target_dir/$target_name.","${baseline_ld_dir}/${baseline_name}" \
--out "$output_dir/$trait" \
--overlap-annot \
--w-ld-chr ${weights_dir}/${weight_name} \
$frq_option \
--print-coefficients \
--print-delete-vals \
--n-blocks ${n_blocks} \
$extra_args
}
run_ldsc ""
log_file="$output_dir/$trait.log"
# FloatingPointError retry ladder (preserved from original): 30 -> 20 -> 10
for max in 30 20 10; do
if [ -f "$log_file" ] && grep -q "FloatingPointError\|invalid value encountered in sqrt" "$log_file"; then
echo "FloatingPointError detected, retrying with --chisq-max $max..."
run_ldsc "--chisq-max $max"
else
break
fi
done
if [ -f "$log_file" ] && grep -q "FloatingPointError\|invalid value encountered in sqrt" "$log_file"; then
echo "ERROR: FloatingPointError persists for trait $trait at target $target_name even with --chisq-max 10"
echo "This trait may have severe numerical instability issues in the summary statistics."
fi
[munge_sumstats_polyfun]
parameter: sumstats = path
parameter: n = 0
parameter: min_info = 0.6
parameter: min_maf = 0.001
parameter: keep_hla = False
parameter: chi2_cut = 30
input: sumstats
output: f"{_input:n}.munged.parquet"
bash: expand=True, stderr=f'{_output:nn}.stderr', stdout=f'{_output:nn}.stdout'
{python_exec} {polyfun_path}/munge_polyfun_sumstats.py \
--sumstats {_input} \
--out {_output} \
{'--n {}'.format(n) if n>0 else ''} \
{'--min-info {}'.format(min_info)} \
{'--min-maf {}'.format(min_maf)} \
{'--chi2-cutoff {}'.format(chi2_cut)} \
{'--keep-hla' if keep_hla else ''} \
--remove-strand-ambig
[postprocess]
# Post-processing of polyfun outputs via pecotmr::sldsc_postprocessing_pipeline.
# Reads .results / .log / .part_delete for all traits in `traits_file`, both
# single-target and (when present) joint-target runs, computes Gazal-style
# tau*, EnrichStat with back-solved jackknife SE, and runs the default
# DerSimonian-Laird random-effects meta across all supplied traits. Writes
# one RDS containing per-trait tables and three meta tables (tau*, E, EnrichStat).
parameter: traits_file = path() # text file: one trait sumstats filename per line
parameter: heritability_cwd = path() # parent directory of [get_heritability] outputs (contains <annotation_name>_single_<i>/ subdirs and optionally <annotation_name>_joint/)
parameter: target_categories = [] # target annotation names. Auto-detected from the joint-run results if empty.
parameter: target_categories_label = [] # optional display names, same order as target_categories;
# when given, every "target" column / tau*-block colname in
# the output RDS is renamed to these (params$target_categories
# holds the labels, params$target_categories_orig the originals).
parameter: target_anno_dir = path() # directory of target .annot.gz files used for sd_C and binary detection (typically the joint dir, since it carries all target columns)
input: traits_file
output: f"{cwd:a}/{annotation_name}.sldsc_postprocess.rds"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
library(pecotmr)
traits <- readLines("${traits_file}")
target_cats <- c(${",".join('"%s"' % c for c in target_categories)})
target_lab <- c(${",".join('"%s"' % c for c in target_categories_label)})
# Auto-detect single-target and joint-target output directories.
her_root <- "${heritability_cwd}"
all_subdirs <- list.dirs(her_root, recursive = FALSE)
single_pattern <- paste0("^", "${annotation_name}", "_single_([0-9]+)$")
joint_name <- paste0("${annotation_name}", "_joint")
single_dirs <- all_subdirs[grepl(single_pattern, basename(all_subdirs))]
single_indices <- as.integer(sub(single_pattern, "\\1", basename(single_dirs)))
single_dirs <- single_dirs[order(single_indices)]
joint_dir <- file.path(her_root, joint_name)
has_joint <- dir.exists(joint_dir)
message(sprintf("Detected %d single-target dirs%s",
length(single_dirs),
if (has_joint) "; joint-target dir present" else "; no joint-target dir"))
# Build per-trait prefix maps. Each trait's polyfun output is at <dir>/<trait>
# (polyfun appends .results / .log / .part_delete).
trait_single_prefixes <- lapply(traits, function(t) file.path(single_dirs, t))
names(trait_single_prefixes) <- traits
if (has_joint) {
trait_joint_prefix <- setNames(file.path(joint_dir, traits), traits)
} else {
trait_joint_prefix <- setNames(rep(NA_character_, length(traits)), traits)
}
res <- sldsc_postprocessing_pipeline(
trait_single_prefixes = trait_single_prefixes,
trait_joint_prefix = trait_joint_prefix,
target_anno_dir = "${target_anno_dir}",
frqfile_dir = "${frqfile_dir}",
plink_name = "${plink_name}",
maf_cutoff = ${maf_cutoff},
target_categories = if (length(target_cats) > 0) target_cats else NULL,
target_labels = if (length(target_lab) > 0) target_lab else NULL
)
saveRDS(res, "${_output[0]}")
message("S-LDSC post-processing complete; results written to ${_output[0]}")
[meta_subset]
# Optional: re-run random-effects meta on a user-defined subset of traits, using
# the cached per-trait standardized results from [postprocess]. No regression rerun.
parameter: postprocess_rds = path() # output of [postprocess]
parameter: subset_traits_file = path() # text file: one trait id per line, subset of those passed to [postprocess]
parameter: subset_name = str # label used in the output filename
parameter: target_categories = [] # target annotation names to meta on; if empty, uses all from postprocess output
# If [postprocess] was run with --target-categories-label, the cached RDS already
# carries the display names (params$target_categories = the labels), so leave
# --target-categories empty here (or pass the labels, not the original ANNOT_* names).
input: postprocess_rds, subset_traits_file
output: f"{cwd:a}/{annotation_name}.{subset_name}.meta.rds"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
library(pecotmr)
res <- readRDS("${postprocess_rds}")
subset_traits <- readLines("${subset_traits_file}")
target_cats <- c(${",".join([f'"{c}"' for c in target_categories])})
if (length(target_cats) == 0) target_cats <- res$params$target_categories
subset_per_trait <- res$per_trait[subset_traits]
# Map wide names (tau_star_single/joint) to bare names meta_sldsc_random expects.
view_single <- pecotmr:::.sldsc_view_for_meta(subset_per_trait, "single")
view_joint <- pecotmr:::.sldsc_view_for_meta(subset_per_trait, "joint")
out <- list(
tau_star_single = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, "tau_star")), target_cats),
tau_star_joint = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_joint, c, "tau_star")), target_cats),
enrichment = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, "enrichment")), target_cats),
enrichstat = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, "enrichstat")), target_cats)
)
saveRDS(out, "${_output[0]}")
message("Subset meta complete; results written to ${_output[0]}")