Stratified LD Score Regression#
This notebook implements an S-LDSC pipeline for LD score computation and functional enrichment analysis.
Important: the S-LDSC implementation comes from the polyfun package, not the original LDSC from bulik/ldsc GitHub repo.
The purpose of this pipeline is to use LD Score Regression (LDSC) to analyze the heritability and enrichment of genome annotations across GWAS traits. By integrating genome annotation files and GWAS summary statistics, this pipeline allows single tau analysis (individual annotation contributions) and joint tau analysis (independent contributions of multiple annotations after removing shared effects).
The pipeline integrates GWAS summary statistics, annotation data, and LD reference panel data to compute functional enrichment for each epigenomic annotation the user provides, using the S-LDSC model. We first give an introduction, setup instructions, and minimal working examples; the workflow code, which can be run with SoS on any data, appears at the end.
A brief review on Stratified LD score regression#
Here I briefly review LD Score Regression and what it is used for. For more in-depth information on LD Score Regression please read the following three papers:
“LD Score regression distinguishes confounding from polygenicity in genome-wide association studies” by Bulik-Sullivan et al. (2015)
“Partitioning heritability by functional annotation using genome-wide association summary statistics” by Finucane et al. (2015)
“Linkage disequilibrium-dependent architecture of human complex traits shows action of negative selection” by Gazal et al. (2017)
As stated in Bulik-Sullivan et al. (2015), confounding factors and polygenic effects can both inflate test statistics, and most existing methods cannot distinguish inflation from confounding bias from a true polygenic signal. LD Score Regression (LDSC) identifies the relative contribution of confounding versus polygenicity directly from GWAS summary statistics.
The approach regresses the test statistics of SNPs from GWAS summary statistics on their Linkage Disequilibrium (LD) scores. Variants in LD with a “causal” variant show an elevation in test statistics in association analysis proportional to their LD (measured by \(r^2\)) with the causal variant within a certain window size (e.g. 1 cM, 1 kb). In contrast, inflation from confounders such as population stratification arising purely from genetic drift will not correlate with LD. For a polygenic trait, SNPs with a high LD score will have more significant \(\chi^2\) statistics on average than SNPs with a low LD score. Thus, if we regress the \(\chi^2\) statistics from GWAS against LD Score, the intercept minus one estimates the mean contribution of confounding bias to the inflation in the test statistics. This regression model is known as LD Score regression.
LDSC model#
Under a polygenic assumption, in which effect sizes for variants are drawn independently from distributions with variance proportional to \(1/(p(1-p))\) where \(p\) is the minor allele frequency (MAF), the expected \(\chi^2\) statistic of variant \(j\) is

\[
E[\chi^2_j] \;=\; \frac{N h^2}{M}\,\ell_j \;+\; N a \;+\; 1 \tag{1}
\]
where \(N\) is the sample size; \(M\) is the number of SNPs, so that \(h^2/M\) is the average heritability per SNP; \(a\) measures the contribution of confounding biases such as cryptic relatedness and population stratification; and \(\ell_j = \sum_k r^2_{jk}\) is the LD Score of variant \(j\), which measures the amount of genetic variation tagged by \(j\). A full derivation is given in the Supplementary Note of Bulik-Sullivan et al. (2015); an alternative derivation appears in the Supplementary Note of Zhu and Stephens (2017) AoAS.
Equation (1) shows that LD Score regression can compute SNP-based heritability for a phenotype from GWAS summary statistics alone, without requiring individual-level genotype data as REML and related methods do.
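To make Equation (1) concrete, here is a toy simulation — not the actual ldsc estimator, which uses heteroskedasticity-aware regression weights and block jackknife; all numbers are made up — showing that a plain least-squares fit of \(\chi^2\) on LD score recovers \(h^2\) from the slope and \(1 + Na\) from the intercept:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, h2, a = 1_000_000, 50_000, 0.5, 1e-6   # SNPs, sample size, h2, confounding

ell = rng.gamma(shape=2.0, scale=50.0, size=M)   # toy LD scores (mean ~100)
mean_chi2 = N * h2 / M * ell + N * a + 1         # Equation (1)
chi2 = mean_chi2 * rng.chisquare(df=1, size=M)   # noisy chi-square statistics

# OLS of chi^2 on LD score; slope estimates N*h2/M, intercept estimates 1 + N*a
slope, intercept = np.polyfit(ell, chi2, deg=1)
print("h2 estimate:", slope * M / N)   # close to 0.5
print("intercept:  ", intercept)       # close to 1 + N*a = 1.05
```

The intercept-minus-one reading of confounding described above corresponds to `intercept - 1` here.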
Stratified LDSC#
Heritability is the proportion of phenotypic variation that is due to variation in genetic values, and it can also be partitioned over disjoint or overlapping categories of SNPs.
Stratified LD Score Regression (S-LDSC) partitions heritability by leveraging both LD-score information and SNPs that have not reached genome-wide significance. S-LDSC exploits the fact that the \(\chi^2\) statistic for a given SNP reflects the cumulative effects of all SNPs tagged by it: in regions of high LD, the focal SNP captures the contribution of a group of nearby SNPs.
S-LDSC declares an annotation enriched for heritability if SNPs with high LD to that annotation have higher \(\chi^2\) statistics than SNPs with low LD to it.
Let \(a_{jC}\) denote the value of annotation \(C\) at SNP \(j\):
Binary annotation (e.g. an indicator for “in enhancer”, “in exon”, “in cell-type-specific peak”): \(a_{jC} \in \{0, 1\}\).
Continuous annotation (e.g. gene-specificity score, conservation score, continuous epigenomic signal): \(a_{jC} \in \mathbb{R}\).
Under a polygenic model the per-SNP heritability for SNP \(j\) is

\[
\mathrm{Var}(\beta_j) \;=\; \sum_C \tau_C\, a_{jC}
\]

and the expected \(\chi^2\) statistic of SNP \(j\) is

\[
E[\chi^2_j] \;=\; N \sum_C \tau_C\, \ell(j, C) \;+\; N a \;+\; 1 \tag{2}
\]
where \(\ell(j, C) = \sum_k a_{kC}\, r^2_{jk}\) is the partitioned LD score of SNP \(j\) with respect to annotation \(C\), and \(a\) measures confounding bias. Equation (2) allows joint estimation of all \(\tau_C\) via a (computationally simple) multiple regression of \(\chi^2_j\) against \(\ell(j, C)\).
Interpretation of \(\tau_C\):
Binary \(C\): \(\tau_C\) is the additive increase in per-SNP heritability for SNPs in category \(C\), on top of the contributions from any other annotations they belong to.
Continuous \(C\): \(\tau_C\) is the additive change in per-SNP heritability per unit increase in the value of annotation \(C\).
For application to real data and comparisons to other methods, see the three papers cited at the top of this notebook.
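A minimal numerical sketch of the joint estimation behind Equation (2), with two hypothetical annotations and made-up \(\tau\) values (again not the real S-LDSC estimator, which uses jackknife blocks and regression weights):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 200_000, 20_000
tau_true = np.array([2e-7, 5e-8])   # made-up per-annotation coefficients

# Toy partitioned LD scores ell(j, C) for two annotations
L = np.column_stack([rng.gamma(2.0, 40.0, M), rng.gamma(2.0, 20.0, M)])
mean_chi2 = N * (L @ tau_true) + 1.0         # Equation (2) with a = 0
chi2 = mean_chi2 * rng.chisquare(1, M)

# Joint estimation: regress chi^2 on an intercept and N * ell(j, C)
X = np.column_stack([np.ones(M), N * L])
coef, *_ = np.linalg.lstsq(X, chi2, rcond=None)
tau_hat = coef[1:]
print(tau_hat)   # close to tau_true
```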
Tau Estimation and Enrichment Analysis#
Goal: quantify the contribution of functional annotations to trait heritability and assess statistical significance, accounting for LD structure and (for continuous annotations) annotation scale.
The pipeline has two computational layers:
Regression layer — the S-LDSC regression itself, performed by the polyfun engine. We do not re-implement this.
Post-processing layer — standardization, differential per-SNP heritability, binary/continuous detection, and random-effects meta-analysis across traits. Implemented in the pecotmr R package (R/sldsc_wrapper.R).
The notation below tags each modeling quantity as (polyfun) or (pecotmr).
Notation#
For each annotation \(C\) we use:
\(\pi^{h^2}_C\) = proportion of trait heritability \(h^2_g\) assigned to annotation \(C\).
\(\pi^{M}_C\) = proportion of (effective) SNPs in annotation \(C\). For binary annotations this is \(M_C / M_{\mathrm{ref}}\); for continuous annotations it is the share of total annotation weight in \(C\).
Reference panel and MAF cutoff#
All LD-derived quantities — partitioned LD scores for the 97 baseline annotations and for our \(K\) target annotations, the LD-score-regression weights, allele frequencies, and the SNP set — are computed against our own LD reference panel. We do not mix in pre-computed quantities from external panels (e.g. 1000G); \(M_{\mathrm{ref}}\) throughout this notebook denotes the number of common SNPs in our panel.
By default we restrict to MAF \(> 5\%\) per the sLDSC recommendation: rare-variant LD is unstable and HapMap3-style regression weights are common-variant by construction. The cutoff is exposed as the SoS parameter maf_cutoff (default \(0.05\)); the regression, the standardized \(sd_C\), and \(M_{\mathrm{ref}}\) are all evaluated on the same MAF \(>\) cutoff SNP set. If allele-frequency files are not available the pipeline fails; the user must explicitly set maf_cutoff = 0 to opt out (not recommended).
Quantities from the regression layer (polyfun)#
Solving Equation (2) jointly across annotations, with 200-block genomic jackknife for inference, is performed by polyfun’s ldsc.py. From each polyfun run we obtain, per annotation:
\(\tau_C\) and its standard error — (polyfun).
\(\pi^{h^2}_C\) and \(\pi^{M}_C\) — (polyfun).
\(E_C = \pi^{h^2}_C / \pi^{M}_C\) and its standard error — (polyfun).
The p-value of the differential per-SNP heritability test (defined below) — (polyfun), computed internally with the full coefficient covariance matrix.
We also obtain, per run:
The total trait heritability \(h^2_g\) — (polyfun).
The 200-block jackknife delete-values of \(\tau_C\) — (polyfun).
Quantities from the post-processing layer (pecotmr)#
From the polyfun outputs above plus our reference panel, the post-processing layer computes:
\(sd_C\) — per-annotation standard deviation over MAF \(>\) cutoff SNPs — (pecotmr: compute_sldsc_annot_sd).
\(M_{\mathrm{ref}}\) — reference SNP count at the MAF cutoff — (pecotmr: compute_sldsc_M_ref).
Whether each annotation is binary or continuous — (pecotmr: is_binary_sldsc_annot).
\(\tau^*_C\) point estimate and per-block \(\tau^*_C\) — (pecotmr: standardize_sldsc_trait).
EnrichStat point estimate and its standard error (formula below) — (pecotmr: standardize_sldsc_trait).
DerSimonian-Laird random-effects meta-analysis of \(\tau^*_C\), \(E_C\), or EnrichStat across traits — (pecotmr: meta_sldsc_random).
The top-level entry point pecotmr::sldsc_postprocessing_pipeline orchestrates all of the above.
Standardized tau (\(\tau^*\)) — (pecotmr)#
\(\tau_C\) has units that depend on the scale of the annotation and on the total heritability of the trait, so raw \(\tau\) is not directly comparable across annotations or across traits. We compute the standardized version (Gazal et al. 2017)

\[
\tau^*_C \;=\; \frac{M_{\mathrm{ref}}\, sd_C}{h^2_g}\, \tau_C
\]
interpreted as the additive change in per-SNP heritability associated with a 1 standard deviation increase in annotation \(C\), divided by the average per-SNP heritability across all SNPs. \(\tau^*_C\) is dimensionless and comparable across annotations and across traits. In a joint multi-annotation regression it is the independent contribution of annotation \(C\) after controlling for overlapping effects of the others.
Here \(sd_C\) is the standard deviation of annotation \(C\) across reference SNPs (MAF \(>\) cutoff), \(M_{\mathrm{ref}}\) is the count of those SNPs, and \(h^2_g\) is the trait heritability. Applying the same scaling to each of the 200 jackknife blocks yields per-block \(\tau^*_C\) values; their sample variance gives the jackknife standard error

\[
SE^{\text{jackknife}}(\tau^*_C) \;=\; \sqrt{\tfrac{(B-1)^2}{B}\, \mathrm{Var}_b\!\left(\tau^*_{C,(b)}\right)}
\]

with \(B = 200\), used as the per-trait input to cross-trait meta-analysis.
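The scaling and the jackknife SE are only a few lines; the values below are hypothetical, and in the pipeline this is done by pecotmr::standardize_sldsc_trait rather than this sketch:

```python
import numpy as np

def tau_star(tau, sd_c, m_ref, h2_g):
    """tau* = M_ref * sd_C * tau / h2_g (Gazal et al. 2017)."""
    return m_ref * sd_c * tau / h2_g

def jackknife_se(theta_blocks):
    """SE = sqrt((B-1)^2 / B * Var_b(theta_(b))) over B delete-block values."""
    b = len(theta_blocks)
    return np.sqrt((b - 1) ** 2 / b * np.var(theta_blocks, ddof=1))

# Hypothetical inputs: 5M reference SNPs, annotation SD 0.21, trait h2 0.34
ts = tau_star(3.1e-9, 0.21, 5_000_000, 0.34)

# Fake per-block tau* draws standing in for the 200 jackknife delete-values
blocks = tau_star(3.1e-9 + 1e-10 * np.random.default_rng(2).standard_normal(200),
                  0.21, 5_000_000, 0.34)
se = jackknife_se(blocks)
```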
Differential per-SNP heritability (“EnrichStat”) — (polyfun + pecotmr)#
To test whether the per-SNP heritability inside annotation \(C\) differs from outside it (Finucane et al. 2015), define the differential per-SNP heritability

\[
\text{EnrichStat}_C \;=\; \frac{h^2_C}{M_C} \;-\; \frac{h^2_g - h^2_C}{M_{\mathrm{ref}} - M_C},
\qquad h^2_C = \pi^{h^2}_C\, h^2_g .
\]

The point-estimate p-value of this test is computed by polyfun internally using the full coefficient covariance and reported as Enrichment_p. Its standard error is recovered from the reported p-value:

\[
SE(\text{EnrichStat}_C) \;=\; \frac{\lvert\text{EnrichStat}_C\rvert}{\Phi^{-1}(1 - p/2)} .
\]
This per-trait point + SE is the input to cross-trait meta-analysis.
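Back-solving an SE from a two-sided normal p-value is a one-liner; a stdlib sketch (the pipeline does this inside pecotmr, so the function name here is illustrative):

```python
from statistics import NormalDist

def se_from_p(estimate, p):
    """Recover SE from a two-sided p-value: SE = |estimate| / z, z = Phi^{-1}(1 - p/2)."""
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    return abs(estimate) / z

# Round trip: the recovered SE reproduces the original p-value
est, p = 0.8, 0.003
se = se_from_p(est, p)
p_back = 2.0 * (1.0 - NormalDist().cdf(abs(est) / se))
```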
Reporting: binary vs. continuous annotations — (pecotmr)#
The estimation machinery applies to both annotation types, but the headline quantity to report within each type differs.
For a binary annotation (e.g. enhancer indicator, exon, in/out of a cell-type peak), \(\pi^{M}_C = M_C / M_{\mathrm{ref}}\) has a direct interpretation and \(E_C\) reads as “the category explains \(E_C\)-fold more heritability than its share of SNPs.” The within-type headline quantities are therefore \(E_C\) and the EnrichStat p-value; \(\tau^*_C\) is reported alongside.
For a continuous annotation (e.g. gene-specificity score, conservation score, continuous epigenomic signal), \(E_C\) depends on the scale of the annotation: rescaling the annotation by a constant changes \(E_C\) even though the underlying biology is unchanged. The within-type headline quantities are therefore \(\tau^*_C\) and its p-value; \(E_C\) is reported alongside but should not be interpreted for continuous annotations.
The pipeline determines whether an annotation is binary by inspecting whether its values lie in \(\{0, 1\}\) and selects the appropriate within-type headline statistic automatically (pecotmr).
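The binary check is simply whether every annotation value is 0 or 1. A sketch of the described behaviour (the actual implementation lives in pecotmr::is_binary_sldsc_annot; its exact details are assumptions here):

```python
import numpy as np

def is_binary_annot(values):
    """True iff every value of the annotation is exactly 0 or 1."""
    v = np.asarray(values, dtype=float)
    return bool(np.isin(v, (0.0, 1.0)).all())

print(is_binary_annot([0, 1, 1, 0]))    # True  -> headline: E_C + EnrichStat p
print(is_binary_annot([0.2, 0.9, 1]))   # False -> headline: tau*_C + its p
```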
From the official LDSC tutorial (Partitioned Heritability from Continuous Annotations):
“Enrichment is (Prop. heritability) / (Prop. SNPs). These outputs make sense only for binary annotations. Do not try to interpret them for continuous annotations. Using --print-coefficients outputs the regression coefficients and corresponding standard errors and Z score for each annotation. These coefficients measure the additional contribution of one annotation to the model and are interpretable for both binary and continuous annotations.”

The pipeline always passes --print-coefficients to polyfun for this reason.
Cross-type comparison: always use \(\tau^*_C\) — (pecotmr)#
For an apple-to-apple comparison across binary and continuous annotations — ranking annotations on a single axis, meta-analyzing a mixed set, or reporting a leaderboard that pools both types — use \(\tau^*_C\). The standardization in Gazal et al. (2017) was designed for exactly this purpose: \(sd_C = \sqrt{p(1-p)}\) for a binary annotation (where \(p\) is the proportion in the category) and \(sd_C = \) empirical standard deviation for a continuous annotation, so the resulting \(\tau^*_C\) is dimensionless and has the same interpretation in both cases — additive change in per-SNP heritability per 1 SD increase in the annotation, normalized by the average per-SNP heritability. \(E_C\) does not have this property and must not be compared across types.
The pipeline emits both \(E_C\) and \(\tau^*_C\) for every annotation, with the binary/continuous flag, so callers can pick the right column for the comparison they are making.
Joint analysis — (polyfun runs the regression; pecotmr standardizes both modes)#
For joint analysis (multiple annotations fit together), both \(\tau\) and \(E\) are conditional on the other annotations in the model. We report joint \(\tau^*_C\) as the independent contribution of annotation \(C\) after controlling for the others. The annotation-prep step exposes two independent toggles, compute_single and compute_joint (both default True), so the user can produce the \(N\) single-target outputs, the joint output, or both in one invocation. With both defaults the post-processing layer reads all \(N+1\) regression outputs per trait and presents single + joint side-by-side. When the joint subset is decided after looking at single-target results (exploratory \(\rightarrow\) conditional workflow), the user runs the annotation-prep step a second time with compute_single=False on the curated subset.
Meta-Analysis across Traits (Random Effects) — (pecotmr)#
DerSimonian-Laird random-effects meta-analysis of per-annotation estimates across traits, implemented in pecotmr::meta_sldsc_random (which delegates the numerics to rmeta::meta.summaries(..., method = "random")):

\[
\hat\theta \;=\; \frac{\sum_i w_i\, \hat\theta_i}{\sum_i w_i},
\qquad w_i = \frac{1}{SE_i^2 + \hat\tau^2},
\]

where \(\hat\theta_i\) is the per-trait estimate, \(SE_i\) its standard error, and \(\hat\tau^2\) the DerSimonian-Laird estimate of the between-trait variance:
For \(\tau^*_C\) meta: \(SE_i\) is the jackknife SE from the per-block \(\tau^*_C\) values.
For \(E_C\) meta: \(SE_i\) is the polyfun-reported Enrichment_std_error.
For EnrichStat meta: \(SE_i\) is the back-solved SE from polyfun’s Enrichment_p.
For binary-annotation enrichment reporting we use a two-channel meta: the effect size and SE come from the meta on \(E_C\) (interpretable on the original enrichment-fold scale), while the p-value comes from the meta on EnrichStat (the appropriate hypothesis test). The pipeline produces a default meta over all supplied traits; users can re-run meta on any subset of traits without re-running the regression layer.
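For reference, the DerSimonian-Laird estimator the meta step relies on can be sketched directly (the pipeline delegates the numerics to rmeta::meta.summaries; this re-implementation and the numbers are for illustration only):

```python
import numpy as np

def dl_meta(theta, se):
    """DerSimonian-Laird random-effects meta-analysis: returns (estimate, SE)."""
    theta, se = np.asarray(theta, float), np.asarray(se, float)
    w = 1.0 / se**2                             # fixed-effect weights
    theta_fe = np.sum(w * theta) / np.sum(w)
    q = np.sum(w * (theta - theta_fe) ** 2)     # Cochran's Q
    k = len(theta)
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_re = 1.0 / (se**2 + tau2)                 # random-effects weights
    return np.sum(w_re * theta) / np.sum(w_re), np.sqrt(1.0 / np.sum(w_re))

# Hypothetical per-trait tau* estimates and jackknife SEs
est, meta_se = dl_meta([0.9, 1.4, 0.6], [0.3, 0.4, 0.25])
```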
Input#
1. Target Annotation File#
Purpose: Specifies the user-provided (“target”) genome annotation files. The pipeline supports both binary and continuous annotations; the type is auto-detected per annotation column.
Formats:
Text file (.txt) listing per-chromosome paths to annotation files. Annotation files can be .rds/.tsv/.txt.
Alternatively, files for specific chromosomes can be provided directly.
Multiple target annotations are supported in one input file (one column per annotation, prefixed path, path1, path2, …). Single-target and joint-target analyses are produced automatically in one pipeline pass.
Format (the score column is optional; if absent, score is set to 1):
is_range = False:
chr pos score
1   10001 1
1   10002 1
is_range = True:
chr start end score
1   10001 20001 1
1   30001 40001 1
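A stdlib sketch of reading the is_range = False layout with the optional score column defaulted to 1, as described above (the pipeline's own reader is part of the SoS workflow; this is only an illustration on an in-memory example):

```python
import csv
import io

# In-memory stand-in for an annotation file without the optional score column
text = "chr pos\n1 10001\n1 10002\n"
rows = list(csv.DictReader(io.StringIO(text), delimiter=" "))
for row in rows:
    row.setdefault("score", "1")   # score defaults to 1 when absent

print(rows[0])   # {'chr': '1', 'pos': '10001', 'score': '1'}
```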
2. Reference Annotation File (baseline-LD)#
Purpose: Provides the baseline annotations (typically the 97-annotation baseline-LD model from Gazal et al. 2017) in .annot.gz format for each chromosome. The baseline conditions every regression.
Formats:
Text file listing baseline annotation files for all chromosomes.
Alternatively, files for specific chromosomes can be provided directly.
3. Genome Reference File#
Purpose: PLINK-format .bed/.bim/.fam files for our LD reference panel, per chromosome. This is the panel against which all LD-derived quantities (target LD scores, baseline LD scores, regression weights, allele frequencies) must be computed. Do not mix files derived from different panels (e.g. 1000G vs ADSP).
Formats:
Text file listing per-chromosome reference files, or files for specific chromosomes.
4. SNP List#
Purpose: Specifies the SNPs to include in LDSC analysis (typically a HapMap3-style list).
Format: A list of rsids, one per line.
5. Allele Frequency Files (.frq, our panel)#
Purpose: PLINK .frq files for the reference panel, used to enforce the MAF cutoff. Required when maf_cutoff > 0 (default 0.05); the pipeline fails if missing unless maf_cutoff = 0 is explicitly set.
6. GWAS Summary Statistics#
Purpose: One munged sumstats file per trait, listed in a text file (all_traits_file). The pipeline runs the regression once per trait per single/joint mode.
Format:
CAD_META.filtered.sumstats.gz UKB.Lym.BOLT.sumstats.gz
Workflow Steps#
The pipeline is organized in three logical stages: (1) input preparation and the S-LDSC regression itself (handled by polyfun), (2) post-processing into standardized \(\tau^*\) and meta-analysis (handled by R wrappers in the pecotmr package), and (3) optional re-meta on user-defined trait subsets.
Step 1: Annotation Preparation and S-LDSC Regression (polyfun)#
Three SoS workflows wrap polyfun:
make_annotation_files_ldscore — converts user-provided target annotations into polyfun’s .annot.gz format and runs compute_ldscores.py against our reference panel. Two independent toggles control what is produced:
compute_single (default True): produce \(N\) single-target LD-score directories, one per target annotation in the input file.
compute_joint (default True): produce one joint LD-score directory bundling all \(N\) targets together (only emitted when \(N \geq 2\)).
Both defaults true means the typical Workflow A invocation produces everything in one pass. For Workflow B (exploratory single first, joint later on a curated subset), run this step twice: once with --no-compute-joint on the candidate set; later with --no-compute-single on the curated subset (passing a new annotation_file containing just those annotations). All outputs are cached on disk and reused across traits and polyfun runs via SoS’s output-timestamp check.
munge_sumstats_polyfun — preprocesses each GWAS into LDSC-compatible format.
get_heritability — for each trait, runs polyfun’s ldsc.py once per target_anno_dir supplied. Pass the list of single-target dirs (and the joint dir, if produced) as --target-anno-dirs. Output goes to <cwd>/<basename(target_dir)>/<trait>.{results,log,part_delete}. MAF \(>\) cutoff is enforced via --frqfile-chr; maf_cutoff accepts only 0 or 0.05.
Step 2: Post-Processing (pecotmr::sldsc_postprocessing_pipeline)#
A single R function call consumes all polyfun outputs for the run and produces the final tables:
Reads each polyfun output and extracts \(\tau\), \(E\), \(h^2_g\), EnrichStat p-value, and per-block jackknife \(\tau\) values.
Computes annotation \(sd_C\) and \(M_{\mathrm{ref}}\) over the same MAF \(>\) cutoff SNP set as the regression.
Standardizes \(\tau \to \tau^*\) for both single-tau and joint-tau modes, including the per-block versions for jackknife SE.
Auto-detects whether each annotation is binary or continuous and tags every output row accordingly.
Reports the number and names of baseline annotations encountered (via message()) for transparency.
Runs the default DerSimonian-Laird random-effects meta-analysis across all supplied traits, producing three meta tables: \(\tau^*\) (cross-type comparable), \(E\) (within-binary), and EnrichStat (within-type).
Outputs are returned as an R list with two top-level entries: per_trait (one tidy data frame per trait, single + joint estimates side-by-side per target) and meta (three tables, one per quantity, with rows = target annotations and columns = single/joint mean/SE/p plus an is_binary flag).
Step 3: Optional Subset Meta-Analysis (pecotmr::meta_sldsc_random)#
The default meta in Step 2 pools all traits the user supplied. To re-run the meta on a subset (e.g., neurodegenerative traits only, or autoimmune traits only) without re-running the regression layer:
res <- readRDS("sldsc_results.rds")
neuro <- c("AD_GWAX", "PD_meta", "ALS_meta")
meta_neuro_taustar <- pecotmr::meta_sldsc_random(
res$per_trait[neuro], category = "my_target_anno", quantity = "tau_star"
)
This step is light-weight and can be run interactively.
Output summary#
| Stage | Cached on disk | Recomputable from | Purpose |
|---|---|---|---|
| Target LD scores | per-annotation, once | annotation + reference panel | input to every regression |
| polyfun | yes | regression run | \(\tau\), \(E\), EnrichStat |
| Per-trait standardized table | yes (RDS) | polyfun outputs + \(sd_C\) + \(M_{\mathrm{ref}}\) | reporting + meta |
| Default meta tables | yes (RDS) | per-trait standardized | headline figures |
| Subset meta | re-run on demand | per-trait standardized | custom analyses |
MWE:#
1. make_annotation_files_ldscore#
The annotation file can be a txt file with columns #id and path1, path2, …; it can also be rds files separated by ‘,’.
1.1 single tau analysis, with one annotation as the input#
# case 1: txt file as input
sos run pipeline/sldsc_enrichment.ipynb make_annotation_files_ldscore \
--annotation_file data/polyfun/input/colocboost_test_annotation_path.txt \
--reference_anno_file data/polyfun/input/reference_annotation0.txt \
--genome_ref_file data/polyfun/input/genome_reference_bfile.txt \
--annotation_name test_colocboost \
--plink_name reference. \
--baseline_name annotations. \
--weight_name weights. \
--python_exec python \
--polyfun_path data/github/polyfun \
--cwd output/polyfun/ -j 22
Alternatively, we can use files for a specific chromosome instead of a txt list.
# single file format
sos run pipeline/sldsc_enrichment.ipynb make_annotation_files_ldscore \
--annotation_file data/polyfun/input/colocboost_test.tsv \
--reference_anno_file data/polyfun/example_annot0/annotations.1.annot.gz \
--genome_ref_file data/polyfun/example_data/reference.1.bed \
--annotation_name test_colocboost \
--plink_name reference. \
--baseline_name annotations. \
--weight_name weights. \
--python_exec python \
--polyfun_path data/github/polyfun \
--cwd output/polyfun/ --chromosome 1
1.2 joint tau#
With more than one annotation as the input.
--snp_list <file> is optional and orthogonal to single/joint: it restricts which SNP rows are written to the .l2.ldscore output (HM3-style restriction) — it does not change the LD-score values of retained SNPs (the r²-window still uses all .bim SNPs), nor .l2.M / .l2.M_5_50. With --snp_list the step runs polyfun’s ldsc.py --print-snps and emits .l2.ldscore.gz (without it: compute_ldscores.py → .l2.ldscore.parquet), and the target annot is normalized to the plink .bim. Downstream get_heritability / postprocess commands are unchanged — they just read whichever .l2.ldscore.{gz,parquet} exists; the regression runs on the intersection sumstats ∩ baseline ∩ weights ∩ target, so for the HM3 restriction to take full effect the baseline & weights LD scores should also be HM3-restricted.
Which polyfun script & what to align. --snp_list present → step 1 runs ldsc.py --l2 --print-snps (output .l2.ldscore.gz); absent → compute_ldscores.py (output .l2.ldscore.parquet). ldsc.py requires the target annot to line up with the plink .bim — same SNP set and same row order. The pipeline handles this internally (normalize_for_ldsc re-expands the annot onto the .bim rows, filling 0), but if you assemble inputs by hand, keep the plink .bim, .frq, target annot, and baseline annot all on the same reference panel with consistent SNP set/order: target and baseline annot row counts must match (polyfun hstacks them), and the .frq MAF filter (--frqfile-chr, m50) plus the by-SNP merge in get_heritability both assume a consistent panel. Mixing panels (e.g. 1000G plink with an ADSP-derived annot, or a .frq whose row count/order differs from the annot) is the most common source of wrong Prop_SNPs / Enrichment.
# Annotation prep: an annotation_file with multiple `path*` columns yields
# N single-target output directories plus 1 joint directory in one call.
# To do Workflow B step 1 (singles only), pass --no-compute-joint.
# To do Workflow B step 2 (joint only on a curated subset), pass --no-compute-single
# with a new annotation_file containing just the curated annotations.
sos run pipeline/sldsc_enrichment.ipynb make_annotation_files_ldscore \
--annotation_file data/quantile_qtl_annotation/AC_DeJager_eQTL.multi_target_example.txt \
--reference_anno_file data/reference_annotation.txt \
--genome_ref_file data/genome_reference_bfile.txt \
--snp_list data/1000G_EUR_Phase3_hg38/list.txt \
--annotation_name my_analysis \
--cwd output \
--chromosome 1
# Produces (with defaults: compute_single=TRUE, compute_joint=TRUE):
# output/my_analysis_single_1/ ... output/my_analysis_single_N/ (compute_single)
# output/my_analysis_joint/ (compute_joint, N>=2)
# Direct file-path mode: comma-separated annotation files (single-chromosome run).
# Same toggle semantics: compute_single (default TRUE) and compute_joint (default TRUE).
sos run pipeline/sldsc_enrichment.ipynb make_annotation_files_ldscore \
--annotation_file data/AC_DeJager_eQTL.unique_qr.rds,data/AC_DeJager_eQTL.shared_homo.rds \
--reference_anno_file data/example_anno/ABC_Road_GI_BRN.1.annot.gz \
--genome_ref_file data/plink_files/1000G.EUR.hg38.1.bed \
--snp_list data/1000G_EUR_Phase3_hg38/list.txt \
--annotation_name my_analysis \
--cwd output/mwe \
--chromosome 1
2. get_heritability#
# Run polyfun across all (trait, target_dir) pairs in one invocation.
# target_anno_dirs is the list of output dirs from [make_annotation_files_ldscore]:
# the N single-target dirs plus optionally the joint dir.
sos run pipeline/sldsc_enrichment.ipynb get_heritability \
--target-anno-dirs output/my_analysis_single_1 output/my_analysis_single_2 output/my_analysis_joint \
--all-traits-file data/all_traits.txt \
--sumstat-dir data/sumstats \
--baseline-ld-dir data/baselineLD_v2.2_ADSP \
--frqfile-dir data/ADSP/frq \
--weights-dir data/ADSP_weights \
--plink-name ADSP_chr \
--baseline-name baseline_chr \
--weight-name weights_chr \
--annotation-name my_analysis \
--cwd output/heritability \
--maf-cutoff 0.05
# (allm variant: use --maf-cutoff 0 — no MAF filter; polyfun then uses --not-M-5-50)
3. Post-processing (pecotmr) and meta-analysis#
The [postprocess] step reads all polyfun outputs under heritability_cwd
(which contains the \(N\) single-target subdirectories and optionally the
joint subdirectory) and calls pecotmr::sldsc_postprocessing_pipeline()
to produce per-trait standardized tables and the default random-effects
meta across all traits.
Use --target-categories-label (same order as --target-categories) to give the target annotations friendly names in the output — e.g. --target-categories ANNOT_1_0 ANNOT_2_0 --target-categories-label quantile_eQTL eQTL makes the target column read quantile_eQTL / eQTL instead of ANNOT_1_0 / ANNOT_2_0 (the original names are kept in params$target_categories_orig). Omit it to keep the polyfun .results names.
# Post-processing: read all polyfun outputs (single-target + joint, all traits)
# under heritability_cwd and run the default meta-analysis via
# pecotmr::sldsc_postprocessing_pipeline. Single and joint target subdirectories
# are auto-detected from <annotation_name>_single_<i> and <annotation_name>_joint.
# --target-categories: the joint-run target columns as they appear in the .results "Category"
# column. The 2-target joint dir here has columns ANNOT_1, ANNOT_2 (compute_ldscores.py keeps
# the annot col names) and polyfun appends "_0" (target = ref-ld file 0) -> "ANNOT_1_0", "ANNOT_2_0".
# (Single-target dirs would be "ANNOT_0"; with --snp_list/ldsc.py and a single annot it is "L2_0".)
# You can drop --target-categories entirely to auto-detect from the joint-run results.
# --target-categories-label (optional, same order as --target-categories): friendly display
# names; renames the "target" column in the output (originals kept in params$target_categories_orig).
sos run pipeline/sldsc_enrichment.ipynb postprocess \
--traits-file data/all_traits.txt \
--heritability-cwd output/heritability \
--target-categories ANNOT_1_0 ANNOT_2_0 \
--target-categories-label quantile_eQTL eQTL \
--target-anno-dir output/my_analysis_joint \
--frqfile-dir data/ADSP/frq \
--plink-name ADSP_chr \
--annotation-name my_analysis \
--cwd output/sldsc_postprocess \
--maf-cutoff 0.05
# (allm variant: use --maf-cutoff 0 — no MAF filter; polyfun then uses --not-M-5-50)
4. Optional: subset meta-analysis#
The default meta in step 3 pools all traits supplied to [postprocess]. Use [meta_subset] to re-run the meta on a user-defined trait subset (e.g., neurodegenerative traits only, autoimmune traits only) without re-running the regression or the per-trait standardization. The subset operates on the cached .sldsc_postprocess.rds output; it is light-weight and can be run interactively or in batch.
# Re-run random-effects meta on a subset of traits without recomputing the regression layer.
sos run pipeline/sldsc_enrichment.ipynb meta_subset \
--postprocess-rds output/sldsc_postprocess/my_analysis.sldsc_postprocess.rds \
--subset-traits-file data/neuro_traits.txt \
--subset-name neuro \
--target-categories ANNOT_1_0 ANNOT_2_0 \
--annotation-name my_analysis \
--cwd output/sldsc_postprocess
[global]
# Path to the work directory of the analysis.
parameter: cwd = path('output')
# Prefix for the analysis output
parameter: annotation_name = str
parameter: python_exec = "python" # e.g. "/home/you/.conda/envs/polyfun/bin/python"
parameter: polyfun_path = path # e.g. "/home/you/tools/polyfun"
# MAF cutoff for sLDSC. Default 0.05 per sLDSC recommendation (rare-variant LD is unstable
# and HapMap3-style regression weights are common-variant by construction).
# Set to 0 to opt out of MAF filtering (NOT recommended; only use if you understand the implications).
# Other values would require recomputing LD scores at that cutoff.
parameter: maf_cutoff = 0.05
# for make_annotation_files_ldscore workflow:
parameter: annotation_file = path()
parameter: reference_anno_file = path()
parameter: genome_ref_file = path() # with .bed
parameter: chromosome = []
parameter: snp_list = path()
parameter: ld_wind_kb = 0 # use kb if the value is provided
parameter: ld_wind_cm = 1.0 # default using ld_wind_cm
# for get_heritability workflow.
# Note: all LD-derived inputs (baseline LD scores, target LD scores, regression weights,
# allele frequencies) must be computed against the same reference panel as `genome_ref_file`.
# Do not mix files derived from different reference panels (e.g., 1000G vs ADSP).
parameter: all_traits_file = path() # txt file, each row contains all GWAS summary statistics name: e.g. CAD_META.filtered.sumstats.gz
parameter: sumstat_dir = path() # Directory containing GWAS summary statistics
parameter: target_anno_dir = path() # Directory containing target annotation files: output of ldscore
parameter: baseline_ld_dir = path() # Directory containing baseline LD score files (computed against our panel)
parameter: frqfile_dir = path() # Directory containing allele frequency files (.frq, our panel)
parameter: plink_name = "ADSP_chr"
parameter: weights_dir = path() # Directory containing LD weights (computed against our panel)
parameter: baseline_name = "baseline_chr" # Prefix of baseline annotation files
parameter: weight_name = "weights_chr" # Prefix of LD weights files
parameter: n_blocks = 200
# Number of threads
parameter: numThreads = 16
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
Make Annotation File#
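This step defaults to a cM-based LD window (`--ld-wind-cm`), which requires genetic-map positions in the plink .bim files (see the CM note in the step header below). A minimal pre-flight check, assuming standard six-column .bim files; `bim_has_cm` is our helper name, not part of polyfun:

```python
# Pre-flight check: does the plink .bim carry nonzero genetic-map (cM)
# positions? If not, --ld-wind-cm (the default) is meaningless and
# --ld-wind-kb should be used instead. bim_has_cm is our own helper.
import csv
import os
import tempfile

def bim_has_cm(bim_path):
    """Return True if any variant in the .bim has a nonzero cM position."""
    with open(bim_path) as f:
        for row in csv.reader(f, delimiter="\t"):
            # standard .bim columns: CHR, SNP, CM, BP, A1, A2
            if float(row[2]) != 0.0:
                return True
    return False

# Demo on a toy .bim whose cM column is zeroed out (hypothetical data):
with tempfile.NamedTemporaryFile("w", suffix=".bim", delete=False) as f:
    f.write("22\trs1\t0\t16050075\tA\tG\n22\trs2\t0\t16050115\tG\tA\n")
    toy_bim = f.name
print(bim_has_cm(toy_bim))  # False -> switch to --ld-wind-kb
os.unlink(toy_bim)
```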
[make_annotation_files_ldscore]
# Annotation preparation. Takes one annotation_file with N target annotations
# and produces, in one invocation, any combination of:
# - N single-target LD-score directories (when compute_single = TRUE, default)
# - 1 joint LD-score directory containing all N (when compute_joint = TRUE
# and N >= 2, default)
#
# Outputs per chromosome <chr>:
# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.annot.gz (i in 1..N, when compute_single)
# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.ldscore.{parquet|gz}
# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.M
# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.M_5_50 (when .frq present)
#
# <cwd>/<annotation_name>_joint/<annotation_name>_joint.<chr>.{...} (when compute_joint and N>=2)
#
# Workflows:
# - Workflow A ("all at once"): compute_single=TRUE, compute_joint=TRUE (defaults).
# Produces both, fits the case where you have already chosen the joint set.
# - Workflow B ("exploratory then conditional"):
# Step 1: compute_single=TRUE, compute_joint=FALSE.
# Run on N candidate annotations -> N single-target dirs.
# Inspect single-target results, identify K significant ones.
# Step 2: compute_single=FALSE, compute_joint=TRUE.
# Run on a NEW annotation_file with the K selected annotations
# -> 1 joint dir with the conditional model.
#
# --- snplist (--snp_list) vs no-snplist: which polyfun script, output format,
# column name, and the CM requirement ---
# --snp_list given -> ldsc.py --l2 --print-snps -> output .l2.ldscore.gz
# --snp_list absent -> compute_ldscores.py -> output .l2.ldscore.parquet
#
# LD-score column name (this is what becomes the .results "Category" in
# [get_heritability], with a "_<ref-ld-index>" suffix appended there):
# * compute_ldscores.py ALWAYS keeps the annot column name(s):
# single annot column "ANNOT" -> ldscore column "ANNOT"
# joint annot columns "ANNOT_1","ANNOT_2",... -> "ANNOT_1","ANNOT_2",...
# * ldsc.py --l2 has a quirk: with EXACTLY ONE annotation (n_annot == 1) it
# HARD-CODES the ldscore column name to "L2" and DROPS the annot's original
# column name. With >=2 annotations it uses "<annot_name>L2"
# ("ANNOT_1L2","ANNOT_2L2",...).
# => a single-target snplist run reports "L2_0" in .results, while a
# single-target no-snplist run reports "ANNOT_0". [postprocess] auto-
# detects either; only matters if you pass --target-categories explicitly.
#
# CM column requirement for snplist: ldsc.py --l2 --print-snps requires the
# target annot to (a) carry a "CM" (centimorgan) column and (b) line up with
# the plink .bim (same SNP set, same row order). This step handles both
# internally (normalize_for_ldsc: takes CM from the .bim 4th column, re-expands
# the annot onto the .bim rows, filling 0). Therefore the plink .bim files MUST
# carry genetic-map (cM) positions when using --ld-wind-cm (the default);
# if your .bim has 0 in the cM column, switch to --ld-wind-kb instead.
#
parameter: compute_single = True
parameter: compute_joint = True
parameter: score_column = 3 # 1-based column of the target annotation holding the score
parameter: is_range = False # True if the target annotation is chr/start/end ranges rather than per-position
import pandas as pd
import os
if not (compute_single or compute_joint):
raise ValueError("[make_annotation_files_ldscore] at least one of compute_single or compute_joint must be TRUE")
def adapt_file_path(file_path, reference_file):
reference_path = os.path.dirname(reference_file)
if os.path.isfile(file_path):
return file_path
file_name = os.path.basename(file_path)
if os.path.isfile(file_name):
return file_name
file_in_ref_dir = os.path.join(reference_path, file_name)
if os.path.isfile(file_in_ref_dir):
return file_in_ref_dir
file_prefixed = os.path.join(reference_path, file_path)
if os.path.isfile(file_prefixed):
return file_prefixed
raise FileNotFoundError(f"No valid path found for file: {file_path}")
# ---- Parse inputs and determine N ----
if (str(annotation_file).endswith(('rds', 'tsv', 'txt', 'tsv.gz', 'txt.gz')) and
str(reference_anno_file).endswith('annot.gz')):
# Case 1: direct file paths (single-chromosome run). Multiple target files separated by ','.
target_files_direct = str(annotation_file).split(',')
N_targets = len(target_files_direct)
target_names = [f"target_{i+1}" for i in range(N_targets)]
input_files = [[*target_files_direct, str(reference_anno_file), str(genome_ref_file)]]
if len(chromosome) > 0:
input_chroms = [int(x) for x in chromosome]
else:
input_chroms = [0]
else:
# Case 2: txt list with #id and one or more 'path' columns
target_files_df = pd.read_csv(annotation_file, sep="\t")
reference_files = pd.read_csv(reference_anno_file, sep="\t")
genome_ref_files = pd.read_csv(genome_ref_file, sep="\t")
target_files_df["#id"] = [x.replace("chr", "") for x in target_files_df["#id"].astype(str)]
reference_files["#id"] = [x.replace("chr", "") for x in reference_files["#id"].astype(str)]
genome_ref_files["#id"] = [x.replace("chr", "") for x in genome_ref_files["#id"].astype(str)]
path_columns = [c for c in target_files_df.columns if c.startswith('path')]
N_targets = len(path_columns)
target_names = path_columns[:] # 'path', 'path1', 'path2', ...
for col in path_columns:
target_files_df[col] = target_files_df[col].apply(lambda x: adapt_file_path(x, str(annotation_file)))
reference_files["path"] = reference_files["path"].apply(lambda x: adapt_file_path(x, str(reference_anno_file)))
genome_ref_files["path"] = genome_ref_files["path"].apply(lambda x: adapt_file_path(x, str(genome_ref_file)))
merged = target_files_df.merge(reference_files, on="#id").merge(genome_ref_files, on="#id")
if len(chromosome) > 0:
merged = merged[merged["#id"].isin([str(c) for c in chromosome])]
rows = merged.values.tolist()
input_chroms = [r[0] for r in rows]
input_files = [[*r[1:N_targets+1], r[-2], r[-1]] for r in rows]
# ---- Determine output format ----
use_print_snps = snp_list.is_file()
ldscore_ext = "l2.ldscore.gz" if use_print_snps else "l2.ldscore.parquet"
if ld_wind_kb > 0:
use_kb_window = True
ld_window_param = ld_wind_kb
ld_window_flag = "--ld-wind-kb"
else:
use_kb_window = False
ld_window_param = ld_wind_cm
ld_window_flag = "--ld-wind-cm"
emit_single = compute_single
emit_joint = compute_joint and N_targets >= 2
# ---- Build per-chromosome output list ----
def chrom_outputs(chrom):
outs = []
if emit_single:
for i in range(N_targets):
name = f"{annotation_name}_single_{i+1}"
prefix = f"{cwd:a}/{name}/{name}.{chrom}"
outs += [f"{prefix}.annot.gz", f"{prefix}.{ldscore_ext}", f"{prefix}.l2.M"]
if emit_joint:
name = f"{annotation_name}_joint"
prefix = f"{cwd:a}/{name}/{name}.{chrom}"
outs += [f"{prefix}.annot.gz", f"{prefix}.{ldscore_ext}", f"{prefix}.l2.M"]
return outs
input: input_files, group_by = N_targets + 2, group_with = "input_chroms"
output: chrom_outputs(input_chroms[_index])
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'
# ----------------------------------------------------------------------------
# Step A: write the requested .annot files for this chromosome.
# ----------------------------------------------------------------------------
R: expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
library(data.table)
clean_chr <- function(x) as.numeric(gsub("^chr", "", x))
process_range_data <- function(data, chr_value) {
data$chr <- clean_chr(data$chr)
data <- data[data$chr == chr_value,]
if (nrow(data) == 0) return(NULL)
expanded <- lapply(seq_len(nrow(data)), function(j) {
row <- data[j,]
pos_seq <- seq(row$start, row$end - 1)
result <- data.frame(chr = rep(row$chr, length(pos_seq)), pos = pos_seq)
if (ncol(data) > 3) {
for (col in 4:ncol(data))
result[[names(data)[col]]] <- rep(row[[col]], length(pos_seq))
}
result
})
unique(rbindlist(expanded))
}
process_annotation <- function(target_anno, ref_anno, score_column_value) {
target_anno <- as.data.frame(target_anno)
ref_anno <- as.data.frame(ref_anno)
target_anno$chr <- clean_chr(target_anno$chr)
ref_anno$CHR <- clean_chr(ref_anno$CHR)
chr_value <- unique(ref_anno$CHR)
anno_scores <- rep(0, nrow(ref_anno))
match_pos <- match(target_anno$pos, ref_anno$BP)
valid_pos <- as.numeric(na.omit(match_pos))
if (score_column_value <= ncol(target_anno)) {
anno_scores[valid_pos] <- target_anno[[score_column_value]][!is.na(match_pos)]
} else {
anno_scores[valid_pos] <- 1
print("Warning: score column does not exist; setting scores to 1")
}
anno_scores
}
read_target_anno <- function(file_path, ref_anno) {
if (endsWith(file_path, "rds")) {
target_anno <- readRDS(file_path)
return(process_annotation(target_anno, ref_anno, ${score_column}))
}
target_anno <- fread(file_path)
if (${"TRUE" if is_range else "FALSE"}) {
names(target_anno)[1:3] <- c("chr", "start", "end")
target_anno <- process_range_data(target_anno, unique(ref_anno$CHR))
if (is.null(target_anno)) return(rep(0, nrow(ref_anno)))
} else {
names(target_anno)[1:2] <- c("chr", "pos")
}
process_annotation(target_anno, ref_anno, ${score_column})
}
# ---- Read reference annotation ----
ref_anno <- as.data.frame(fread(${_input[-2]:ar}))
if ("ANNOT" %in% colnames(ref_anno)) ref_anno <- ref_anno[, -which(colnames(ref_anno) == "ANNOT")]
# ---- Compute per-target annotation scores ----
target_files <- c(${",".join('"%s"' % str(p.absolute()) for p in _input[:-2])})
N_local <- length(target_files)
score_list <- lapply(target_files, read_target_anno, ref_anno = ref_anno)
emit_single_local <- ${"TRUE" if emit_single else "FALSE"}
emit_joint_local <- ${"TRUE" if emit_joint else "FALSE"}
use_print_snps_local <- ${"TRUE" if use_print_snps else "FALSE"}
bfile_prefix <- "${_input[-1]:na}"
# Reshape annot to match .bim panel for ldsc.py --l2 --print-snps
# (drop A1/A2/MAF, expand to .bim rows filling 0, take CM from .bim).
normalize_for_ldsc <- function(df) {
if (!use_print_snps_local) return(df)
df <- df[, !names(df) %in% c("A1", "A2", "MAF", "CM"), drop = FALSE]
annot_cols <- setdiff(names(df), c("CHR", "BP", "SNP"))
bim <- as.data.frame(fread(paste0(bfile_prefix, ".bim"), header = FALSE,
col.names = c("CHR", "SNP", "CM", "BP", "A1", "A2")))
bim$CHR <- as.character(bim$CHR); df$CHR <- as.character(df$CHR)
idx <- match(bim$SNP, df$SNP)
out <- data.frame(CHR = bim$CHR, BP = bim$BP, SNP = bim$SNP, CM = bim$CM,
stringsAsFactors = FALSE)
for (col in annot_cols) {
v <- rep(0, nrow(bim))
non_na <- !is.na(idx)
v[non_na] <- df[[col]][idx[non_na]]
out[[col]] <- v
}
out
}
# ---- Write N single-target .annot files (when requested) ----
if (emit_single_local) {
for (i in seq_len(N_local)) {
out_anno <- ref_anno
out_anno$ANNOT <- score_list[[i]]
out_anno <- normalize_for_ldsc(out_anno)
name <- paste0("${annotation_name}", "_single_", i)
out_path_gz <- file.path("${cwd:a}", name, paste0(name, ".${input_chroms[_index]}.annot.gz"))
out_path_tsv <- sub("\\.gz$", "", out_path_gz)
dir.create(dirname(out_path_gz), showWarnings = FALSE, recursive = TRUE)
fwrite(out_anno, out_path_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")
}
}
# ---- Optionally write joint .annot ----
if (emit_joint_local) {
joint_anno <- ref_anno
for (i in seq_len(N_local)) {
joint_anno[[paste0("ANNOT_", i)]] <- score_list[[i]]
}
joint_anno <- normalize_for_ldsc(joint_anno)
joint_name <- paste0("${annotation_name}", "_joint")
joint_out_gz <- file.path("${cwd:a}", joint_name, paste0(joint_name, ".${input_chroms[_index]}.annot.gz"))
joint_out_tsv <- sub("\\.gz$", "", joint_out_gz)
dir.create(dirname(joint_out_gz), showWarnings = FALSE, recursive = TRUE)
fwrite(joint_anno, joint_out_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")
}
# ----------------------------------------------------------------------------
# Step B: gzip all annot files. Uses expand="$[ ]" so bash ${var} survives.
# ----------------------------------------------------------------------------
bash: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
set -e
annots=()
if [ "$[str(emit_single)]" = "True" ]; then
for i in $(seq 1 $[N_targets]); do
annots+=("$[cwd:a]/$[annotation_name]_single_$i/$[annotation_name]_single_$i.$[input_chroms[_index]].annot")
done
fi
if [ "$[str(emit_joint)]" = "True" ]; then
annots+=("$[cwd:a]/$[annotation_name]_joint/$[annotation_name]_joint.$[input_chroms[_index]].annot")
fi
for a in "${annots[@]}"; do
gzip -f "$a"
done
# ----------------------------------------------------------------------------
# Step C: run polyfun's LD-score computation for each emitted annotation file.
# ----------------------------------------------------------------------------
bash: expand = "$[ ]", stderr = f'{_output[1]}.stderr', stdout = f'{_output[1]}.stdout'
set -e
chrom="$[input_chroms[_index]]"
run_polyfun() {
local annot="$1"
local out_prefix="$2"
if [ "$[str(use_print_snps)]" = "True" ]; then
$[python_exec] $[polyfun_path]/ldsc.py \
--print-snps $[snp_list] \
$[ld_window_flag] $[ld_window_param] \
--out "$out_prefix" \
--bfile $[_input[-1]:nar] \
--yes-really \
--annot "$annot" \
--l2
else
$[python_exec] $[polyfun_path]/compute_ldscores.py \
--annot "$annot" \
--bfile $[_input[-1]:nar] \
$[ld_window_flag] $[ld_window_param] \
--out "${out_prefix}.$[ldscore_ext]" \
--allow-missing
fi
}
if [ "$[str(emit_single)]" = "True" ]; then
for i in $(seq 1 $[N_targets]); do
name="$[annotation_name]_single_$i"
annot="$[cwd:a]/$name/$name.$chrom.annot.gz"
prefix="$[cwd:a]/$name/$name.$chrom"
run_polyfun "$annot" "$prefix"
done
fi
if [ "$[str(emit_joint)]" = "True" ]; then
name="$[annotation_name]_joint"
annot="$[cwd:a]/$name/$name.$chrom.annot.gz"
prefix="$[cwd:a]/$name/$name.$chrom"
run_polyfun "$annot" "$prefix"
fi
# ----------------------------------------------------------------------------
# Step D: write .l2.M and .l2.M_5_50 files for each emitted annotation directory.
# ----------------------------------------------------------------------------
R: expand = "${ }", stderr = f'{_output[2]}.stderr', stdout = f'{_output[2]}.stdout'
suppressPackageStartupMessages({ library(data.table); library(dplyr) })
use_print_snps <- ${str(use_print_snps).upper()}
chrom <- "${input_chroms[_index]}"
# Look up .frq file under frqfile_dir, using plink_name + chrom (matches cell 25).
frq_file <- file.path("${frqfile_dir}", paste0("${plink_name}", chrom, ".frq"))
has_frq <- file.exists(frq_file)
frq_dt <- if (has_frq) fread(frq_file)[, .(SNP, MAF)] else NULL
write_M_files <- function(annot_path, ldscore_path, m_path) {
if (use_print_snps && file.exists(m_path) && file.exists(paste0(m_path, "_5_50"))) {
cat("M files already exist for", m_path, "\n"); return(invisible())
}
ldscore_dt <- if (endsWith(ldscore_path, ".parquet")) {
suppressPackageStartupMessages(library(arrow)); arrow::read_parquet(ldscore_path)
} else fread(ldscore_path)
annot_dt <- fread(annot_path)
annot_filtered <- annot_dt[annot_dt$SNP %in% ldscore_dt$SNP, ]
merged <- if (has_frq) merge(annot_filtered, frq_dt, by = "SNP", all.x = TRUE) else annot_filtered
std_cols <- c("CHR", "SNP", "BP", "CM", "A1", "A2", if (has_frq) "MAF")
annot_cols <- setdiff(names(merged), std_cols)
if (length(annot_cols) == 0L) { merged[, ANNOT := 1L]; annot_cols <- "ANNOT" }
M <- merged[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]
writeLines(paste(as.numeric(M), collapse = " "), m_path)
if (has_frq) {
common <- merged[!is.na(MAF) & MAF > 0.05, ]
M5 <- common[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]
writeLines(paste(as.numeric(M5), collapse = " "), paste0(m_path, "_5_50"))
}
}
targets <- c()
if (${"TRUE" if emit_single else "FALSE"}) {
for (i in seq_len(${N_targets})) {
targets <- c(targets, paste0("${annotation_name}", "_single_", i))
}
}
if (${"TRUE" if emit_joint else "FALSE"}) {
targets <- c(targets, paste0("${annotation_name}", "_joint"))
}
for (name in targets) {
annot_path <- file.path("${cwd:a}", name, paste0(name, ".", chrom, ".annot.gz"))
ldscore_path <- file.path("${cwd:a}", name, paste0(name, ".", chrom, ".${ldscore_ext}"))
m_path <- file.path("${cwd:a}", name, paste0(name, ".", chrom, ".l2.M"))
write_M_files(annot_path, ldscore_path, m_path)
}
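The Category-naming rules spelled out in the step header (ldsc.py's "n_annot == 1 -> 'L2'" quirk versus compute_ldscores.py keeping the annot column names) condense into a small pure function. This restates the documented rule; the function itself is ours:

```python
# Predict the target "Category" names that [get_heritability] will report
# in .results, given the annot column names and which polyfun script
# produced the LD scores. The "_0" suffix is the target's ref-ld index.
def expected_categories(annot_cols, used_snplist):
    if used_snplist:
        # ldsc.py --l2: exactly one annot -> hard-coded "L2";
        # two or more -> "<annot_name>L2" each.
        ld_cols = ["L2"] if len(annot_cols) == 1 else [c + "L2" for c in annot_cols]
    else:
        # compute_ldscores.py always keeps the annot column names.
        ld_cols = list(annot_cols)
    return [c + "_0" for c in ld_cols]

print(expected_categories(["ANNOT"], used_snplist=True))    # ['L2_0']
print(expected_categories(["ANNOT"], used_snplist=False))   # ['ANNOT_0']
print(expected_categories(["ANNOT_1", "ANNOT_2"], used_snplist=True))
# ['ANNOT_1L2_0', 'ANNOT_2L2_0']
```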
Munge Summary Statistics#
# The script munge_polyfun_sumstats.py automatically detects whether signed statistics
# (Z, BETA/SE, etc.) are present and computes Z-scores if needed.
[munge_sumstats_polyfun]
parameter: sumstats = path
parameter: n = 0 # sample size; passed to the script as --n only when > 0
parameter: min_info = 0.6
parameter: min_maf = 0.001
parameter: keep_hla = False
parameter: chi2_cut = 30
input: sumstats
output: f"{_input:n}.munged.parquet"
bash: expand=True, stderr=f'{_output:nn}.stderr', stdout=f'{_output:nn}.stdout'
{python_exec} {polyfun_path}/munge_polyfun_sumstats.py \
--sumstats {_input} \
--out {_output} \
{'--n {}'.format(n) if n>0 else ''} \
{'--min-info {}'.format(min_info)} \
{'--min-maf {}'.format(min_maf)} \
{'--chi2-cutoff {}'.format(chi2_cut)} \
{'--keep-hla' if keep_hla else ''} \
--remove-strand-ambig
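Conceptually, the signed-statistic handling reduces to: if no Z column is present, derive Z = BETA/SE, then drop rows whose chi-square exceeds the cutoff. A toy sketch of just that core step (the real script also applies the INFO/MAF filters, allele QC, and strand-ambiguity removal above; `add_z_and_filter` is our name, not polyfun's):

```python
# Minimal sketch: derive Z from BETA/SE when absent, then apply the
# chi^2 cutoff (chi^2 = Z^2). Toy data only; not real summary statistics.
import pandas as pd

def add_z_and_filter(df, chi2_cutoff=30.0):
    df = df.copy()
    if "Z" not in df.columns:
        df["Z"] = df["BETA"] / df["SE"]
    return df[df["Z"] ** 2 <= chi2_cutoff]

toy = pd.DataFrame({"SNP": ["rs1", "rs2"], "BETA": [0.02, 0.9], "SE": [0.01, 0.1]})
# rs1: Z = 2, chi^2 = 4 (kept); rs2: Z = 9, chi^2 = 81 (dropped)
print(add_z_and_filter(toy)["SNP"].tolist())  # ['rs1']
```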
Calculate Functional Enrichment using Annotations#
[get_heritability]
# Per-trait sLDSC regression via polyfun. Fans out across target_anno_dirs:
# each (trait, target_dir) pair becomes one polyfun invocation. Outputs go to
# <cwd>/<basename(target_dir)>/<trait>.{results,log,part_delete}.
#
# `target_anno_dirs` is the list produced by [make_annotation_files_ldscore]:
# typically N _single_<i> directories plus optionally one _joint directory.
#
# --- about the ".results" Category column and the "_0 / _1" suffix ---
# Each (trait, target_dir) pair is ONE polyfun call; its `ldsc.py --ref-ld-chr`
# always gets exactly two LD-score sources, in this order:
# "<target_dir>/<target>." (index 0) , "<baseline_dir>/<baseline>" (index 1)
# With --overlap-annot, every annotation column in the .results "Category" is
# named <ldscore_column_name>_<ref-ld-index>:
# index 0 = the target file -> "ANNOT_0" (no-snplist; compute_ldscores.py keeps the annot col name)
# -> "L2_0" (snplist + single annot; ldsc.py hard-codes "L2", see below)
# -> "ANNOT_1_0","ANNOT_2_0" (no-snplist joint dir, N>=2 annot cols)
# -> "ANNOT_1L2_0","ANNOT_2L2_0" (snplist joint dir, N>=2 -> "<name>L2")
# index 1 = the baseline file -> "base_1","Coding_UCSC_1", ... (the 97 baseline annots)
# So in this pipeline the suffix is only ever 0 (target) or 1 (baseline); it would
# continue 0,1,2,... only if you handed `ldsc.py --ref-ld-chr` more than two sources.
# (Why ANNOT_0 vs L2_0: see the [make_annotation_files_ldscore] header — ldsc.py's
# "n_annot == 1 -> column name 'L2'" quirk vs compute_ldscores.py keeping the annot
# column name.) [postprocess] auto-detects the target Category; if you instead pass
# --target-categories, the names must match this column exactly.
#
parameter: target_anno_dirs = paths()
parameter: all_traits = []
import os
with open(all_traits_file, 'r') as f:
trait_paths = [os.path.join(sumstat_dir, line.strip()) for line in f if line.strip()]
# Build (trait, target_dir) Cartesian product as parallel flat lists.
input_list = []
target_meta = []
for td in target_anno_dirs:
for t in trait_paths:
input_list.append(t)
target_meta.append(str(td))
input: input_list, group_by = 1, group_with = "target_meta"
output: f"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.log", \
f"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.results"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "${ }"
target_dir="${target_meta[_index]}"
target_name="$(basename ${target_meta[_index]})"
trait="$(basename ${_input[0]})"
output_dir="${cwd:a}/$target_name"
mkdir -p "$output_dir"
# MAF cutoff handling. Only 0 (disabled) or 0.05 (sLDSC default) are supported;
# other values would require recomputing LD scores at that cutoff.
frq_file_check="${frqfile_dir}/${plink_name}22.frq"
if [ "${maf_cutoff}" = "0" ] || [ "${maf_cutoff}" = "0.0" ]; then
echo "maf_cutoff = 0: skipping MAF filter (--not-M-5-50)"
frq_option="--not-M-5-50"
elif [ "${maf_cutoff}" = "0.05" ]; then
if [ -f "$frq_file_check" ]; then
echo "maf_cutoff = 0.05: using --frqfile-chr (MAF > 5%)"
frq_option="--frqfile-chr ${frqfile_dir}/${plink_name}"
else
echo "ERROR: maf_cutoff=0.05 requires .frq files for the reference panel,"
echo " but none found at ${frqfile_dir}/${plink_name}*.frq."
echo " Provide .frq files in frqfile_dir, or set maf_cutoff=0 (NOT recommended)."
exit 1
fi
else
echo "ERROR: maf_cutoff=${maf_cutoff} is not supported. Only 0 (no filter) or"
echo " 0.05 (sLDSC default) are accepted. Other values would require"
echo " recomputing LD scores at that cutoff."
exit 1
fi
run_ldsc() {
local extra_args="$1"
${python_exec} ${polyfun_path}/ldsc.py \
--h2 ${sumstat_dir}/$trait \
--ref-ld-chr "$target_dir/$target_name.","${baseline_ld_dir}/${baseline_name}" \
--out "$output_dir/$trait" \
--overlap-annot \
--w-ld-chr ${weights_dir}/${weight_name} \
$frq_option \
--print-coefficients \
--print-delete-vals \
--n-blocks ${n_blocks} \
$extra_args
}
run_ldsc ""
log_file="$output_dir/$trait.log"
# FloatingPointError retry ladder (preserved from original): 30 -> 20 -> 10
for max in 30 20 10; do
if [ -f "$log_file" ] && grep -q "FloatingPointError\|invalid value encountered in sqrt" "$log_file"; then
echo "FloatingPointError detected, retrying with --chisq-max $max..."
run_ldsc "--chisq-max $max"
else
break
fi
done
if [ -f "$log_file" ] && grep -q "FloatingPointError\|invalid value encountered in sqrt" "$log_file"; then
echo "ERROR: FloatingPointError persists for trait $trait at target $target_name even with --chisq-max 10"
echo "The summary statistics for this trait may be severely numerically unstable."
fi
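Given the Category suffix convention described in the header (target = ref-ld index 0, baseline = index 1), pulling the target rows out of a .results table is a one-liner; the toy table below is illustrative, not real polyfun output:

```python
# Extract target annotation rows from a polyfun .results table: the
# target's Category always ends in "_0", baseline annotations in "_1".
import pandas as pd

def target_rows(results):
    return results[results["Category"].str.endswith("_0")]

# Hypothetical .results fragment:
toy = pd.DataFrame({
    "Category": ["ANNOT_0", "base_1", "Coding_UCSC_1"],
    "Enrichment": [3.2, 1.0, 1.8],
    "Enrichment_p": [1e-4, 0.9, 0.02],
})
print(target_rows(toy)[["Category", "Enrichment"]])
```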
[postprocess]
# Post-processing of polyfun outputs via pecotmr::sldsc_postprocessing_pipeline.
# Reads .results / .log / .part_delete for all traits in `traits_file`, both
# single-target and (when present) joint-target runs, computes Gazal-style
# tau*, EnrichStat with back-solved jackknife SE, and runs the default
# DerSimonian-Laird random-effects meta across all supplied traits. Writes
# one RDS containing per-trait tables and three meta tables (tau*, E, EnrichStat).
parameter: traits_file = path() # text file: one trait sumstats filename per line
parameter: heritability_cwd = path() # parent directory of [get_heritability] outputs (contains <annotation_name>_single_<i>/ subdirs and optionally <annotation_name>_joint/)
parameter: target_categories = [] # target annotation names. Auto-detected from the joint-run results if empty.
parameter: target_categories_label = [] # optional display names, same order as target_categories;
# when given, every "target" column / tau*-block colname in
# the output RDS is renamed to these (params$target_categories
# holds the labels, params$target_categories_orig the originals).
parameter: target_anno_dir = path() # directory of target .annot.gz files used for sd_C and binary detection (typically the joint dir, since it carries all target columns)
input: traits_file
output: f"{cwd:a}/{annotation_name}.sldsc_postprocess.rds"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
library(pecotmr)
traits <- readLines("${traits_file}")
target_cats <- c(${",".join('"%s"' % c for c in target_categories)})
target_lab <- c(${",".join('"%s"' % c for c in target_categories_label)})
# Auto-detect single-target and joint-target output directories.
her_root <- "${heritability_cwd}"
all_subdirs <- list.dirs(her_root, recursive = FALSE)
single_pattern <- paste0("^", "${annotation_name}", "_single_([0-9]+)$")
joint_name <- paste0("${annotation_name}", "_joint")
single_dirs <- all_subdirs[grepl(single_pattern, basename(all_subdirs))]
single_indices <- as.integer(sub(single_pattern, "\\1", basename(single_dirs)))
single_dirs <- single_dirs[order(single_indices)]
joint_dir <- file.path(her_root, joint_name)
has_joint <- dir.exists(joint_dir)
message(sprintf("Detected %d single-target dirs%s",
length(single_dirs),
if (has_joint) "; joint-target dir present" else "; no joint-target dir"))
# Build per-trait prefix maps. Each trait's polyfun output is at <dir>/<trait>
# (polyfun appends .results / .log / .part_delete).
trait_single_prefixes <- lapply(traits, function(t) file.path(single_dirs, t))
names(trait_single_prefixes) <- traits
if (has_joint) {
trait_joint_prefix <- setNames(file.path(joint_dir, traits), traits)
} else {
trait_joint_prefix <- setNames(rep(NA_character_, length(traits)), traits)
}
res <- sldsc_postprocessing_pipeline(
trait_single_prefixes = trait_single_prefixes,
trait_joint_prefix = trait_joint_prefix,
target_anno_dir = "${target_anno_dir}",
frqfile_dir = "${frqfile_dir}",
plink_name = "${plink_name}",
maf_cutoff = ${maf_cutoff},
target_categories = if (length(target_cats) > 0) target_cats else NULL,
target_labels = if (length(target_lab) > 0) target_lab else NULL
)
saveRDS(res, "${_output[0]}")
message("S-LDSC post-processing complete; results written to ${_output[0]}")
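The cross-trait meta here is DerSimonian-Laird random effects. A textbook numpy sketch of that estimator, for reference (pecotmr's `meta_sldsc_random` may differ in detail, e.g. in how per-trait SEs are standardized first):

```python
# Textbook DerSimonian-Laird random-effects meta-analysis:
# estimate between-trait variance tau^2 from Cochran's Q, then
# re-weight by 1/(v_i + tau^2). Toy inputs; not pipeline output.
import numpy as np

def dl_meta(estimates, ses):
    y = np.asarray(estimates, dtype=float)
    v = np.asarray(ses, dtype=float) ** 2
    w = 1.0 / v                                   # fixed-effect weights
    q = np.sum(w * (y - np.sum(w * y) / np.sum(w)) ** 2)  # Cochran's Q
    k = len(y)
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))
    w_star = 1.0 / (v + tau2)                     # random-effects weights
    est = np.sum(w_star * y) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    return est, se, tau2

est, se, tau2 = dl_meta([0.5, 0.7, 0.4], [0.1, 0.1, 0.1])
print(round(est, 3), round(se, 3))  # 0.533 0.088
```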
[meta_subset]
# Optional: re-run random-effects meta on a user-defined subset of traits, using
# the cached per-trait standardized results from [postprocess]. No regression rerun.
parameter: postprocess_rds = path() # output of [postprocess]
parameter: subset_traits_file = path() # text file: one trait id per line, subset of those passed to [postprocess]
parameter: subset_name = str # label used in the output filename
parameter: target_categories = [] # target annotation names to meta on; if empty, uses all from postprocess output
# If [postprocess] was run with --target-categories-label, the cached RDS already
# carries the display names (params$target_categories = the labels), so leave
# --target-categories empty here (or pass the labels, not the original ANNOT_* names).
input: postprocess_rds, subset_traits_file
output: f"{cwd:a}/{annotation_name}.{subset_name}.meta.rds"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
library(pecotmr)
res <- readRDS("${postprocess_rds}")
subset_traits <- readLines("${subset_traits_file}")
target_cats <- c(${",".join([f'"{c}"' for c in target_categories])})
if (length(target_cats) == 0) target_cats <- res$params$target_categories
subset_per_trait <- res$per_trait[subset_traits]
# Map wide names (tau_star_single/joint) to bare names meta_sldsc_random expects.
view_single <- pecotmr:::.sldsc_view_for_meta(subset_per_trait, "single")
view_joint <- pecotmr:::.sldsc_view_for_meta(subset_per_trait, "joint")
out <- list(
tau_star_single = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, "tau_star")), target_cats),
tau_star_joint = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_joint, c, "tau_star")), target_cats),
enrichment = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, "enrichment")), target_cats),
enrichstat = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, "enrichstat")), target_cats)
)
saveRDS(out, "${_output[0]}")
message("Subset meta complete; results written to ${_output[0]}")