Stratified LD Score Regression#
This notebook implements an S-LDSC pipeline for LD score computation and functional enrichment analysis.
Important: the S-LDSC implementation comes from the polyfun package, not the original LDSC from bulik/ldsc GitHub repo.
The purpose of this pipeline is to use LD Score Regression (LDSC) to analyze the heritability and enrichment of genome annotations across GWAS traits. By integrating genome annotation files and GWAS summary statistics, this pipeline allows single tau analysis (individual annotation contributions) and joint tau analysis (independent contributions of multiple annotations after removing shared effects).
The pipeline integrates GWAS summary statistics, annotation data, and LD reference panel data to compute functional enrichment for each epigenomic annotation the user provides, using the S-LDSC model. We first give an introduction, setup instructions, and minimal working examples; the workflow code, which can be run with SoS on any data, appears at the end.
A brief review on Stratified LD score regression#
Here I briefly review LD Score Regression and what it is used for. For more in-depth information on LD Score Regression please read the following three papers:
“LD Score regression distinguishes confounding from polygenicity in genome-wide association studies” by Bulik-Sullivan et al. (2015)
“Partitioning heritability by functional annotation using genome-wide association summary statistics” by Finucane et al. (2015)
“Linkage disequilibrium-dependent architecture of human complex traits shows action of negative selection” by Gazal et al. (2017)
As stated in Bulik-Sullivan et al. (2015), confounding factors and polygenic effects can both inflate test statistics, and most existing methods cannot distinguish inflation from confounding bias from a true polygenic signal. LD Score Regression (LDSC) identifies the relative contribution of confounding versus polygenicity directly from GWAS summary statistics.
The approach regresses the test statistics of SNPs from GWAS summary statistics on their Linkage Disequilibrium (LD) scores. Variants in LD with a “causal” variant show an elevation in test statistics in association analysis proportional to their LD (measured by \(r^2\)) with the causal variant within a certain window size (e.g. 1 cM, 1 kb). In contrast, inflation from confounders such as population stratification arising purely from genetic drift will not correlate with LD. For a polygenic trait, SNPs with a high LD score will have more significant \(\chi^2\) statistics on average than SNPs with a low LD score. Thus, if we regress the \(\chi^2\) statistics from GWAS against LD Score, the intercept minus one estimates the mean contribution of confounding bias to the inflation in the test statistics. This regression model is known as LD Score regression.
LDSC model#
Under a polygenic assumption, in which effect sizes for variants are drawn independently from distributions with variance proportional to \(1/(p(1-p))\) where \(p\) is the minor allele frequency (MAF), the expected \(\chi^2\) statistic of variant \(j\) is

\[
E[\chi^2_j] \;=\; \frac{N h^2}{M}\,\ell_j \;+\; N a \;+\; 1 \tag{1}
\]
where \(N\) is the sample size; \(M\) is the number of SNPs, so that \(h^2/M\) is the average heritability per SNP; \(a\) measures the contribution of confounding biases such as cryptic relatedness and population stratification; and \(\ell_j = \sum_k r^2_{jk}\) is the LD Score of variant \(j\), which measures the amount of genetic variation tagged by \(j\). A full derivation is given in the Supplementary Note of Bulik-Sullivan et al. (2015); an alternative derivation appears in the Supplementary Note of Zhu and Stephens (2017) AoAS.
Equation (1) shows that LD Score regression can compute SNP-based heritability for a phenotype from GWAS summary statistics alone, without requiring individual-level genotype data as REML and related methods do.
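To make Equation (1) concrete, here is a toy simulation — not the actual ldsc estimator, which uses heteroskedasticity-aware regression weights and block jackknife; all numbers are made up — showing that a plain least-squares fit of \(\chi^2\) on LD score recovers \(h^2\) from the slope and \(1 + Na\) from the intercept:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, h2, a = 1_000_000, 50_000, 0.5, 1e-6   # SNPs, sample size, h2, confounding

ell = rng.gamma(shape=2.0, scale=50.0, size=M)   # toy LD scores (mean ~100)
mean_chi2 = N * h2 / M * ell + N * a + 1         # Equation (1)
chi2 = mean_chi2 * rng.chisquare(df=1, size=M)   # noisy chi-square statistics

# OLS of chi^2 on LD score; slope estimates N*h2/M, intercept estimates 1 + N*a
slope, intercept = np.polyfit(ell, chi2, deg=1)
print("h2 estimate:", slope * M / N)   # close to 0.5
print("intercept:  ", intercept)       # close to 1 + N*a = 1.05
```

The intercept-minus-one reading of confounding described above corresponds to `intercept - 1` here.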
Stratified LDSC#
Heritability is the proportion of phenotypic variation that is due to variation in genetic values, and it can also be partitioned over disjoint or overlapping categories of SNPs.
Stratified LD Score Regression (S-LDSC) partitions heritability by leveraging both LD-score information and SNPs that have not reached genome-wide significance. S-LDSC exploits the fact that the \(\chi^2\) statistic for a given SNP reflects the cumulative effects of all SNPs tagged by it: in regions of high LD, the focal SNP captures the contribution of a group of nearby SNPs.
S-LDSC declares an annotation enriched for heritability if SNPs with high LD to that annotation have higher \(\chi^2\) statistics than SNPs with low LD to it.
Let \(a_{jC}\) denote the value of annotation \(C\) at SNP \(j\):
Binary annotation (e.g. an indicator for “in enhancer”, “in exon”, “in cell-type-specific peak”): \(a_{jC} \in \{0, 1\}\).
Continuous annotation (e.g. gene-specificity score, conservation score, continuous epigenomic signal): \(a_{jC} \in \mathbb{R}\).
Under a polygenic model the per-SNP heritability for SNP \(j\) is

\[
\mathrm{Var}(\beta_j) \;=\; \sum_C \tau_C\, a_{jC}
\]

and the expected \(\chi^2\) statistic of SNP \(j\) is

\[
E[\chi^2_j] \;=\; N \sum_C \tau_C\, \ell(j, C) \;+\; N a \;+\; 1 \tag{2}
\]
where \(\ell(j, C) = \sum_k a_{kC}\, r^2_{jk}\) is the partitioned LD score of SNP \(j\) with respect to annotation \(C\), and \(a\) measures confounding bias. Equation (2) allows joint estimation of all \(\tau_C\) via a (computationally simple) multiple regression of \(\chi^2_j\) against \(\ell(j, C)\).
Interpretation of \(\tau_C\):
Binary \(C\): \(\tau_C\) is the additive increase in per-SNP heritability for SNPs in category \(C\), on top of the contributions from any other annotations they belong to.
Continuous \(C\): \(\tau_C\) is the additive change in per-SNP heritability per unit increase in the value of annotation \(C\).
For application to real data and comparisons to other methods, see the three papers cited at the top of this notebook.
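A minimal numerical sketch of the joint estimation behind Equation (2), with two hypothetical annotations and made-up \(\tau\) values (again not the real S-LDSC estimator, which uses jackknife blocks and regression weights):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 200_000, 20_000
tau_true = np.array([2e-7, 5e-8])   # made-up per-annotation coefficients

# Toy partitioned LD scores ell(j, C) for two annotations
L = np.column_stack([rng.gamma(2.0, 40.0, M), rng.gamma(2.0, 20.0, M)])
mean_chi2 = N * (L @ tau_true) + 1.0         # Equation (2) with a = 0
chi2 = mean_chi2 * rng.chisquare(1, M)

# Joint estimation: regress chi^2 on an intercept and N * ell(j, C)
X = np.column_stack([np.ones(M), N * L])
coef, *_ = np.linalg.lstsq(X, chi2, rcond=None)
tau_hat = coef[1:]
print(tau_hat)   # close to tau_true
```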
Tau Estimation and Enrichment Analysis#
Goal: quantify the contribution of functional annotations to trait heritability and assess statistical significance, accounting for LD structure and (for continuous annotations) annotation scale.
The pipeline has two computational layers:
Regression layer — the S-LDSC regression itself, performed by the polyfun engine. We do not re-implement this.
Post-processing layer — standardization, differential per-SNP heritability, binary/continuous detection, and random-effects meta-analysis across traits. Implemented in the pecotmr R package (R/sldsc_wrapper.R).
The notation below tags each modeling quantity as (polyfun) or (pecotmr).
Notation#
For each annotation \(C\) we use:
\(\pi^{h^2}_C\) = proportion of trait heritability \(h^2_g\) assigned to annotation \(C\).
\(\pi^{M}_C\) = proportion of (effective) SNPs in annotation \(C\). For binary annotations this is \(M_C / M_{\mathrm{ref}}\); for continuous annotations it is the share of total annotation weight in \(C\).
Reference panel and MAF cutoff#
All LD-derived quantities — partitioned LD scores for the 97 baseline annotations and for our \(K\) target annotations, the LD-score-regression weights, allele frequencies, and the SNP set — are computed against our own LD reference panel. We do not mix in pre-computed quantities from external panels (e.g. 1000G); \(M_{\mathrm{ref}}\) throughout this notebook denotes the number of common SNPs in our panel.
By default we restrict to MAF \(> 5\%\) per the sLDSC recommendation: rare-variant LD is unstable and HapMap3-style regression weights are common-variant by construction. The cutoff is exposed as the SoS parameter maf_cutoff (default \(0.05\)); the regression, the standardized \(sd_C\), and \(M_{\mathrm{ref}}\) are all evaluated on the same MAF \(>\) cutoff SNP set. If allele-frequency files are not available the pipeline fails; the user must explicitly set maf_cutoff = 0 to opt out (not recommended).
Quantities from the regression layer (polyfun)#
Solving Equation (2) jointly across annotations, with 200-block genomic jackknife for inference, is performed by polyfun’s ldsc.py. From each polyfun run we obtain, per annotation:
\(\tau_C\) and its standard error — (polyfun).
\(\pi^{h^2}_C\) and \(\pi^{M}_C\) — (polyfun).
\(E_C = \pi^{h^2}_C / \pi^{M}_C\) and its standard error — (polyfun).
The p-value of the differential per-SNP heritability test (defined below) — (polyfun), computed internally with the full coefficient covariance matrix.
We also obtain, per run:
The total trait heritability \(h^2_g\) — (polyfun).
The 200-block jackknife delete-values of \(\tau_C\) — (polyfun).
Quantities from the post-processing layer (pecotmr)#
From the polyfun outputs above plus our reference panel, the post-processing layer computes:
\(sd_C\) — per-annotation standard deviation over MAF \(>\) cutoff SNPs — (pecotmr: compute_sldsc_annot_sd).
\(M_{\mathrm{ref}}\) — reference SNP count at the MAF cutoff — (pecotmr: compute_sldsc_M_ref).
Whether each annotation is binary or continuous — (pecotmr: is_binary_sldsc_annot).
\(\tau^*_C\) point estimate and per-block \(\tau^*_C\) — (pecotmr: standardize_sldsc_trait).
EnrichStat point estimate and its standard error (formula below) — (pecotmr: standardize_sldsc_trait).
DerSimonian-Laird random-effects meta-analysis of \(\tau^*_C\), \(E_C\), or EnrichStat across traits — (pecotmr: meta_sldsc_random).
The top-level entry point pecotmr::sldsc_postprocessing_pipeline orchestrates all of the above.
Standardized tau (\(\tau^*\)) — (pecotmr)#
\(\tau_C\) has units that depend on the scale of the annotation and on the total heritability of the trait, so raw \(\tau\) is not directly comparable across annotations or across traits. We compute the standardized version (Gazal et al. 2017)

\[
\tau^*_C \;=\; \frac{M_{\mathrm{ref}}\, sd_C}{h^2_g}\, \tau_C
\]
interpreted as the additive change in per-SNP heritability associated with a 1 standard deviation increase in annotation \(C\), divided by the average per-SNP heritability across all SNPs. \(\tau^*_C\) is dimensionless and comparable across annotations and across traits. In a joint multi-annotation regression it is the independent contribution of annotation \(C\) after controlling for overlapping effects of the others.
Here \(sd_C\) is the standard deviation of annotation \(C\) across reference SNPs (MAF \(>\) cutoff), \(M_{\mathrm{ref}}\) is the count of those SNPs, and \(h^2_g\) is the trait heritability. Applying the same scaling to each of the 200 jackknife blocks yields per-block \(\tau^*_C\) values; their sample variance gives the jackknife standard error

\[
SE^{\text{jackknife}}(\tau^*_C) \;=\; \sqrt{\tfrac{(B-1)^2}{B}\, \mathrm{Var}_b\!\left(\tau^*_{C,(b)}\right)}
\]

with \(B = 200\), used as the per-trait input to cross-trait meta-analysis.
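The scaling and the jackknife SE are only a few lines; the values below are hypothetical, and in the pipeline this is done by pecotmr::standardize_sldsc_trait rather than this sketch:

```python
import numpy as np

def tau_star(tau, sd_c, m_ref, h2_g):
    """tau* = M_ref * sd_C * tau / h2_g (Gazal et al. 2017)."""
    return m_ref * sd_c * tau / h2_g

def jackknife_se(theta_blocks):
    """SE = sqrt((B-1)^2 / B * Var_b(theta_(b))) over B delete-block values."""
    b = len(theta_blocks)
    return np.sqrt((b - 1) ** 2 / b * np.var(theta_blocks, ddof=1))

# Hypothetical inputs: 5M reference SNPs, annotation SD 0.21, trait h2 0.34
ts = tau_star(3.1e-9, 0.21, 5_000_000, 0.34)

# Fake per-block tau* draws standing in for the 200 jackknife delete-values
blocks = tau_star(3.1e-9 + 1e-10 * np.random.default_rng(2).standard_normal(200),
                  0.21, 5_000_000, 0.34)
se = jackknife_se(blocks)
```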
Differential per-SNP heritability (“EnrichStat”) — (polyfun + pecotmr)#
To test whether the per-SNP heritability inside annotation \(C\) differs from outside it (Finucane et al. 2015), define the differential per-SNP heritability

\[
\text{EnrichStat}_C \;=\; \frac{h^2_C}{M_C} \;-\; \frac{h^2_g - h^2_C}{M_{\mathrm{ref}} - M_C},
\qquad h^2_C = \pi^{h^2}_C\, h^2_g .
\]

The point-estimate p-value of this test is computed by polyfun internally using the full coefficient covariance and reported as Enrichment_p. Its standard error is recovered from the reported p-value:

\[
SE(\text{EnrichStat}_C) \;=\; \frac{\lvert\text{EnrichStat}_C\rvert}{\Phi^{-1}(1 - p/2)} .
\]
This per-trait point + SE is the input to cross-trait meta-analysis.
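Back-solving an SE from a two-sided normal p-value is a one-liner; a stdlib sketch (the pipeline does this inside pecotmr, so the function name here is illustrative):

```python
from statistics import NormalDist

def se_from_p(estimate, p):
    """Recover SE from a two-sided p-value: SE = |estimate| / z, z = Phi^{-1}(1 - p/2)."""
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    return abs(estimate) / z

# Round trip: the recovered SE reproduces the original p-value
est, p = 0.8, 0.003
se = se_from_p(est, p)
p_back = 2.0 * (1.0 - NormalDist().cdf(abs(est) / se))
```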
Reporting: binary vs. continuous annotations — (pecotmr)#
The estimation machinery applies to both annotation types, but the headline quantity to report within each type differs.
For a binary annotation (e.g. enhancer indicator, exon, in/out of a cell-type peak), \(\pi^{M}_C = M_C / M_{\mathrm{ref}}\) has a direct interpretation and \(E_C\) reads as “the category explains \(E_C\)-fold more heritability than its share of SNPs.” The within-type headline quantities are therefore \(E_C\) and the EnrichStat p-value; \(\tau^*_C\) is reported alongside.
For a continuous annotation (e.g. gene-specificity score, conservation score, continuous epigenomic signal), \(E_C\) depends on the scale of the annotation: rescaling the annotation by a constant changes \(E_C\) even though the underlying biology is unchanged. The within-type headline quantities are therefore \(\tau^*_C\) and its p-value; \(E_C\) is reported alongside but should not be interpreted for continuous annotations.
The pipeline determines whether an annotation is binary by inspecting whether its values lie in \(\{0, 1\}\) and selects the appropriate within-type headline statistic automatically (pecotmr).
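The binary check is simply whether every annotation value is 0 or 1. A sketch of the described behaviour (the actual implementation lives in pecotmr::is_binary_sldsc_annot; its exact details are assumptions here):

```python
import numpy as np

def is_binary_annot(values):
    """True iff every value of the annotation is exactly 0 or 1."""
    v = np.asarray(values, dtype=float)
    return bool(np.isin(v, (0.0, 1.0)).all())

print(is_binary_annot([0, 1, 1, 0]))    # True  -> headline: E_C + EnrichStat p
print(is_binary_annot([0.2, 0.9, 1]))   # False -> headline: tau*_C + its p
```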
From the official LDSC tutorial (Partitioned Heritability from Continuous Annotations):
“Enrichment is (Prop. heritability) / (Prop. SNPs). These outputs make sense only for binary annotations. Do not try to interpret them for continuous annotations. Using --print-coefficients outputs the regression coefficients and corresponding standard errors and Z score for each annotation. These coefficients measure the additional contribution of one annotation to the model and are interpretable for both binary and continuous annotations.”

The pipeline always passes --print-coefficients to polyfun for this reason.
Cross-type comparison: always use \(\tau^*_C\) — (pecotmr)#
For an apple-to-apple comparison across binary and continuous annotations — ranking annotations on a single axis, meta-analyzing a mixed set, or reporting a leaderboard that pools both types — use \(\tau^*_C\). The standardization in Gazal et al. (2017) was designed for exactly this purpose: \(sd_C = \sqrt{p(1-p)}\) for a binary annotation (where \(p\) is the proportion in the category) and \(sd_C = \) empirical standard deviation for a continuous annotation, so the resulting \(\tau^*_C\) is dimensionless and has the same interpretation in both cases — additive change in per-SNP heritability per 1 SD increase in the annotation, normalized by the average per-SNP heritability. \(E_C\) does not have this property and must not be compared across types.
The pipeline emits both \(E_C\) and \(\tau^*_C\) for every annotation, with the binary/continuous flag, so callers can pick the right column for the comparison they are making.
Joint analysis — (polyfun runs the regression; pecotmr standardizes both modes)#
For joint analysis (multiple annotations fit together), both \(\tau\) and \(E\) are conditional on the other annotations in the model. We report joint \(\tau^*_C\) as the independent contribution of annotation \(C\) after controlling for the others. The annotation-prep step exposes two independent toggles, compute_single and compute_joint (both default True), so the user can produce the \(N\) single-target outputs, the joint output, or both in one invocation. With both defaults the post-processing layer reads all \(N+1\) regression outputs per trait and presents single + joint side-by-side. When the joint subset is decided after looking at single-target results (exploratory \(\rightarrow\) conditional workflow), the user runs the annotation-prep step a second time with compute_single=False on the curated subset.
Meta-Analysis across Traits (Random Effects) — (pecotmr)#
DerSimonian-Laird random-effects meta-analysis of per-annotation estimates across traits, implemented in pecotmr::meta_sldsc_random (which delegates the numerics to rmeta::meta.summaries(..., method = "random")):

\[
\hat\theta \;=\; \frac{\sum_i w_i\, \hat\theta_i}{\sum_i w_i},
\qquad w_i = \frac{1}{SE_i^2 + \hat\tau^2},
\]

where \(\hat\theta_i\) is the per-trait estimate, \(SE_i\) its standard error, and \(\hat\tau^2\) the DerSimonian-Laird estimate of the between-trait variance:
For \(\tau^*_C\) meta: \(SE_i\) is the jackknife SE from the per-block \(\tau^*_C\) values.
For \(E_C\) meta: \(SE_i\) is the polyfun-reported Enrichment_std_error.
For EnrichStat meta: \(SE_i\) is the back-solved SE from polyfun’s Enrichment_p.
For binary-annotation enrichment reporting we use a two-channel meta: the effect size and SE come from the meta on \(E_C\) (interpretable on the original enrichment-fold scale), while the p-value comes from the meta on EnrichStat (the appropriate hypothesis test). The pipeline produces a default meta over all supplied traits; users can re-run meta on any subset of traits without re-running the regression layer.
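For reference, the DerSimonian-Laird estimator the meta step relies on can be sketched directly (the pipeline delegates the numerics to rmeta::meta.summaries; this re-implementation and the numbers are for illustration only):

```python
import numpy as np

def dl_meta(theta, se):
    """DerSimonian-Laird random-effects meta-analysis: returns (estimate, SE)."""
    theta, se = np.asarray(theta, float), np.asarray(se, float)
    w = 1.0 / se**2                             # fixed-effect weights
    theta_fe = np.sum(w * theta) / np.sum(w)
    q = np.sum(w * (theta - theta_fe) ** 2)     # Cochran's Q
    k = len(theta)
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_re = 1.0 / (se**2 + tau2)                 # random-effects weights
    return np.sum(w_re * theta) / np.sum(w_re), np.sqrt(1.0 / np.sum(w_re))

# Hypothetical per-trait tau* estimates and jackknife SEs
est, meta_se = dl_meta([0.9, 1.4, 0.6], [0.3, 0.4, 0.25])
```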
Input#
1. Target Annotation File#
Purpose: Specifies the user-provided (“target”) genome annotation files. The pipeline supports both binary and continuous annotations; the type is auto-detected per annotation column.
Formats:
Text file (.txt) listing per-chromosome paths to annotation files. Annotation files can be .rds/.tsv/.txt.
Alternatively, files for specific chromosomes can be provided directly.
Multiple target annotations are supported in one input file (one column per annotation, prefixed path, path1, path2, …). Single-target and joint-target analyses are produced automatically in one pipeline pass.
Format (the score column is optional; if absent, score is set to 1):
is_range = False:
chr pos score
1   10001 1
1   10002 1
is_range = True:
chr start end score
1   10001 20001 1
1   30001 40001 1
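A stdlib sketch of reading the is_range = False layout with the optional score column defaulted to 1, as described above (the pipeline's own reader is part of the SoS workflow; this is only an illustration on an in-memory example):

```python
import csv
import io

# In-memory stand-in for an annotation file without the optional score column
text = "chr pos\n1 10001\n1 10002\n"
rows = list(csv.DictReader(io.StringIO(text), delimiter=" "))
for row in rows:
    row.setdefault("score", "1")   # score defaults to 1 when absent

print(rows[0])   # {'chr': '1', 'pos': '10001', 'score': '1'}
```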
2. Reference Annotation File (baseline-LD)#
Purpose: Provides the baseline annotations (typically the 97-annotation baseline-LD model from Gazal et al. 2017) in .annot.gz format for each chromosome. The baseline conditions every regression.
Formats:
Text file listing baseline annotation files for all chromosomes.
Alternatively, files for specific chromosomes can be provided directly.
3. Genome Reference File#
Purpose: PLINK-format .bed/.bim/.fam files for our LD reference panel, per chromosome. This is the panel against which all LD-derived quantities (target LD scores, baseline LD scores, regression weights, allele frequencies) must be computed. Do not mix files derived from different panels (e.g. 1000G vs ADSP).
Formats:
Text file listing per-chromosome reference files, or files for specific chromosomes.
4. SNP List#
Purpose: Specifies the SNPs to include in LDSC analysis (typically a HapMap3-style list).
Format: A list of rsids, one per line.
5. Allele Frequency Files (.frq, our panel)#
Purpose: PLINK .frq files for the reference panel, used to enforce the MAF cutoff. Required when maf_cutoff > 0 (default 0.05); the pipeline fails if missing unless maf_cutoff = 0 is explicitly set.
6. GWAS Summary Statistics#
Purpose: One munged sumstats file per trait, listed in a text file (all_traits_file). The pipeline runs the regression once per trait per single/joint mode.
Format:
CAD_META.filtered.sumstats.gz UKB.Lym.BOLT.sumstats.gz
Workflow Steps#
The pipeline is organized in three logical stages: (1) input preparation and the S-LDSC regression itself (handled by polyfun), (2) post-processing into standardized \(\tau^*\) and meta-analysis (handled by R wrappers in the pecotmr package), and (3) optional re-meta on user-defined trait subsets.
Step 1: Annotation Preparation and S-LDSC Regression (polyfun)#
Three SoS workflows wrap polyfun:
make_annotation_files_ldscore — converts user-provided target annotations into polyfun’s .annot.gz format and runs compute_ldscores.py against our reference panel. Two independent toggles control what is produced:
compute_single (default True): produce \(N\) single-target LD-score directories, one per target annotation in the input file.
compute_joint (default True): produce one joint LD-score directory bundling all \(N\) targets together (only emitted when \(N \geq 2\)).
Both defaults true means the typical Workflow A invocation produces everything in one pass. For Workflow B (exploratory single first, joint later on a curated subset), run this step twice: once with --no-compute-joint on the candidate set; later with --no-compute-single on the curated subset (passing a new annotation_file containing just those annotations). All outputs are cached on disk and reused across traits and polyfun runs via SoS’s output-timestamp check.
munge_sumstats_polyfun — preprocesses each GWAS into LDSC-compatible format.
get_heritability — for each trait, runs polyfun’s ldsc.py once per target_anno_dir supplied. Pass the list of single-target dirs (and the joint dir, if produced) as --target-anno-dirs. Output goes to <cwd>/<basename(target_dir)>/<trait>.{results,log,part_delete}. MAF \(>\) cutoff is enforced via --frqfile-chr; maf_cutoff accepts only 0 or 0.05.
Step 2: Post-Processing (pecotmr::sldsc_postprocessing_pipeline)#
A single R function call consumes all polyfun outputs for the run and produces the final tables:
Reads each polyfun output and extracts \(\tau\), \(E\), \(h^2_g\), EnrichStat p-value, and per-block jackknife \(\tau\) values.
Computes annotation \(sd_C\) and \(M_{\mathrm{ref}}\) over the same MAF \(>\) cutoff SNP set as the regression.
Standardizes \(\tau \to \tau^*\) for both single-tau and joint-tau modes, including the per-block versions for jackknife SE.
Auto-detects whether each annotation is binary or continuous and tags every output row accordingly.
Reports the number and names of baseline annotations encountered (via message()) for transparency.
Runs the default DerSimonian-Laird random-effects meta-analysis across all supplied traits, producing three meta tables: \(\tau^*\) (cross-type comparable), \(E\) (within-binary), and EnrichStat (within-type).
Outputs are returned as an R list with two top-level entries: per_trait (one tidy data frame per trait, single + joint estimates side-by-side per target) and meta (three tables, one per quantity, with rows = target annotations and columns = single/joint mean/SE/p plus an is_binary flag).
Step 3: Optional Subset Meta-Analysis (pecotmr::meta_sldsc_random)#
The default meta in Step 2 pools all traits the user supplied. To re-run the meta on a subset (e.g., neurodegenerative traits only, or autoimmune traits only) without re-running the regression layer:
res <- readRDS("sldsc_results.rds")
neuro <- c("AD_GWAX", "PD_meta", "ALS_meta")
meta_neuro_taustar <- pecotmr::meta_sldsc_random(
res$per_trait[neuro], category = "my_target_anno", quantity = "tau_star"
)
This step is light-weight and can be run interactively.
Output summary#
| Stage | Cached on disk | Recomputable from | Purpose |
|---|---|---|---|
| Target LD scores | per-annotation, once | annotation + reference panel | input to every regression |
| polyfun | yes | regression run | \(\tau\), \(E\), EnrichStat |
| Per-trait standardized table | yes (RDS) | polyfun outputs + \(sd_C\) + \(M_{\mathrm{ref}}\) | reporting + meta |
| Default meta tables | yes (RDS) | per-trait standardized | headline figures |
| Subset meta | re-run on demand | per-trait standardized | custom analyses |
MWE:#
1. make_annotation_files_ldscore#
The annotation file can be a txt file with columns #id and path1, path2, …; it can also be rds files separated by ‘,’.
1.1 single tau analysis, with one annotation as the input#
# case 1: txt file as input
sos run pipeline/sldsc_enrichment.ipynb make_annotation_files_ldscore \
--annotation_file data/polyfun/input/colocboost_test_annotation_path.txt \
--reference_anno_file data/polyfun/input/reference_annotation0.txt \
--genome_ref_file data/polyfun/input/genome_reference_bfile.txt \
--annotation_name test_colocboost \
--plink_name reference. \
--baseline_name annotations. \
--weight_name weights. \
--python_exec python \
--polyfun_path data/github/polyfun \
--cwd output/polyfun/ -j 22
Alternatively, we can use files for a specific chromosome instead of a txt list.
# single file format
sos run pipeline/sldsc_enrichment.ipynb make_annotation_files_ldscore \
--annotation_file data/polyfun/input/colocboost_test.tsv \
--reference_anno_file data/polyfun/example_annot0/annotations.1.annot.gz \
--genome_ref_file data/polyfun/example_data/reference.1.bed \
--annotation_name test_colocboost \
--plink_name reference. \
--baseline_name annotations. \
--weight_name weights. \
--python_exec python \
--polyfun_path data/github/polyfun \
--cwd output/polyfun/ --chromosome 1
1.2 joint tau#
With more than one annotation as the input.
--snp_list <file> is optional and orthogonal to single/joint: it restricts which SNP rows are written to the .l2.ldscore output (HM3-style restriction) — it does not change the LD-score values of retained SNPs (the r²-window still uses all .bim SNPs), nor .l2.M / .l2.M_5_50. With --snp_list the step runs polyfun’s ldsc.py --print-snps and emits .l2.ldscore.gz (without it: compute_ldscores.py → .l2.ldscore.parquet), and the target annot is normalized to the plink .bim. Downstream get_heritability / postprocess commands are unchanged — they just read whichever .l2.ldscore.{gz,parquet} exists; the regression runs on the intersection sumstats ∩ baseline ∩ weights ∩ target, so for the HM3 restriction to take full effect the baseline & weights LD scores should also be HM3-restricted.
Which polyfun script & what to align. --snp_list present → step 1 runs ldsc.py --l2 --print-snps (output .l2.ldscore.gz); absent → compute_ldscores.py (output .l2.ldscore.parquet). ldsc.py requires the target annot to line up with the plink .bim — same SNP set and same row order. The pipeline handles this internally (normalize_for_ldsc re-expands the annot onto the .bim rows, filling 0), but if you assemble inputs by hand, keep the plink .bim, .frq, target annot, and baseline annot all on the same reference panel with consistent SNP set/order: target and baseline annot row counts must match (polyfun hstacks them), and the .frq MAF filter (--frqfile-chr, m50) plus the by-SNP merge in get_heritability both assume a consistent panel. Mixing panels (e.g. 1000G plink with an ADSP-derived annot, or a .frq whose row count/order differs from the annot) is the most common source of wrong Prop_SNPs / Enrichment.
# Annotation prep: an annotation_file with multiple `path*` columns yields
# N single-target output directories plus 1 joint directory in one call.
# To do Workflow B step 1 (singles only), pass --no-compute-joint.
# To do Workflow B step 2 (joint only on a curated subset), pass --no-compute-single
# with a new annotation_file containing just the curated annotations.
sos run pipeline/sldsc_enrichment.ipynb make_annotation_files_ldscore \
--annotation_file data/quantile_qtl_annotation/AC_DeJager_eQTL.multi_target_example.txt \
--reference_anno_file data/reference_annotation.txt \
--genome_ref_file data/genome_reference_bfile.txt \
--snp_list data/1000G_EUR_Phase3_hg38/list.txt \
--annotation_name my_analysis \
--cwd output \
--chromosome 1
# Produces (with defaults: compute_single=TRUE, compute_joint=TRUE):
# output/my_analysis_single_1/ ... output/my_analysis_single_N/ (compute_single)
# output/my_analysis_joint/ (compute_joint, N>=2)
# Direct file-path mode: comma-separated annotation files (single-chromosome run).
# Same toggle semantics: compute_single (default TRUE) and compute_joint (default TRUE).
sos run pipeline/sldsc_enrichment.ipynb make_annotation_files_ldscore \
--annotation_file data/AC_DeJager_eQTL.unique_qr.rds,data/AC_DeJager_eQTL.shared_homo.rds \
--reference_anno_file data/example_anno/ABC_Road_GI_BRN.1.annot.gz \
--genome_ref_file data/plink_files/1000G.EUR.hg38.1.bed \
--snp_list data/1000G_EUR_Phase3_hg38/list.txt \
--annotation_name my_analysis \
--cwd output/mwe \
--chromosome 1
2. get_heritability#
# Run polyfun across all (trait, target_dir) pairs in one invocation.
# target_anno_dirs is the list of output dirs from [make_annotation_files_ldscore]:
# the N single-target dirs plus optionally the joint dir.
sos run pipeline/sldsc_enrichment.ipynb get_heritability \
--target-anno-dirs output/my_analysis_single_1 output/my_analysis_single_2 output/my_analysis_joint \
--all-traits-file data/all_traits.txt \
--sumstat-dir data/sumstats \
--baseline-ld-dir data/baselineLD_v2.2_ADSP \
--frqfile-dir data/ADSP/frq \
--weights-dir data/ADSP_weights \
--plink-name ADSP_chr \
--baseline-name baseline_chr \
--weight-name weights_chr \
--annotation-name my_analysis \
--cwd output/heritability \
--maf-cutoff 0.05
# (allm variant: use --maf-cutoff 0 — no MAF filter; polyfun then uses --not-M-5-50)
3. Post-processing (pecotmr) and meta-analysis#
The [postprocess] step reads all polyfun outputs under heritability_cwd
(which contains the \(N\) single-target subdirectories and optionally the
joint subdirectory) and calls pecotmr::sldsc_postprocessing_pipeline()
to produce per-trait standardized tables and the default random-effects
meta across all traits.
Use --target-categories-label (same order as --target-categories) to give the target annotations friendly names in the output — e.g. --target-categories ANNOT_1_0 ANNOT_2_0 --target-categories-label quantile_eQTL eQTL makes the target column read quantile_eQTL / eQTL instead of ANNOT_1_0 / ANNOT_2_0 (the original names are kept in params$target_categories_orig). Omit it to keep the polyfun .results names.
# Post-processing: read all polyfun outputs (single-target + joint, all traits)
# under heritability_cwd and run the default meta-analysis via
# pecotmr::sldsc_postprocessing_pipeline. Single and joint target subdirectories
# are auto-detected from <annotation_name>_single_<i> and <annotation_name>_joint.
# --target-categories: the joint-run target columns as they appear in the .results "Category"
# column. The 2-target joint dir here has columns ANNOT_1, ANNOT_2 (compute_ldscores.py keeps
# the annot col names) and polyfun appends "_0" (target = ref-ld file 0) -> "ANNOT_1_0", "ANNOT_2_0".
# (Single-target dirs would be "ANNOT_0"; with --snp_list/ldsc.py and a single annot it is "L2_0".)
# You can drop --target-categories entirely to auto-detect from the joint-run results.
# --target-categories-label (optional, same order as --target-categories): friendly display
# names; renames the "target" column in the output (originals kept in params$target_categories_orig).
sos run pipeline/sldsc_enrichment.ipynb postprocess \
--traits-file data/all_traits.txt \
--heritability-cwd output/heritability \
--target-categories ANNOT_1_0 ANNOT_2_0 \
--target-categories-label quantile_eQTL eQTL \
--target-anno-dir output/my_analysis_joint \
--frqfile-dir data/ADSP/frq \
--plink-name ADSP_chr \
--annotation-name my_analysis \
--cwd output/sldsc_postprocess \
--maf-cutoff 0.05
# (allm variant: use --maf-cutoff 0 — no MAF filter; polyfun then uses --not-M-5-50)
4. Optional: subset meta-analysis#
The default meta in step 3 pools all traits supplied to [postprocess]. Use [meta_subset] to re-run the meta on a user-defined trait subset (e.g., neurodegenerative traits only, autoimmune traits only) without re-running the regression or the per-trait standardization. The subset operates on the cached .sldsc_postprocess.rds output; it is light-weight and can be run interactively or in batch.
# Re-run random-effects meta on a subset of traits without recomputing the regression layer.
sos run pipeline/sldsc_enrichment.ipynb meta_subset \
--postprocess-rds output/sldsc_postprocess/my_analysis.sldsc_postprocess.rds \
--subset-traits-file data/neuro_traits.txt \
--subset-name neuro \
--target-categories ANNOT_1_0 ANNOT_2_0 \
--annotation-name my_analysis \
--cwd output/sldsc_postprocess
[global]
# Path to the work directory of the analysis.
parameter: cwd = path('output')
# Prefix for the analysis output
parameter: annotation_name = str
parameter: python_exec = "python" # e.g. "/home/you/.conda/envs/polyfun/bin/python"
parameter: polyfun_path = path # e.g. "/home/you/tools/polyfun"
# MAF cutoff for sLDSC. Default 0.05 per sLDSC recommendation (rare-variant LD is unstable
# and HapMap3-style regression weights are common-variant by construction).
# Set to 0 to opt out of MAF filtering (NOT recommended; only use if you understand the implications).
# Other values would require recomputing LD scores at that cutoff.
parameter: maf_cutoff = 0.05
# for make_annotation_files_ldscore workflow:
parameter: annotation_file = path()
parameter: reference_anno_file = path()
parameter: genome_ref_file = path() # with .bed
parameter: chromosome = []
parameter: snp_list = path()
parameter: ld_wind_kb = 0 # use kb if the value is provided
parameter: ld_wind_cm = 1.0 # default using ld_wind_cm
# for get_heritability workflow.
# Note: all LD-derived inputs (baseline LD scores, target LD scores, regression weights,
# allele frequencies) must be computed against the same reference panel as `genome_ref_file`.
# Do not mix files derived from different reference panels (e.g., 1000G vs ADSP).
parameter: all_traits_file = path() # txt file, each row contains all GWAS summary statistics name: e.g. CAD_META.filtered.sumstats.gz
parameter: sumstat_dir = path() # Directory containing GWAS summary statistics
parameter: target_anno_dir = path() # Directory containing target annotation files: output of ldscore
parameter: baseline_ld_dir = path() # Directory containing baseline LD score files (computed against our panel)
parameter: frqfile_dir = path() # Directory containing allele frequency files (.frq, our panel)
parameter: plink_name = "ADSP_chr"
parameter: weights_dir = path() # Directory containing LD weights (computed against our panel)
parameter: baseline_name = "baseline_chr" # Prefix of baseline annotation files
parameter: weight_name = "weights_chr" # Prefix of LD weights files
parameter: n_blocks = 200
# Number of threads
parameter: numThreads = 16
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
Make Annotation File#
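This step defaults to a cM-based LD window (`--ld-wind-cm`), which requires genetic-map positions in the plink .bim files (see the CM note in the step header below). A minimal pre-flight check, assuming standard six-column .bim files; `bim_has_cm` is our helper name, not part of polyfun:

```python
# Pre-flight check: does the plink .bim carry nonzero genetic-map (cM)
# positions? If not, --ld-wind-cm (the default) is meaningless and
# --ld-wind-kb should be used instead. bim_has_cm is our own helper.
import csv
import os
import tempfile

def bim_has_cm(bim_path):
    """Return True if any variant in the .bim has a nonzero cM position."""
    with open(bim_path) as f:
        for row in csv.reader(f, delimiter="\t"):
            # standard .bim columns: CHR, SNP, CM, BP, A1, A2
            if float(row[2]) != 0.0:
                return True
    return False

# Demo on a toy .bim whose cM column is zeroed out (hypothetical data):
with tempfile.NamedTemporaryFile("w", suffix=".bim", delete=False) as f:
    f.write("22\trs1\t0\t16050075\tA\tG\n22\trs2\t0\t16050115\tG\tA\n")
    toy_bim = f.name
print(bim_has_cm(toy_bim))  # False -> switch to --ld-wind-kb
os.unlink(toy_bim)
```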
[make_annotation_files_ldscore]
# Annotation preparation. Takes one annotation_file with N target annotations
# and produces, in one invocation, any combination of:
# - N single-target LD-score directories (when compute_single = TRUE, default)
# - 1 joint LD-score directory containing all N (when compute_joint = TRUE
# and N >= 2, default)
#
# Outputs per chromosome <chr>:
# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.annot.gz (i in 1..N, when compute_single)
# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.ldscore.{parquet|gz}
# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.M
# <cwd>/<annotation_name>_single_<i>/<annotation_name>_single_<i>.<chr>.l2.M_5_50 (when .frq present)
#
# <cwd>/<annotation_name>_joint/<annotation_name>_joint.<chr>.{...} (when compute_joint and N>=2)
#
# Workflows:
# - Workflow A ("all at once"): compute_single=TRUE, compute_joint=TRUE (defaults).
# Produces both, fits the case where you have already chosen the joint set.
# - Workflow B ("exploratory then conditional"):
# Step 1: compute_single=TRUE, compute_joint=FALSE.
# Run on N candidate annotations -> N single-target dirs.
# Inspect single-target results, identify K significant ones.
# Step 2: compute_single=FALSE, compute_joint=TRUE.
# Run on a NEW annotation_file with the K selected annotations
# -> 1 joint dir with the conditional model.
#
# --- snplist (--snp_list) vs no-snplist: which polyfun script, output format,
# column name, and the CM requirement ---
# --snp_list given -> ldsc.py --l2 --print-snps -> output .l2.ldscore.gz
# --snp_list absent -> compute_ldscores.py -> output .l2.ldscore.parquet
#
# LD-score column name (this is what becomes the .results "Category" in
# [get_heritability], with a "_<ref-ld-index>" suffix appended there):
# * compute_ldscores.py ALWAYS keeps the annot column name(s):
# single annot column "ANNOT" -> ldscore column "ANNOT"
# joint annot columns "ANNOT_1","ANNOT_2",... -> "ANNOT_1","ANNOT_2",...
# * ldsc.py --l2 has a quirk: with EXACTLY ONE annotation (n_annot == 1) it
# HARD-CODES the ldscore column name to "L2" and DROPS the annot's original
# column name. With >=2 annotations it uses "<annot_name>L2"
# ("ANNOT_1L2","ANNOT_2L2",...).
# => a single-target snplist run reports "L2_0" in .results, while a
# single-target no-snplist run reports "ANNOT_0". [postprocess] auto-
# detects either; only matters if you pass --target-categories explicitly.
#
# CM column requirement for snplist: ldsc.py --l2 --print-snps requires the
# target annot to (a) carry a "CM" (centimorgan) column and (b) line up with
# the plink .bim (same SNP set, same row order). This step handles both
# internally (normalize_for_ldsc: takes CM from the .bim 4th column, re-expands
# the annot onto the .bim rows, filling 0). Therefore the plink .bim files MUST
# carry genetic-map (cM) positions when using --ld-wind-cm (the default);
# if your .bim has 0 in the cM column, switch to --ld-wind-kb instead.
#
parameter: compute_single = True
parameter: compute_joint = True
parameter: score_column = 3 # 1-based column of the target annotation holding the score
parameter: is_range = False # True if the target annotation is chr/start/end ranges rather than per-position
import pandas as pd
import os
if not (compute_single or compute_joint):
raise ValueError("[make_annotation_files_ldscore] at least one of compute_single or compute_joint must be TRUE")
def adapt_file_path(file_path, reference_file):
reference_path = os.path.dirname(reference_file)
if os.path.isfile(file_path):
return file_path
file_name = os.path.basename(file_path)
if os.path.isfile(file_name):
return file_name
file_in_ref_dir = os.path.join(reference_path, file_name)
if os.path.isfile(file_in_ref_dir):
return file_in_ref_dir
file_prefixed = os.path.join(reference_path, file_path)
if os.path.isfile(file_prefixed):
return file_prefixed
raise FileNotFoundError(f"No valid path found for file: {file_path}")
# ---- Parse inputs and determine N ----
if (str(annotation_file).endswith(('rds', 'tsv', 'txt', 'tsv.gz', 'txt.gz')) and
str(reference_anno_file).endswith('annot.gz')):
# Case 1: direct file paths (single-chromosome run). Multiple target files separated by ','.
target_files_direct = str(annotation_file).split(',')
N_targets = len(target_files_direct)
target_names = [f"target_{i+1}" for i in range(N_targets)]
input_files = [[*target_files_direct, str(reference_anno_file), str(genome_ref_file)]]
if len(chromosome) > 0:
input_chroms = [int(x) for x in chromosome]
else:
input_chroms = [0]
else:
# Case 2: txt list with #id and one or more 'path' columns
target_files_df = pd.read_csv(annotation_file, sep="\t")
reference_files = pd.read_csv(reference_anno_file, sep="\t")
genome_ref_files = pd.read_csv(genome_ref_file, sep="\t")
target_files_df["#id"] = [x.replace("chr", "") for x in target_files_df["#id"].astype(str)]
reference_files["#id"] = [x.replace("chr", "") for x in reference_files["#id"].astype(str)]
genome_ref_files["#id"] = [x.replace("chr", "") for x in genome_ref_files["#id"].astype(str)]
path_columns = [c for c in target_files_df.columns if c.startswith('path')]
N_targets = len(path_columns)
target_names = path_columns[:] # 'path', 'path1', 'path2', ...
for col in path_columns:
target_files_df[col] = target_files_df[col].apply(lambda x: adapt_file_path(x, str(annotation_file)))
reference_files["path"] = reference_files["path"].apply(lambda x: adapt_file_path(x, str(reference_anno_file)))
genome_ref_files["path"] = genome_ref_files["path"].apply(lambda x: adapt_file_path(x, str(genome_ref_file)))
merged = target_files_df.merge(reference_files, on="#id").merge(genome_ref_files, on="#id")
if len(chromosome) > 0:
merged = merged[merged["#id"].isin([str(c) for c in chromosome])]
rows = merged.values.tolist()
input_chroms = [r[0] for r in rows]
input_files = [[*r[1:N_targets+1], r[-2], r[-1]] for r in rows]
# ---- Determine output format ----
use_print_snps = snp_list.is_file()
ldscore_ext = "l2.ldscore.gz" if use_print_snps else "l2.ldscore.parquet"
if ld_wind_kb > 0:
use_kb_window = True
ld_window_param = ld_wind_kb
ld_window_flag = "--ld-wind-kb"
else:
use_kb_window = False
ld_window_param = ld_wind_cm
ld_window_flag = "--ld-wind-cm"
emit_single = compute_single
emit_joint = compute_joint and N_targets >= 2
# ---- Build per-chromosome output list ----
def chrom_outputs(chrom):
outs = []
if emit_single:
for i in range(N_targets):
name = f"{annotation_name}_single_{i+1}"
prefix = f"{cwd:a}/{name}/{name}.{chrom}"
outs += [f"{prefix}.annot.gz", f"{prefix}.{ldscore_ext}", f"{prefix}.l2.M"]
if emit_joint:
name = f"{annotation_name}_joint"
prefix = f"{cwd:a}/{name}/{name}.{chrom}"
outs += [f"{prefix}.annot.gz", f"{prefix}.{ldscore_ext}", f"{prefix}.l2.M"]
return outs
input: input_files, group_by = N_targets + 2, group_with = "input_chroms"
output: chrom_outputs(input_chroms[_index])
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'
# ----------------------------------------------------------------------------
# Step A: write the requested .annot files for this chromosome.
# ----------------------------------------------------------------------------
R: expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
library(data.table)
clean_chr <- function(x) as.numeric(gsub("^chr", "", x))
process_range_data <- function(data, chr_value) {
data$chr <- clean_chr(data$chr)
data <- data[data$chr == chr_value,]
if (nrow(data) == 0) return(NULL)
expanded <- lapply(seq_len(nrow(data)), function(j) {
row <- data[j,]
pos_seq <- seq(row$start, row$end - 1)
result <- data.frame(chr = rep(row$chr, length(pos_seq)), pos = pos_seq)
if (ncol(data) > 3) {
for (col in 4:ncol(data))
result[[names(data)[col]]] <- rep(row[[col]], length(pos_seq))
}
result
})
unique(rbindlist(expanded))
}
process_annotation <- function(target_anno, ref_anno, score_column_value) {
target_anno <- as.data.frame(target_anno)
ref_anno <- as.data.frame(ref_anno)
target_anno$chr <- clean_chr(target_anno$chr)
ref_anno$CHR <- clean_chr(ref_anno$CHR)
chr_value <- unique(ref_anno$CHR)
anno_scores <- rep(0, nrow(ref_anno))
match_pos <- match(target_anno$pos, ref_anno$BP)
valid_pos <- as.numeric(na.omit(match_pos))
if (score_column_value <= ncol(target_anno)) {
anno_scores[valid_pos] <- target_anno[[score_column_value]][!is.na(match_pos)]
} else {
anno_scores[valid_pos] <- 1
print("Warning: score column does not exist; setting scores to 1")
}
anno_scores
}
read_target_anno <- function(file_path, ref_anno) {
if (endsWith(file_path, "rds")) {
target_anno <- readRDS(file_path)
return(process_annotation(target_anno, ref_anno, ${score_column}))
}
target_anno <- fread(file_path)
if (${"TRUE" if is_range else "FALSE"}) {
names(target_anno)[1:3] <- c("chr", "start", "end")
target_anno <- process_range_data(target_anno, unique(ref_anno$CHR))
if (is.null(target_anno)) return(rep(0, nrow(ref_anno)))
} else {
names(target_anno)[1:2] <- c("chr", "pos")
}
process_annotation(target_anno, ref_anno, ${score_column})
}
# ---- Read reference annotation ----
ref_anno <- as.data.frame(fread(${_input[-2]:ar}))
if ("ANNOT" %in% colnames(ref_anno)) ref_anno <- ref_anno[, -which(colnames(ref_anno) == "ANNOT")]
# ---- Compute per-target annotation scores ----
target_files <- c(${",".join('"%s"' % str(p.absolute()) for p in _input[:-2])})
N_local <- length(target_files)
score_list <- lapply(target_files, read_target_anno, ref_anno = ref_anno)
emit_single_local <- ${"TRUE" if emit_single else "FALSE"}
emit_joint_local <- ${"TRUE" if emit_joint else "FALSE"}
use_print_snps_local <- ${"TRUE" if use_print_snps else "FALSE"}
bfile_prefix <- "${_input[-1]:na}"
# Reshape annot to match .bim panel for ldsc.py --l2 --print-snps
# (drop A1/A2/MAF, expand to .bim rows filling 0, take CM from .bim).
normalize_for_ldsc <- function(df) {
if (!use_print_snps_local) return(df)
df <- df[, !names(df) %in% c("A1", "A2", "MAF", "CM"), drop = FALSE]
annot_cols <- setdiff(names(df), c("CHR", "BP", "SNP"))
bim <- as.data.frame(fread(paste0(bfile_prefix, ".bim"), header = FALSE,
col.names = c("CHR", "SNP", "CM", "BP", "A1", "A2")))
bim$CHR <- as.character(bim$CHR); df$CHR <- as.character(df$CHR)
idx <- match(bim$SNP, df$SNP)
out <- data.frame(CHR = bim$CHR, BP = bim$BP, SNP = bim$SNP, CM = bim$CM,
stringsAsFactors = FALSE)
for (col in annot_cols) {
v <- rep(0, nrow(bim))
non_na <- !is.na(idx)
v[non_na] <- df[[col]][idx[non_na]]
out[[col]] <- v
}
out
}
# ---- Write N single-target .annot files (when requested) ----
if (emit_single_local) {
for (i in seq_len(N_local)) {
out_anno <- ref_anno
out_anno$ANNOT <- score_list[[i]]
out_anno <- normalize_for_ldsc(out_anno)
name <- paste0("${annotation_name}", "_single_", i)
out_path_gz <- file.path("${cwd:a}", name, paste0(name, ".${input_chroms[_index]}.annot.gz"))
out_path_tsv <- sub("\\.gz$", "", out_path_gz)
dir.create(dirname(out_path_gz), showWarnings = FALSE, recursive = TRUE)
fwrite(out_anno, out_path_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")
}
}
# ---- Optionally write joint .annot ----
if (emit_joint_local) {
joint_anno <- ref_anno
for (i in seq_len(N_local)) {
joint_anno[[paste0("ANNOT_", i)]] <- score_list[[i]]
}
joint_anno <- normalize_for_ldsc(joint_anno)
joint_name <- paste0("${annotation_name}", "_joint")
joint_out_gz <- file.path("${cwd:a}", joint_name, paste0(joint_name, ".${input_chroms[_index]}.annot.gz"))
joint_out_tsv <- sub("\\.gz$", "", joint_out_gz)
dir.create(dirname(joint_out_gz), showWarnings = FALSE, recursive = TRUE)
fwrite(joint_anno, joint_out_tsv, quote = FALSE, col.names = TRUE, row.names = FALSE, sep = "\t")
}
# ----------------------------------------------------------------------------
# Step B: gzip all annot files. Uses expand="$[ ]" so bash ${var} survives.
# ----------------------------------------------------------------------------
bash: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
set -e
annots=()
if [ "$[str(emit_single)]" = "True" ]; then
for i in $(seq 1 $[N_targets]); do
annots+=("$[cwd:a]/$[annotation_name]_single_$i/$[annotation_name]_single_$i.$[input_chroms[_index]].annot")
done
fi
if [ "$[str(emit_joint)]" = "True" ]; then
annots+=("$[cwd:a]/$[annotation_name]_joint/$[annotation_name]_joint.$[input_chroms[_index]].annot")
fi
for a in "${annots[@]}"; do
gzip -f "$a"
done
# ----------------------------------------------------------------------------
# Step C: run polyfun's LD-score computation for each emitted annotation file.
# ----------------------------------------------------------------------------
bash: expand = "$[ ]", stderr = f'{_output[1]}.stderr', stdout = f'{_output[1]}.stdout'
set -e
chrom="$[input_chroms[_index]]"
run_polyfun() {
local annot="$1"
local out_prefix="$2"
if [ "$[str(use_print_snps)]" = "True" ]; then
$[python_exec] $[polyfun_path]/ldsc.py \
--print-snps $[snp_list] \
$[ld_window_flag] $[ld_window_param] \
--out "$out_prefix" \
--bfile $[_input[-1]:nar] \
--yes-really \
--annot "$annot" \
--l2
else
$[python_exec] $[polyfun_path]/compute_ldscores.py \
--annot "$annot" \
--bfile $[_input[-1]:nar] \
$[ld_window_flag] $[ld_window_param] \
--out "${out_prefix}.$[ldscore_ext]" \
--allow-missing
fi
}
if [ "$[str(emit_single)]" = "True" ]; then
for i in $(seq 1 $[N_targets]); do
name="$[annotation_name]_single_$i"
annot="$[cwd:a]/$name/$name.$chrom.annot.gz"
prefix="$[cwd:a]/$name/$name.$chrom"
run_polyfun "$annot" "$prefix"
done
fi
if [ "$[str(emit_joint)]" = "True" ]; then
name="$[annotation_name]_joint"
annot="$[cwd:a]/$name/$name.$chrom.annot.gz"
prefix="$[cwd:a]/$name/$name.$chrom"
run_polyfun "$annot" "$prefix"
fi
# ----------------------------------------------------------------------------
# Step D: write .l2.M and .l2.M_5_50 files for each emitted annotation directory.
# ----------------------------------------------------------------------------
R: expand = "${ }", stderr = f'{_output[2]}.stderr', stdout = f'{_output[2]}.stdout'
suppressPackageStartupMessages({ library(data.table); library(dplyr) })
use_print_snps <- ${str(use_print_snps).upper()}
chrom <- "${input_chroms[_index]}"
# Look up .frq file under frqfile_dir, using plink_name + chrom (matches cell 25).
frq_file <- file.path("${frqfile_dir}", paste0("${plink_name}", chrom, ".frq"))
has_frq <- file.exists(frq_file)
frq_dt <- if (has_frq) fread(frq_file)[, .(SNP, MAF)] else NULL
write_M_files <- function(annot_path, ldscore_path, m_path) {
if (use_print_snps && file.exists(m_path) && file.exists(paste0(m_path, "_5_50"))) {
cat("M files already exist for", m_path, "\n"); return(invisible())
}
ldscore_dt <- if (endsWith(ldscore_path, ".parquet")) {
suppressPackageStartupMessages(library(arrow)); arrow::read_parquet(ldscore_path)
} else fread(ldscore_path)
annot_dt <- fread(annot_path)
annot_filtered <- annot_dt[annot_dt$SNP %in% ldscore_dt$SNP, ]
merged <- if (has_frq) merge(annot_filtered, frq_dt, by = "SNP", all.x = TRUE) else annot_filtered
std_cols <- c("CHR", "SNP", "BP", "CM", "A1", "A2", if (has_frq) "MAF")
annot_cols <- setdiff(names(merged), std_cols)
if (length(annot_cols) == 0L) { merged[, ANNOT := 1L]; annot_cols <- "ANNOT" }
M <- merged[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]
writeLines(paste(as.numeric(M), collapse = " "), m_path)
if (has_frq) {
common <- merged[!is.na(MAF) & MAF > 0.05, ]
M5 <- common[, lapply(.SD, sum, na.rm = TRUE), .SDcols = annot_cols]
writeLines(paste(as.numeric(M5), collapse = " "), paste0(m_path, "_5_50"))
}
}
targets <- c()
if (${"TRUE" if emit_single else "FALSE"}) {
for (i in seq_len(${N_targets})) {
targets <- c(targets, paste0("${annotation_name}", "_single_", i))
}
}
if (${"TRUE" if emit_joint else "FALSE"}) {
targets <- c(targets, paste0("${annotation_name}", "_joint"))
}
for (name in targets) {
annot_path <- file.path("${cwd:a}", name, paste0(name, ".", chrom, ".annot.gz"))
ldscore_path <- file.path("${cwd:a}", name, paste0(name, ".", chrom, ".${ldscore_ext}"))
m_path <- file.path("${cwd:a}", name, paste0(name, ".", chrom, ".l2.M"))
write_M_files(annot_path, ldscore_path, m_path)
}
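The Category-naming rules spelled out in the step header (ldsc.py's "n_annot == 1 -> 'L2'" quirk versus compute_ldscores.py keeping the annot column names) condense into a small pure function. This restates the documented rule; the function itself is ours:

```python
# Predict the target "Category" names that [get_heritability] will report
# in .results, given the annot column names and which polyfun script
# produced the LD scores. The "_0" suffix is the target's ref-ld index.
def expected_categories(annot_cols, used_snplist):
    if used_snplist:
        # ldsc.py --l2: exactly one annot -> hard-coded "L2";
        # two or more -> "<annot_name>L2" each.
        ld_cols = ["L2"] if len(annot_cols) == 1 else [c + "L2" for c in annot_cols]
    else:
        # compute_ldscores.py always keeps the annot column names.
        ld_cols = list(annot_cols)
    return [c + "_0" for c in ld_cols]

print(expected_categories(["ANNOT"], used_snplist=True))    # ['L2_0']
print(expected_categories(["ANNOT"], used_snplist=False))   # ['ANNOT_0']
print(expected_categories(["ANNOT_1", "ANNOT_2"], used_snplist=True))
# ['ANNOT_1L2_0', 'ANNOT_2L2_0']
```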
Munge Summary Statistics#
# The script munge_polyfun_sumstats.py automatically detects whether signed statistics
# (Z, BETA/SE, etc.) are present and computes Z-scores if needed.
[munge_sumstats_polyfun]
parameter: sumstats = path
parameter: n = 0 # sample size; passed to the script as --n only when > 0
parameter: min_info = 0.6
parameter: min_maf = 0.001
parameter: keep_hla = False
parameter: chi2_cut = 30
input: sumstats
output: f"{_input:n}.munged.parquet"
bash: expand=True, stderr=f'{_output:nn}.stderr', stdout=f'{_output:nn}.stdout'
{python_exec} {polyfun_path}/munge_polyfun_sumstats.py \
--sumstats {_input} \
--out {_output} \
{'--n {}'.format(n) if n>0 else ''} \
{'--min-info {}'.format(min_info)} \
{'--min-maf {}'.format(min_maf)} \
{'--chi2-cutoff {}'.format(chi2_cut)} \
{'--keep-hla' if keep_hla else ''} \
--remove-strand-ambig
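Conceptually, the signed-statistic handling reduces to: if no Z column is present, derive Z = BETA/SE, then drop rows whose chi-square exceeds the cutoff. A toy sketch of just that core step (the real script also applies the INFO/MAF filters, allele QC, and strand-ambiguity removal above; `add_z_and_filter` is our name, not polyfun's):

```python
# Minimal sketch: derive Z from BETA/SE when absent, then apply the
# chi^2 cutoff (chi^2 = Z^2). Toy data only; not real summary statistics.
import pandas as pd

def add_z_and_filter(df, chi2_cutoff=30.0):
    df = df.copy()
    if "Z" not in df.columns:
        df["Z"] = df["BETA"] / df["SE"]
    return df[df["Z"] ** 2 <= chi2_cutoff]

toy = pd.DataFrame({"SNP": ["rs1", "rs2"], "BETA": [0.02, 0.9], "SE": [0.01, 0.1]})
# rs1: Z = 2, chi^2 = 4 (kept); rs2: Z = 9, chi^2 = 81 (dropped)
print(add_z_and_filter(toy)["SNP"].tolist())  # ['rs1']
```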
Calculate Functional Enrichment using Annotations#
[get_heritability]
# Per-trait sLDSC regression via polyfun. Fans out across target_anno_dirs:
# each (trait, target_dir) pair becomes one polyfun invocation. Outputs go to
# <cwd>/<basename(target_dir)>/<trait>.{results,log,part_delete}.
#
# `target_anno_dirs` is the list produced by [make_annotation_files_ldscore]:
# typically N _single_<i> directories plus optionally one _joint directory.
#
# --- about the ".results" Category column and the "_0 / _1" suffix ---
# Each (trait, target_dir) pair is ONE polyfun call; its `ldsc.py --ref-ld-chr`
# always gets exactly two LD-score sources, in this order:
# "<target_dir>/<target>." (index 0) , "<baseline_dir>/<baseline>" (index 1)
# With --overlap-annot, every annotation column in the .results "Category" is
# named <ldscore_column_name>_<ref-ld-index>:
# index 0 = the target file -> "ANNOT_0" (no-snplist; compute_ldscores.py keeps the annot col name)
# -> "L2_0" (snplist + single annot; ldsc.py hard-codes "L2", see below)
# -> "ANNOT_1_0","ANNOT_2_0" (no-snplist joint dir, N>=2 annot cols)
# -> "ANNOT_1L2_0","ANNOT_2L2_0" (snplist joint dir, N>=2 -> "<name>L2")
# index 1 = the baseline file -> "base_1","Coding_UCSC_1", ... (the 97 baseline annots)
# So in this pipeline the suffix is only ever 0 (target) or 1 (baseline); it would
# continue 0,1,2,... only if you handed `ldsc.py --ref-ld-chr` more than two sources.
# (Why ANNOT_0 vs L2_0: see the [make_annotation_files_ldscore] header — ldsc.py's
# "n_annot == 1 -> column name 'L2'" quirk vs compute_ldscores.py keeping the annot
# column name.) [postprocess] auto-detects the target Category; if you instead pass
# --target-categories, the names must match this column exactly.
#
parameter: target_anno_dirs = paths()
parameter: all_traits = []
import os
with open(all_traits_file, 'r') as f:
trait_paths = [os.path.join(sumstat_dir, line.strip()) for line in f if line.strip()]
# Build (trait, target_dir) Cartesian product as parallel flat lists.
input_list = []
target_meta = []
for td in target_anno_dirs:
for t in trait_paths:
input_list.append(t)
target_meta.append(str(td))
input: input_list, group_by = 1, group_with = "target_meta"
output: f"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.log", \
f"{cwd:a}/{os.path.basename(target_meta[_index])}/{os.path.basename(_input[0])}.results"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "${ }"
target_dir="${target_meta[_index]}"
target_name="$(basename ${target_meta[_index]})"
trait="$(basename ${_input[0]})"
output_dir="${cwd:a}/$target_name"
mkdir -p "$output_dir"
# MAF cutoff handling. Only 0 (disabled) or 0.05 (sLDSC default) are supported;
# other values would require recomputing LD scores at that cutoff.
frq_file_check="${frqfile_dir}/${plink_name}22.frq"
if [ "${maf_cutoff}" = "0" ] || [ "${maf_cutoff}" = "0.0" ]; then
echo "maf_cutoff = 0: skipping MAF filter (--not-M-5-50)"
frq_option="--not-M-5-50"
elif [ "${maf_cutoff}" = "0.05" ]; then
if [ -f "$frq_file_check" ]; then
echo "maf_cutoff = 0.05: using --frqfile-chr (MAF > 5%)"
frq_option="--frqfile-chr ${frqfile_dir}/${plink_name}"
else
echo "ERROR: maf_cutoff=0.05 requires .frq files for the reference panel,"
echo " but none found at ${frqfile_dir}/${plink_name}*.frq."
echo " Provide .frq files in frqfile_dir, or set maf_cutoff=0 (NOT recommended)."
exit 1
fi
else
echo "ERROR: maf_cutoff=${maf_cutoff} is not supported. Only 0 (no filter) or"
echo " 0.05 (sLDSC default) are accepted. Other values would require"
echo " recomputing LD scores at that cutoff."
exit 1
fi
run_ldsc() {
local extra_args="$1"
${python_exec} ${polyfun_path}/ldsc.py \
--h2 ${sumstat_dir}/$trait \
--ref-ld-chr "$target_dir/$target_name.","${baseline_ld_dir}/${baseline_name}" \
--out "$output_dir/$trait" \
--overlap-annot \
--w-ld-chr ${weights_dir}/${weight_name} \
$frq_option \
--print-coefficients \
--print-delete-vals \
--n-blocks ${n_blocks} \
$extra_args
}
run_ldsc ""
log_file="$output_dir/$trait.log"
# FloatingPointError retry ladder (preserved from original): 30 -> 20 -> 10
for max in 30 20 10; do
if [ -f "$log_file" ] && grep -q "FloatingPointError\|invalid value encountered in sqrt" "$log_file"; then
echo "FloatingPointError detected, retrying with --chisq-max $max..."
run_ldsc "--chisq-max $max"
else
break
fi
done
if [ -f "$log_file" ] && grep -q "FloatingPointError\|invalid value encountered in sqrt" "$log_file"; then
echo "ERROR: FloatingPointError persists for trait $trait at target $target_name even with --chisq-max 10"
echo "The summary statistics for this trait may be severely numerically unstable."
fi
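Given the Category suffix convention described in the header (target = ref-ld index 0, baseline = index 1), pulling the target rows out of a .results table is a one-liner; the toy table below is illustrative, not real polyfun output:

```python
# Extract target annotation rows from a polyfun .results table: the
# target's Category always ends in "_0", baseline annotations in "_1".
import pandas as pd

def target_rows(results):
    return results[results["Category"].str.endswith("_0")]

# Hypothetical .results fragment:
toy = pd.DataFrame({
    "Category": ["ANNOT_0", "base_1", "Coding_UCSC_1"],
    "Enrichment": [3.2, 1.0, 1.8],
    "Enrichment_p": [1e-4, 0.9, 0.02],
})
print(target_rows(toy)[["Category", "Enrichment"]])
```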
[postprocess]
# Post-processing of polyfun outputs via pecotmr::sldsc_postprocessing_pipeline.
# Reads .results / .log / .part_delete for all traits in `traits_file`, both
# single-target and (when present) joint-target runs, computes Gazal-style
# tau*, EnrichStat with back-solved jackknife SE, and runs the default
# DerSimonian-Laird random-effects meta across all supplied traits. Writes
# one RDS containing per-trait tables and three meta tables (tau*, E, EnrichStat).
parameter: traits_file = path() # text file: one trait sumstats filename per line
parameter: heritability_cwd = path() # parent directory of [get_heritability] outputs (contains <annotation_name>_single_<i>/ subdirs and optionally <annotation_name>_joint/)
parameter: target_categories = [] # target annotation names. Auto-detected from the joint-run results if empty.
parameter: target_categories_label = [] # optional display names, same order as target_categories;
# when given, every "target" column / tau*-block colname in
# the output RDS is renamed to these (params$target_categories
# holds the labels, params$target_categories_orig the originals).
parameter: target_anno_dir = path() # directory of target .annot.gz files used for sd_C and binary detection (typically the joint dir, since it carries all target columns)
input: traits_file
output: f"{cwd:a}/{annotation_name}.sldsc_postprocess.rds"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
library(pecotmr)
traits <- readLines("${traits_file}")
target_cats <- c(${",".join('"%s"' % c for c in target_categories)})
target_lab <- c(${",".join('"%s"' % c for c in target_categories_label)})
# Auto-detect single-target and joint-target output directories.
her_root <- "${heritability_cwd}"
all_subdirs <- list.dirs(her_root, recursive = FALSE)
single_pattern <- paste0("^", "${annotation_name}", "_single_([0-9]+)$")
joint_name <- paste0("${annotation_name}", "_joint")
single_dirs <- all_subdirs[grepl(single_pattern, basename(all_subdirs))]
single_indices <- as.integer(sub(single_pattern, "\\1", basename(single_dirs)))
single_dirs <- single_dirs[order(single_indices)]
joint_dir <- file.path(her_root, joint_name)
has_joint <- dir.exists(joint_dir)
message(sprintf("Detected %d single-target dirs%s",
length(single_dirs),
if (has_joint) "; joint-target dir present" else "; no joint-target dir"))
# Build per-trait prefix maps. Each trait's polyfun output is at <dir>/<trait>
# (polyfun appends .results / .log / .part_delete).
trait_single_prefixes <- lapply(traits, function(t) file.path(single_dirs, t))
names(trait_single_prefixes) <- traits
if (has_joint) {
trait_joint_prefix <- setNames(file.path(joint_dir, traits), traits)
} else {
trait_joint_prefix <- setNames(rep(NA_character_, length(traits)), traits)
}
res <- sldsc_postprocessing_pipeline(
trait_single_prefixes = trait_single_prefixes,
trait_joint_prefix = trait_joint_prefix,
target_anno_dir = "${target_anno_dir}",
frqfile_dir = "${frqfile_dir}",
plink_name = "${plink_name}",
maf_cutoff = ${maf_cutoff},
target_categories = if (length(target_cats) > 0) target_cats else NULL,
target_labels = if (length(target_lab) > 0) target_lab else NULL
)
saveRDS(res, "${_output[0]}")
message("S-LDSC post-processing complete; results written to ${_output[0]}")
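The cross-trait meta here is DerSimonian-Laird random effects. A textbook numpy sketch of that estimator, for reference (pecotmr's `meta_sldsc_random` may differ in detail, e.g. in how per-trait SEs are standardized first):

```python
# Textbook DerSimonian-Laird random-effects meta-analysis:
# estimate between-trait variance tau^2 from Cochran's Q, then
# re-weight by 1/(v_i + tau^2). Toy inputs; not pipeline output.
import numpy as np

def dl_meta(estimates, ses):
    y = np.asarray(estimates, dtype=float)
    v = np.asarray(ses, dtype=float) ** 2
    w = 1.0 / v                                   # fixed-effect weights
    q = np.sum(w * (y - np.sum(w * y) / np.sum(w)) ** 2)  # Cochran's Q
    k = len(y)
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))
    w_star = 1.0 / (v + tau2)                     # random-effects weights
    est = np.sum(w_star * y) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    return est, se, tau2

est, se, tau2 = dl_meta([0.5, 0.7, 0.4], [0.1, 0.1, 0.1])
print(round(est, 3), round(se, 3))  # 0.533 0.088
```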
[meta_subset]
# Optional: re-run random-effects meta on a user-defined subset of traits, using
# the cached per-trait standardized results from [postprocess]. No regression rerun.
parameter: postprocess_rds = path() # output of [postprocess]
parameter: subset_traits_file = path() # text file: one trait id per line, subset of those passed to [postprocess]
parameter: subset_name = str # label used in the output filename
parameter: target_categories = [] # target annotation names to meta on; if empty, uses all from postprocess output
# If [postprocess] was run with --target-categories-label, the cached RDS already
# carries the display names (params$target_categories = the labels), so leave
# --target-categories empty here (or pass the labels, not the original ANNOT_* names).
input: postprocess_rds, subset_traits_file
output: f"{cwd:a}/{annotation_name}.{subset_name}.meta.rds"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand = "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
library(pecotmr)
res <- readRDS("${postprocess_rds}")
subset_traits <- readLines("${subset_traits_file}")
target_cats <- c(${",".join([f'"{c}"' for c in target_categories])})
if (length(target_cats) == 0) target_cats <- res$params$target_categories
subset_per_trait <- res$per_trait[subset_traits]
# Map wide names (tau_star_single/joint) to bare names meta_sldsc_random expects.
view_single <- pecotmr:::.sldsc_view_for_meta(subset_per_trait, "single")
view_joint <- pecotmr:::.sldsc_view_for_meta(subset_per_trait, "joint")
out <- list(
tau_star_single = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, "tau_star")), target_cats),
tau_star_joint = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_joint, c, "tau_star")), target_cats),
enrichment = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, "enrichment")), target_cats),
enrichstat = setNames(lapply(target_cats, function(c) meta_sldsc_random(view_single, c, "enrichstat")), target_cats)
)
saveRDS(out, "${_output[0]}")
message("Subset meta complete; results written to ${_output[0]}")