Summary Level Data Colocalization
Source: vignettes/Summary_Level_Colocalization.Rmd
This vignette demonstrates how to perform multi-trait colocalization
analysis using summary statistics data in ColocBoost, specifically
focusing on the Sumstat_5traits dataset included in the
package.
1. The Sumstat_5traits Dataset
The Sumstat_5traits dataset contains simulated summary
statistics for 5 traits, derived directly from the
Ind_5traits dataset using marginal association. The dataset
is designed for evaluating and demonstrating the
capabilities of ColocBoost in multi-trait colocalization analysis
with summary association data.
- sumstat: A list of data.frames of summary statistics for different traits.
- true_effect_variants: True effect variant indices for each trait.
- Note that LD could be calculated from the X data in the Ind_5traits dataset, but it is not included in the Sumstat_5traits dataset.
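To illustrate the note about LD, here is a minimal sketch that computes a correlation-based LD matrix from a genotype matrix. It uses simulated toy genotypes and base R `cor()` as a stand-in; the variant names are hypothetical, and the package's own helper may differ in details such as handling of missing values.

```r
# Toy genotype matrix: 200 individuals x 5 variants, coded 0/1/2
# (simulated for illustration; not the Ind_5traits data)
set.seed(1)
X_toy <- matrix(rbinom(200 * 5, size = 2, prob = 0.3), nrow = 200, ncol = 5)
colnames(X_toy) <- paste0("snp", 1:5)  # hypothetical variant names

# Pairwise Pearson correlation between variants as a simple LD estimate
LD_toy <- cor(X_toy)

dim(LD_toy)
```

The result is a symmetric variants-by-variants matrix with ones on the diagonal, which is the shape expected for an LD input.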
Causal variant structure
The dataset features two causal variants with indices 644 and 2289.
- Causal variant 644 is associated with traits 1, 2, 3, and 4.
- Causal variant 2289 is associated with traits 2, 3, and 5.
This structure creates a realistic scenario where multiple traits are influenced by different but overlapping sets of genetic variants.
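The causal structure described above can be written down as a small lookup, which makes the overlap between traits explicit. This is only a sketch: the trait names y1 to y5 are assumed for illustration, and the names used in Sumstat_5traits$true_effect_variants may differ.

```r
# Causal variant indices per trait, as described in the text
# (trait names y1..y5 are assumed for illustration)
true_effects <- list(
  y1 = 644,
  y2 = c(644, 2289),
  y3 = c(644, 2289),
  y4 = 644,
  y5 = 2289
)

# Invert the list: which traits share each causal variant?
traits_by_variant <- split(
  rep(names(true_effects), lengths(true_effects)),
  unlist(true_effects)
)
traits_by_variant
```

Variant 644 maps to traits y1 through y4 and variant 2289 maps to y2, y3, and y5, so traits 2 and 3 are influenced by both variants.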
Important data format for summary data
Each summary statistics data.frame must include the following columns:
- z or (beta, sebeta): either the z-score or the (effect size and standard error) pair.
- n: sample size for the summary statistics; providing it is highly recommended.
- variant: required if sumstat for different outcomes do not have the same number of variables (multiple sumstat and multiple LD).
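A minimal sketch of a data.frame in this format, using made-up values. The variant identifiers and numbers are hypothetical and are not taken from Sumstat_5traits.

```r
# Hypothetical toy values for illustration; not taken from Sumstat_5traits
toy_sumstat <- data.frame(
  variant = c("rs1", "rs2", "rs3"),  # made-up variant identifiers
  beta    = c(0.12, -0.05, 0.30),    # effect sizes
  sebeta  = c(0.04, 0.05, 0.06),     # standard errors
  n       = 1000                     # sample size, recycled across rows
)

# A z-score is the effect size divided by its standard error
toy_sumstat$z <- toy_sumstat$beta / toy_sumstat$sebeta

# Check that either z or the (beta, sebeta) pair is present
has_required <- "z" %in% names(toy_sumstat) ||
  all(c("beta", "sebeta") %in% names(toy_sumstat))
has_required
```

Since z can be derived from beta and sebeta, supplying either form carries the same association signal.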
2. Run ColocBoost (Basic usage)
The preferred format for colocalization analysis in ColocBoost using summary statistics data is one LD matrix shared by all traits, with the summary statistics organized in a list. The basic format is:
- sumstat is organized as a list of data.frames for all traits.
- LD is a matrix of linkage disequilibrium (LD) information for all variants across all traits.
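Before running the analysis, it can help to confirm that every trait's summary statistics line up with the shared LD matrix. The sketch below uses toy inputs with hypothetical variant names and assumes variants are identified by the variant column and by the LD column names; the package's own input checks may be more permissive.

```r
# Toy inputs with hypothetical variant names; not the package data
variants <- paste0("v", 1:4)
LD_toy <- diag(4)
dimnames(LD_toy) <- list(variants, variants)

sumstat_toy <- list(
  data.frame(variant = variants, z = c(0.5, 2.1, -1.3, 0.2), n = 500),
  data.frame(variant = variants, z = c(1.0, 3.4, 0.1, -0.7), n = 800)
)

# For each trait, check that its variants match the LD columns in order
aligned <- vapply(
  sumstat_toy,
  function(s) identical(as.character(s$variant), colnames(LD_toy)),
  logical(1)
)
aligned
```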
The colocboost function requires the summary statistics
sumstat from the dataset and an LD matrix LD, computed here from the Ind_5traits genotypes:
# Load the package and the example summary statistics
library(colocboost)
data("Sumstat_5traits")
# Extract genotype (X) and calculate LD matrix
data("Ind_5traits")
LD <- get_cormat(Ind_5traits$X[[1]])
# Run colocboost
res <- colocboost(sumstat = Sumstat_5traits$sumstat, LD = LD)
#> Starting checking the input data.
#> Starting gradient boosting algorithm.
#> Boosting iterations for outcome 4 converge after 34 iterations!
#> Boosting iterations for outcome 5 converge after 43 iterations!
#> Boosting iterations for outcome 1 converge after 46 iterations!
#> Boosting iterations for outcome 2 converge after 67 iterations!
#> Boosting iterations for outcome 3 converge after 68 iterations!
#> Starting assemble analyses and results summary.
# Identified CoS
res$cos_details$cos$cos_index
#> $`cos1:y1_y2_y3_y4`
#> [1] 636 644 618 655
#>
#> $`cos2:y2_y3_y5`
#> [1] 2289 2293
3. Run ColocBoost (Advanced usage)
3.1. Matched LD with multiple sumstat (Trait-specific LD)
When each trait has its own trait-specific LD matrix, you can provide a list of LD matrices matched to the list of summary statistics.
- Basic format: sumstat and LD are organized as lists, matched by trait index:
  - (sumstat[1], LD[1]) contains information for trait 1,
  - (sumstat[2], LD[2]) contains information for trait 2,
  - and so on for each trait under analysis.
- Cross-trait flexibility:
  - There is no requirement for the same variants across different traits. This allows each trait to be analyzed with the variants available for it.
  - This is particularly useful when you have a large dataset with many traits and want to focus on specific variants and trait-specific LD.
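The matched-list layout can be sketched with two toy traits that cover different variants. All names and values here are hypothetical, and identity matrices stand in for real LD; the point is only that position i in one list pairs with position i in the other.

```r
# Helper (hypothetical) that builds a placeholder LD matrix for given variants
make_ld <- function(v) {
  m <- diag(length(v))
  dimnames(m) <- list(v, v)
  m
}

# Two traits measured on different, partially overlapping variant sets
v1 <- c("v1", "v2", "v3")
v2 <- c("v2", "v3", "v4", "v5")
sumstat_toy <- list(
  data.frame(variant = v1, z = c(1.2, -0.4, 2.8), n = 500),
  data.frame(variant = v2, z = c(0.3, 3.1, -1.0, 0.6), n = 800)
)
LD_toy <- list(make_ld(v1), make_ld(v2))

# Element i of sumstat_toy should pair with element i of LD_toy
stopifnot(length(sumstat_toy) == length(LD_toy))
matched <- mapply(
  function(s, ld) setequal(as.character(s$variant), colnames(ld)),
  sumstat_toy, LD_toy
)

# Variants present for both traits
shared <- intersect(v1, v2)
```

Here only v2 and v3 are shared, yet each trait keeps its full variant set together with its own LD.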
# Duplicate the LD matrix so that each summary statistic has a matched LD
LD_multiple <- lapply(seq_along(Sumstat_5traits$sumstat), function(i) LD)
# Run colocboost
res <- colocboost(sumstat = Sumstat_5traits$sumstat, LD = LD_multiple)
#> Starting checking the input data.
#> Starting gradient boosting algorithm.
#> Boosting iterations for outcome 4 converge after 34 iterations!
#> Boosting iterations for outcome 5 converge after 43 iterations!
#> Boosting iterations for outcome 1 converge after 46 iterations!
#> Boosting iterations for outcome 2 converge after 67 iterations!
#> Boosting iterations for outcome 3 converge after 68 iterations!
#> Starting assemble analyses and results summary.
# Identified CoS
res$cos_details$cos$cos_index
#> $`cos1:y1_y2_y3_y4`
#> [1] 636 644 618 655
#>
#> $`cos2:y2_y3_y5`
#> [1] 2289 2293
3.2. LD matrix is a superset of variants across different summary statistics
When the LD matrix includes a superset of the variants across different summary statistics, use the following input format:
- sumstat is a list of data.frames for all traits.
- LD is a matrix of linkage disequilibrium (LD) information for all variants across all traits.
- The LD matrix should contain all variants present in the summary statistics data frames.
- This is particularly useful when you have a large LD matrix from a reference panel and want to use it for multiple summary statistics datasets. It allows for efficient analysis without redundancy.
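The superset requirement can be checked directly. This sketch assumes variants are matched by name between the variant column and the LD row and column names; the reference panel and variant names are hypothetical toy values.

```r
# Toy reference LD covering six variants (identity matrix as a placeholder)
ref_variants <- paste0("v", 1:6)
LD_ref <- diag(6)
dimnames(LD_ref) <- list(ref_variants, ref_variants)

# Each trait reports only a subset of the reference variants
sumstat_toy <- list(
  data.frame(variant = c("v1", "v2", "v5"), z = c(0.8, 2.5, -0.3), n = 500),
  data.frame(variant = c("v2", "v3", "v4", "v6"), z = c(1.1, -2.0, 0.4, 0.9), n = 800)
)

# Every variant in every sumstat should be found in the LD matrix
covered <- vapply(
  sumstat_toy,
  function(s) all(as.character(s$variant) %in% colnames(LD_ref)),
  logical(1)
)

# The LD block relevant to trait 1, extracted by variant name
keep <- as.character(sumstat_toy[[1]]$variant)
LD_trait1 <- LD_ref[keep, keep, drop = FALSE]
dim(LD_trait1)
```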
# Create sumstat with different numbers of variants by removing 100 random variants from each sumstat
LD_superset <- LD
sumstat <- lapply(Sumstat_5traits$sumstat, function(x) x[-sample(1:nrow(x), 100), , drop = FALSE])
# Run colocboost
res <- colocboost(sumstat = sumstat, LD = LD_superset)
#> Starting checking the input data.
#> Starting gradient boosting algorithm.
#> Boosting iterations for outcome 4 converge after 34 iterations!
#> Boosting iterations for outcome 5 converge after 45 iterations!
#> Boosting iterations for outcome 1 converge after 46 iterations!
#> Boosting iterations for outcome 2 converge after 67 iterations!
#> Boosting iterations for outcome 3 converge after 68 iterations!
#> Starting assemble analyses and results summary.
# Identified CoS
res$cos_details$cos$cos_index
#> $`cos1:y1_y2_y3_y4`
#> [1] 636 644 618 655
#>
#> $`cos2:y2_y3_y5`
#> [1] 2289 2293
3.3. Arbitrary LD and sumstat with dictionary provided
When different summary statistics require different LD matrices in an arbitrary arrangement, ColocBoost also provides an interface for supplying multiple LD matrices together with multiple sumstat. This particularly benefits meta-analysis across heterogeneous datasets where the LD for different subsets of summary statistics comes from different populations.
- Input format:
  - sumstat is a list of data.frames for all traits.
  - LD is a list of LD matrices.
  - dict_sumstatLD is a dictionary matrix that maps the index of each sumstat to the index of its LD matrix.
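To make the mapping concrete, here is a small hypothetical dictionary for three traits and two LD matrices, read as "sumstat index in column 1 uses LD index in column 2". The sanity checks are illustrative assumptions and are not necessarily the checks the package performs.

```r
# Hypothetical dictionary: trait 1 uses LD 1; traits 2 and 3 use LD 2
dict_toy <- cbind(1:3, c(1, 2, 2))
n_LD <- 2  # number of LD matrices assumed available

# Group sumstat indices by the LD matrix they point to
traits_by_LD <- split(dict_toy[, 1], dict_toy[, 2])

# Each sumstat should appear once, and every LD index should exist
valid <- !anyDuplicated(dict_toy[, 1]) &&
  all(dict_toy[, 2] %in% seq_len(n_LD))
valid
```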
# Create a simple dictionary for demonstration purposes
# Traits 1 and 2 are matched to the first LD matrix; traits 3, 4, and 5 are matched to the second LD matrix.
LD_arbitrary <- list(LD, LD)
dict_sumstatLD <- cbind(1:5, c(1, 1, 2, 2, 2))
# Display the dictionary
dict_sumstatLD
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    1
#> [3,]    3    2
#> [4,]    4    2
#> [5,]    5    2
# Run colocboost
res <- colocboost(sumstat = Sumstat_5traits$sumstat, LD = LD_arbitrary, dict_sumstatLD = dict_sumstatLD)
#> Starting checking the input data.
#> Starting gradient boosting algorithm.
#> Boosting iterations for outcome 4 converge after 34 iterations!
#> Boosting iterations for outcome 5 converge after 43 iterations!
#> Boosting iterations for outcome 1 converge after 46 iterations!
#> Boosting iterations for outcome 2 converge after 67 iterations!
#> Boosting iterations for outcome 3 converge after 68 iterations!
#> Starting assemble analyses and results summary.
# Identified CoS
res$cos_details$cos$cos_index
#> $`cos1:y1_y2_y3_y4`
#> [1] 636 644 618 655
#>
#> $`cos2:y2_y3_y5`
#> [1] 2289 2293
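As a quick sanity check, the colocalization sets printed above can be compared with the causal variants listed in Section 1. This sketch hard-codes the printed indices instead of reading them from res, so it runs without the package.

```r
# Indices copied from the printed output above
cos_index <- list(
  `cos1:y1_y2_y3_y4` = c(636, 644, 618, 655),
  `cos2:y2_y3_y5`    = c(2289, 2293)
)

# Causal variant indices stated in Section 1
causal <- c(644, 2289)

# Does each set contain at least one causal variant?
hits <- vapply(cos_index, function(idx) any(causal %in% idx), logical(1))

# Which causal variant does each set capture?
captured <- lapply(cos_index, function(idx) intersect(causal, idx))
captured
```

The first set contains variant 644 and the second contains variant 2289, and the traits named in each set match the causal structure described for the dataset.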