Bioinformatics Pipeline for ColocBoost

This vignette demonstrates how to use the bioinformatics pipeline for ColocBoost to perform colocalization analysis with colocboost for multiple individual-level datasets and/or summary statistics datasets.

See more details about functions in the package pecotmr with link and colocboost_pipeline with link.
See more details about input data preparation in xqtl_protocol with link.

Acknowledgment: Thanks to Kate (Kathryn) Lawrence (GitHub:@kal26) for her contributions to this vignette.

1. ColocBoost analysis with basic QC steps

The colocboost_analysis() function from pecotmr runs ColocBoost with optional basic QC before model fitting. It uses conventional ColocBoost inputs, such as X, Y, sumstat, and LD, while adding QC parameters for common data-cleaning steps.

The QC parameters are optional and can be set according to the input data:

missing_rate_thresh removes variants with high genotype missingness.
maf_cutoff removes variants with low minor allele frequency.
xvar_cutoff removes variants with low genotype variance.
ld_reference_meta_file filters individual-level variants against a reference variant list.
pip_cutoff_to_skip_ind skips weak individual-level contexts based on single-effect PIP screening; use 0 to skip this screening.
qc_method controls summary-statistics QC. Use "none" for basic allele/variant harmonization only, or "slalom" / "dentist" for LD-mismatch outlier detection.
keep_indel controls whether insertion/deletion variants are retained during harmonization.
pip_cutoff_to_skip_sumstat skips weak summary-statistics studies based on single-effect PIP screening; use 0 to skip this screening.
impute runs RAISS imputation after QC when set to TRUE.
impute_opts sets RAISS imputation options, including rcond, R2_threshold, minimum_ld, and lamb.
LD_reference_info provides extra reference metadata for QC when LD or X_ref variant names are not sufficient.
variant_convention specifies allele order in ColocBoost-style variant IDs, either "A2_A1" or "A1_A2".

Example:

fit <- colocboost_analysis(
    X = X,
    Y = Y,
    dict_YX = dict_YX,
    sumstat = sumstat,
    X_ref = X_ref,
    dict_sumstatLD = dict_sumstatLD,
    outcome_names = outcome_names,
    missing_rate_thresh = 0.1,
    maf_cutoff = 0.0005,
    xvar_cutoff = 0,
    pip_cutoff_to_skip_ind = 0,
    qc_method = "none",
    keep_indel = TRUE,
    pip_cutoff_to_skip_sumstat = 0,
    impute = FALSE,
    variant_convention = "A2_A1"
)

2. Loading Data using `colocboost_pipeline` function

This function harmonizes the input data and prepares it for colocalization analysis.

In this section, we introduce how to load the regional data required for the ColocBoost analysis using the load_multitask_regional_data function. This function loads mixed datasets for a specific region, including individual-level data (genotype, phenotype, covariate data), summary statistics (sumstats, LD), or a combination of both. It runs load_regional_univariate_data for each individual-level dataset and load_rss_data for each summary statistics dataset. The output is a list with individual_data and sumstat_data components, where individual_data is a list of individual-level data and sumstat_data is a list of summary statistics data. This list is then passed to the colocboost_pipeline function for the colocalization analysis.

Below are the input parameters for this function for loading individual-level data:

2.1. Loading individual-level data from multiple cohorts

Inputs:

region: String ; Genomic region of interest in the format of chr:start-end for the phenotype region you want to analyze.
genotype_list: Character vector; Paths for PLINK bed files containing genotype data (do NOT include .bed suffix).
phenotype_list: Character vector; Paths for phenotype file names.
covariate_list: Character vector; Paths for covariate file names for each phenotype. Must have the same length as the phenotype file vector.
conditions_list_individual: Character vector; Strings representing different conditions or groups used for naming. Must have the same length as the phenotype file vector.
match_geno_pheno: Integer vector; Indices of phenotypes matched to genotype if multiple genotype PLINK files are used. For each phenotype file in phenotype_list, the index of the genotype file in genotype_list it matches with.
association_window: String; Genomic region of interest in the format of chr:start-end for the association analysis window of variants to test (cis or trans). If not provided, all genotype data will be loaded.
extract_region_name: List of character vectors; Phenotype names (e.g., gene ID ENSG00000269699) to subset the phenotype data when there are multiple phenotypes available in the region. Must have the same length as the phenotype file vector. Default is NULL, which will use all phenotypes in the region.
region_name_col: Integer; 1-based index of the column containing the region name (i.e. 4 for gene ID in a bed file). Required if extract_region_name is not NULL, or if multiple phenotypes fall into the same region in one phenotype file
keep_indel: Logical; indicating whether to keep insertions/deletions (INDELs). Default is TRUE.
keep_samples: Character vector; Sample names to keep. Default is NULL. Currently only supports keeping the same samples from all genotype and phenotype files.
maf_cutoff: Numeric; Minimum minor allele frequency (MAF) cutoff. Default is 0.
mac_cutoff: Numeric; Minimum minor allele count (MAC) cutoff. Default is 0.
xvar_cutoff: Numeric; Minimum genotype variance cutoff. Default is 0.
imiss_cutoff: Numeric; Maximum individual missingness cutoff. Default is 0.

Outputs:

region_data: List (with individual_data, sumstat_data); Output of the load_multitask_regional_data function. If only individual-level data is loaded, sumstat_data will be NULL.

Individual-level data loading example

The following example demonstrates how to set up input data with 3 phenotypes and 2 cohorts. The first cohort has 2 phenotypes and the second cohort has 1 phenotype. The first phenotype has 2 genes and the second phenotype has 1 gene.

# Example of loading individual-level data
region = "chr1:1000000-2000000"
genotype_list = c("plink_cohort1.1", "plink_cohort1.2")
phenotype_list = c("phenotype1_cohort1.bed.gz", "phenotype2_cohort1.bed.gz", "phenotype1_cohort2.bed.gz")
covariate_list = c("covariate1_cohort1.bed.gz", "covariate2_cohort1.bed.gz", "covariate1_cohort2.bed.gz")
conditions_list_individual = c("phenotype1_cohort1", "phenotype2_cohort1", "phenotype1_cohort2")
match_geno_pheno = c(1,1,2)
association_window = "chr1:1000000-2000000" # set to be the same as region for cis-analysis
extract_region_name = list(c("ENSG00000269699, ENSG00000789633"), c("ENSG00000269699"), c("ENSG00000269699", "ENSG00000789633"))
region_name_col = 4
keep_indel = TRUE
keep_samples = c("SAMPLE1", "SAMPLE2", "SAMPLE3")

# Following parameters need to be set according to your data
maf_cutoff = 0.01
mac_cutoff = 10
xvar_cutoff = 0
imiss_cutoff = 0.9

# More advanced parameters see pecotmr::load_multitask_regional_data()
region_data_individual <- load_multitask_regional_data(
    region = region,
    genotype_list = genotype_list,
    phenotype_list = phenotype_list,
    covariate_list = covariate_list,
    conditions_list_individual = conditions_list_individual,
    match_geno_pheno = match_geno_pheno,
    association_window = association_window,
    region_name_col = region_name_col,
    extract_region_name = extract_region_name,
    keep_indel = keep_indel,
    keep_samples = keep_samples,
    maf_cutoff = maf_cutoff,
    mac_cutoff = mac_cutoff,
    xvar_cutoff = xvar_cutoff,
    imiss_cutoff = imiss_cutoff
)

2.2. Loading summary statistics from multiple cohorts or datasets

Inputs:

sumstat_path_list: Character vector; Paths to the summary statistics.
column_file_path_list: Character vector; Paths to the column mapping files. See below for expected format.
LD_meta_file_path_list: Character vector; Paths to LD metadata files. See below for expected format.
conditions_list_sumstat: Character vector; Strings representing different sumstats used for naming. Must have the same length as the sumstat file vector.
match_LD_sumstat: List of character vectors; Mapping each LD metadata file to the summary-statistics conditions to pair with it. Length must equal LD_meta_file_path_list. Each element is a character vector of names present in conditions_list_sumstat. If omitted or an element is empty, defaults to all conditions for the first LD.
association_window: String; Genomic region of interest in the format of chr:start-end for the association analysis window of variants to test (cis or trans). If not provided, all genotype data will be loaded.
n_samples: Integer vector; Sample size. Set a 0 if n_cases/n_controls are passed explicitly. If unknown, set as 0 and include n_samples column in the column mapping file to retrieve from the sumstat file.
n_cases: Integer vector; Number of cases. Set a 0 if n_samples is passed explicitly. If unknown, set as 0 and include n_cases column in the column mapping file to retrieve from the sumstat file.
n_controls: Integer vector; Number of controls. Set a 0 if n_samples is passed explicitly. If unknown, set as 0 and include n_controls column in the column mapping file to retrieve from the sumstat file.

Outputs:

region_data: List (with individual_data, sumstat_data); Output of the load_multitask_regional_data function. If only summary statistics data is loaded, individual_data will be NULL.

Summary statistics loading example

The following example demonstrates how to set up input data with 2 summary statistics and one LD reference.

# Example of loading summary statistics
sumstat_path_list = c("sumstat1.tsv.gz", "sumstat2.tsv.gz")
column_file_path_list = c("column_mapping_sumstat1.yml", "column_mapping_sumstat2.yml")
LD_meta_file_path_list = c("ld_meta_file.tsv")
conditions_list_sumstat = c("sumstat_1", "sumstat_2")
match_LD_sumstat = c("sumstat_1", "sumstat_2")
association_window = "chr1:1000000-2000000"

# Following parameters need to be set according to your data
n_samples = c(300000, 0)
n_cases = c(0, 20000)
n_controls = c(0, 40000)


# More advanced parameters see pecotmr::load_multitask_regional_data()
region_data_sumstat <- load_multitask_regional_data(
    sumstat_path_list = sumstat_path_list,
    column_file_path_list = column_file_path_list,
    LD_meta_file_path_list = LD_meta_file_path_list,
    conditions_list_sumstat = conditions_list_sumstat,
    match_LD_sumstat = match_LD_sumstat,
    association_window = association_window,
    n_samples = n_samples,
    n_cases = n_cases,
    n_controls = n_controls
)

Expected format for column mapping file

The column mapping file is YAML (.yml) with key: value pairs mapping your input column names to the standardized names expected by the loader. Required columns are chrom, pos, A1, and A2, and either z or beta and sebeta. Either ‘n_case’ and ‘n_control’ or ‘n_samples’ can be passed as part of the column mapping, but will be overwritten by the n_cases and n_controls or n_samples parameters passed explicitly.

# required
chrom: chromosome
pos: position
A1: effect_allele
A2: non_effect_allele

z: zscore 
# or
beta: beta
sebeta: sebeta

# optional, will be overridden by n_samples or n_cases and n_controls if passed explicitly
n_case: NCASE
n_control: NCONTROL
# or
n_sample: N

Expected format for LD metadata file LD files should be in the format generated by for instance plink --r squared, then xz compressed. The LD metadata file is a tab-separated file with the following columns: - chrom: chromosome - start: start position - end: end position - ld_path, bim_path: path to the LD block file and bim file

1   1000000 2000000 ld_block_1.ld.gz,ld_block_1.bim

3. Perform ColocBoost using `colocboost_pipeline` function

In this section, we load region data for a combination of individual-level and summary statistics data, then perform the colocalization analysis using the colocboost_pipeline function. The colocalization analysis can be run in any one of three modes, or in a combination of these modes (names assume that individual-level data is xQTL data and summary statistics data is GWAS data):

xQTL-only mode: Only perform colocalization analysis on the individual-level data. Summary statistics data is not used.
joint GWAS mode: Perform colocalization analysis in disease-agnostic mode on the individual-level and summary statistics data together.
separate GWAS mode: Perform colocalization analysis in disease-prioritized mode on the the individual-level data and each summary statistics dataset separately, treating each summary statistics dataset as the focal trait.

Inputs:

region_data: List (with individual_data, sumstat_data); Output of the load_multitask_regional_data function.
focal_trait: String; For xQTL-only mode, the name of the trait to perform disease-prioritized ColocBoost, from conditions_list_individual. If not provided, xQTL-only mode will be run without disease-prioritized mode.
event_filters: List of character vectors; Patterns for filtering events based on context names. Example: for sQTL, list(type_pattern = ".*clu_(\\d+_[+-?]).*", valid_pattern = "clu_(\\d+_[+-?]):PR:", exclude_pattern = "clu_(\\d+_[+-?]):IN:").
maf_cutoff: Numeric; Minor allele frequency cutoff. Default is 0.005.
pip_cutoff_to_skip_ind: Integer vector; Cutoff values for skipping analysis based on pre-screening with single-effect SuSiE (L=1). Context is skipped if none of the variants in the context have PIP values greater than the cutoff. Default is 0 (does not run single-effect SuSiE). Passing a negative value sets the cutoff to 3/number of variants.
pip_cutoff_to_skip_sumstat: Integer vector; Cutoff values for skipping analysis based on pre-screening with single-effect SuSiE (L=1). Sumstat is skipped if none of the variants in the sumstat have PIP values greater than the cutoff. Default is 0 (does not run single-effect SuSiE). Passing a negative value sets the cutoff to 3/number of variants.
qc_method: String; Quality control method to use. Options are “rss_qc”, “dentist”, or “slalom”. Default is rss_qc.
impute: Logical; if TRUE, performs imputation for outliers identified in the analysis. Default is TRUE.
impute_opts: List of lists; Imputation options including rcond, R2_threshold, and minimum_ld. Default is list(rcond = 0.01, R2_threshold = 0.6, minimum_ld = 5).
xqtl_coloc: Logical; if TRUE, performs xQTL-only mode. Default is TRUE.
joint_gwas: Logical; if TRUE, performs joint GWAS mode, mapping all individual-level and sumstat data together.Default is FALSE.
separate_gwas: Logical; if TRUE, runs separate GWAS mode, where each sumstat dataset is analyzed separately with all individual-level data, treating each sumstat as the focal trait in disease-prioritized mode. Default is FALSE.

Outputs:

colocboost_results: List of colocboost objects (with xqtl_coloc, joint_gwas, separate_gwas); Output of the colocboost_pipeline function. If the mode is not run, the corresponding element will be NULL.

#### Please check the example code below ####
# # load in individual-level and sumstat data
region_data_combined <- load_multitask_regional_data(
    region = region,
    genotype_list = genotype_list,
    phenotype_list = phenotype_list,
    covariate_list = covariate_list,
    conditions_list_individual = conditions_list_individual,
    match_geno_pheno = match_geno_pheno,
    association_window = association_window,
    region_name_col = region_name_col,
    extract_region_name = extract_region_name,
    keep_indel = keep_indel,
    keep_samples = keep_samples,
    maf_cutoff = maf_cutoff,
    mac_cutoff = mac_cutoff,
    xvar_cutoff = xvar_cutoff,
    imiss_cutoff = imiss_cutoff,
    sumstat_path_list = sumstat_path_list,
    column_file_path_list = column_file_path_list,
    LD_meta_file_path_list = LD_meta_file_path_list,
    conditions_list_sumstat = conditions_list_sumstat,
    match_LD_sumstat = match_LD_sumstat,
    n_samples = n_samples,
    n_cases = n_cases,
    n_controls = n_controls
)

maf_cutoff = 0.01
pip_cutoff_to_skip_ind = rep(0, length(phenotype_list))
pip_cutoff_to_skip_sumstat = rep(0, length(sumstat_path_list))
qc_method = "rss_qc" 

# run colocboost analysis
colocboost_results <- colocboost_pipeline(
    region_data_combined,
    maf_cutoff = maf_cutoff, 
    pip_cutoff_to_skip_ind = pip_cutoff_to_skip_ind,
    pip_cutoff_to_skip_sumstat = pip_cutoff_to_skip_sumstat,
    qc_method = qc_method,
    xqtl_coloc = TRUE,
    joint_gwas = TRUE,
    separate_gwas = TRUE
)

# visualize results for xQTL-only mode
colocboost_plot(colocboost_results$xqtl_coloc)

# visualize results for joint GWAS mode
colocboost_plot(colocboost_results$joint_gwas)

# visualize results for separate GWAS mode
for (i in 1:length(colocboost_results$separate_gwas)) {
    colocboost_plot(colocboost_results$separate_gwas[[i]])
}

1. ColocBoost analysis with basic QC steps

2. Loading Data using colocboost_pipeline function

2.1. Loading individual-level data from multiple cohorts

2.2. Loading summary statistics from multiple cohorts or datasets

3. Perform ColocBoost using colocboost_pipeline function

2. Loading Data using `colocboost_pipeline` function

3. Perform ColocBoost using `colocboost_pipeline` function