Multi-trait colocalization using ColocBoost#

Prerequisites#

Requires fine-mapping outputs (.susie.rds files) from mnm_regression or rss_analysis.

Description#

We use ColocBoost to perform colocalization analyses cf. Cao et al (2025). In short, ColocBoost performs colocalization while accounting for multiple causal variants within a genomic region of interest and can scale to hundreds of traits. We allow users to include or omit GWAS summary statistics. Required inputs are individual xQTL data from the same cohort (multiple phenotypes, same genotype). Linkage disequilibrium reference data is required if GWAS summary statistics are used.

Input#

  1. A list of regions to be analyzed (optional); the last column of this file should be region name.

  2. Either a list of per chromosome genotype files, or one file for genotype data of the entire genome. Genotype data has to be in PLINK bed format.

  3. Vector of lists of phenotype files per region to be analyzed, in UCSC bed.gz with index in bed.gz.tbi formats.

  4. Vector of covariate files corresponding to the lists above.

  5. Customized association windows file for variants (cis or trans). If it is not provided, a fixed sized window will be used around the region (a cis-window)

  6. Optionally a vector of names of the phenotypic conditions in the form of cond1 cond2 cond3 separated with whitespace.

  7. Optionally summary statistics meta-data file and LD reference meta-data file.

Input 2 and 3 should be outputs from genotype_per_region and annotate_coord modules in previous preprocessing steps. 4 should be output of covariate_preprocessing pipeline that contains genotype PC, phenotypic hidden confounders and fixed covariates.

Example genotype, phenotype and association analysis windows#

See this page for example inputs of these information.

Example summary statistics and LD reference#

See rss_analysis for LD reference formats (pre-computed blocks and PLINK genotype files), per-study LD reference via the ld_meta_data column in GWAS meta-data, and mixture LD panels.

About indels#

Use --no-indel to remove indel variants from analysis. This applies to both individual-level genotype loading and summary statistics QC. Default keeps indels.

Output#

For each analysis region the pipeline fits ColocBoost and saves the model in RDS format — one RDS per gene of interest (and per GWAS, when GWAS integration is used). Each RDS contains:

  • ucos_summary — a summary table of the colocalization events.

  • vpa — the variable colocalized probability for each variant (the probability of a variant being colocalized with at least one trait).

  • data_info — information on the input data.

  • model_info — information on the fitted ColocBoost model.

  • ucos_details — trait-specific (uncolocalized) effects information.

  • region_info — information on the region(s) of interest.

  • computing_time — time taken for each step (loading, QC and analysis).

Note: the suggested output naming convention is cohort_....

Minimal Working Example Steps#

Timing: ~X min

xQTL only analysis#

Below we duplicate the examples for phenotype and covariates to demonstrate that when there are multiple phenotypes for the same genotype it is possible to use this pipeline to analyze all of them (more than two is accepted as well).

Here using --region-name we focus the analysis on 3 genes. In practice if this parameter is dropped, the union of all regions in all phenotype region lists will be analyzed. It is possible for some of the regions there are no genotype data, in which case the pipeline will output RDS files with a warning message to indicate the lack of genotype data to analyze.

Note: Suggested output naming convention is cohort_modality, eg ROSMAP_snRNA_pseudobulk.

Step 1. ColocBoost with xQTL only#

This is the minimal working example and runs end-to-end on the toy gene-expression data, producing a *.cb_xqtl.rds colocalization result for the region. The same command also accepts peaks or any other molecular phenotype - just swap in the phenotype .bed.gz/manifest.

# Toy gene-expression example (run from the toy_example root).
# xQTL-only ColocBoost: builds one QtlDataset, then per-gene ColocBoost into
# {name}.{gene}.colocboost.rds.
sos run pipeline/colocboost.ipynb colocboost \
    --name colocboost_xqtl --cwd output/colocboost_xqtl \
    --genoFile input/colocboost/example.chr22.bed \
    --phenoFile input/colocboost/pheno_manifest_multicontext.tsv \
    --covFile input/colocboost/example_covariates.tsv --transpose-covariates \
    --customized-association-windows input/colocboost/association_windows.bed \
    --region-name ENSG00000130538 \
    --no-separate-gwas --xqtl-coloc -j1

Step 2. ColocBoost with GWAS#

Adds GWAS summary statistics (via --gwas-meta-data) and an LD reference (via --ld-meta-data). The xQTL ColocBoost completes, but the separate GWAS-xQTL colocalization is still under development with the installed package (see the note in the command cell).

# Toy example with GWAS (separate-focal coloc per study), from toy_example root.
# Per gene a region GwasSumStats is built, then xQTL + separate-GWAS coloc are
# written to one combined {name}.{gene}.colocboost.rds.
sos run pipeline/colocboost.ipynb colocboost \
    --name colocboost_gwas --cwd output/colocboost_gwas \
    --genoFile input/colocboost/example.chr22.bed \
    --phenoFile input/colocboost/pheno_manifest_multicontext.tsv \
    --covFile input/colocboost/example_covariates.tsv --transpose-covariates \
    --customized-association-windows input/colocboost/association_windows.bed \
    --gwas-meta-data input/colocboost/gwas_meta.txt \
    --ld-meta-data input/ld_reference/protocol_example.ld_meta_file.tsv \
    --region-name ENSG00000130538 \
    --separate-gwas --xqtl-coloc -j1

Step 3. ColocBoost with multiple LD references (multi-ancestry)#

Uses a per-study LD reference declared in the GWAS meta-data. It shares the separate-GWAS code path with Step 2 and is therefore still under development on the toy dataset.

# Multi-ancestry template: each study uses the per-study LD reference named in
# the gwas meta ld_meta_data column (gwas_sumstats_construct.R resolves a single
# shared panel per region, so genuinely distinct per-study panels are not merged
# here). Same separate-GWAS code path as the example above.
sos run pipeline/colocboost.ipynb colocboost \
    --name colocboost_multi_ld --cwd output/colocboost_multi_ld \
    --genoFile input/colocboost/example.chr22.bed \
    --phenoFile input/colocboost/pheno_manifest_multicontext.tsv \
    --covFile input/colocboost/example_covariates.tsv --transpose-covariates \
    --customized-association-windows input/colocboost/association_windows.bed \
    --gwas-meta-data input/colocboost/gwas_meta.txt \
    --ld-meta-data input/ld_reference/protocol_example.ld_meta_file.tsv \
    --region-name ENSG00000130538 \
    --separate-gwas --xqtl-coloc -j1

Command interface#

sos run pipeline/colocboost.ipynb -h
usage: sos run /restricted/projectnb/xqtl/xqtl_protocol/xqtl-protocol/code/mnm_analysis/mnm_methods/colocboost.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  get_analysis_regions
  get_rss_analysis_regions
  colocboost

Global Workflow Options:
  --name VAL (as str, required)
                        It is required to input the name of the analysis
  --cwd output (as path)
  --genoFile VAL (as path, required)
                        A list of file paths for genotype data, or the genotype data itself.
  --phenoFile  paths

                        One or multiple lists of file paths for phenotype data.
  --phenoIDFile  paths()

                        One or multiple lists of file paths for phenotype ID mapping file. The first column should be the original ID, the 2nd column should be the ID to be mapped to.
  --covFile  paths

                        Covariate file path
  --gwas-meta-data . (as path)
                        Summary statistics interface, see `rss_analysis.ipynb` for details
  --ld-meta-data . (as path)
  --gwas-name  (as list)
  --gwas-data  (as list)
  --column-mapping  (as list)
  --region-list . (as path)
                        Optional: if a region list is provide the analysis will be focused on provided region. The LAST column of this list will contain the ID of regions to focus on when
                        region_name is given Otherwise, all regions with both genotype and phenotype files will be analyzed
  --region-name  (as list)
                        Optional: if a region name is provided the analysis would be focused on the union of provides region list and region names
  --keep-samples . (as path)
                        Only focus on a subset of samples
  --keep-variants . (as path)
                        Only focus on a subset of variants
  --customized-association-windows . (as path)
                        An optional list documenting the custom association window for each region to analyze, with four column, chr, start, end, region ID (eg gene ID). If this list is not
                        provided, the default `window` parameter (see below) will be used.
  --cis-window -1 (as int)
                        Specify the cis window for the up and downstream radius to analyze around the region of interest in units of bp When this is set to negative, we will rely on using
                        customized_association_windows
  --[no-]save-data (default to False)
                        save data object or not
  --phenotype-names  [f'{x:bn}' for x in phenoFile]

                        Name of phenotypes
  --[no-]trans-analysis (default to False)
                        And indicator whether it is trans-analysis, ie, not using phenotypic coordinate information
  --imiss 1.0 (as float)
                        remove a variant if it has more than imiss missing individual level data
  --maf 0.0025 (as float)
                        MAF and variance of X cutoff
  --xvar-cutoff 0.0 (as float)
  --mac 5 (as int)
                        MAC cutoff, on top of MAF cutoff
  --[no-]indel (default to True)
                        Remove indels if indel = False
  --event-filter-rules . (as path)
  --skip-analysis-pip-cutoff  (as list)
                        If this value is not 0, then an initial single effect analysis will be performed to determine if follow up analysis will be continued or to simply return NULL If this is
                        negative we use a default way to determine this cutoff which is conservative but still useful
  --skip-sumstats-analysis-pip-cutoff -1.0 (as float)
  --[no-]xqtl-coloc (default to True)
                        Perform xQTL colocalization
  --[no-]joint-gwas (default to False)
                        Perform joint colocalization with many traits
  --[no-]separate-gwas (default to True)
                        Perform separate GWAS targeted anlaysis one at a time
  --[no-]impute (default to False)
                        Whether to impute the sumstat for all the snp in LD but not in sumstat.
  --rcond 0.01 (as float)
  --lamb 0.01 (as float)
  --R2-threshold 0.6 (as float)
  --minimum-ld 5 (as int)
  --qc-method NULL
                        summary stats QC methods: slalom, dentist
  --extract-sumstats-region-name NULL
  --sumstats-region-name-col NULL
                        Either an index number in str format or "NULL" as a string
  --comment-string NULL
  --ld-data-name  (as list)
  --container ''
                        Analysis environment settings
  --job-size 200 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 1h
                        Wall clock time expected
  --mem 20G
                        Memory expected
  --numThreads 1 (as int)
                        Number of threads

Sections
  get_analysis_regions:
  get_rss_analysis_regions:
  colocboost:

Workflow implementation#

[global]
parameter: modular_script_dir = path('code/script')  # override with --modular-script-dir
# It is required to input the name of the analysis
parameter: name = str
parameter: cwd = path("output")
# PLINK1 genotype: pass the .bed file (.bim/.fam are read automatically).
parameter: genoFile = path
# QtlDataset phenotype manifest TSV. Columns: ID, #chr, start, end, path,
# cond, and an optional cov_path (per-context covariates). The first path is
# used. Replaces the legacy list-of-phenotype-BEDs input.
parameter: phenoFile = paths
# Legacy phenotype ID-mapping file(s); accepted for CLI compatibility but not
# consumed by the manifest-based flow (the manifest carries the IDs).
parameter: phenoIDFile = paths()
# Uniform covariate file applied across all contexts (QTLtools-format
# rows=covariates supported via --transpose-covariates). Per-context
# covariates instead come from the manifest cov_path column.
parameter: covFile = paths()
# GWAS summary-statistics meta TSV. Columns: study_id, chrom, file_path;
# optional n_sample, n_case, n_control, ld_meta_data, column_mapping_file.
parameter: gwas_meta_data = path()
# Default LD-reference meta (#chr, start, end, path). Per-study overrides come
# from the gwas meta ld_meta_data column.
parameter: ld_meta_data = path()
# Optional region list whose LAST column lists region/gene IDs to analyze.
parameter: region_list = path()
# Region/gene IDs to analyze (space separated on the CLI).
parameter: region_name = []
# Optional per-gene association windows (BED: #chr start end ID) overriding the
# default cis_window for the GWAS extraction window.
parameter: customized_association_windows = path()
# cis window (bp) on each side of the gene for fine-mapping and the default
# GWAS extraction window.
parameter: cis_window = 1000000
# Save the per-gene colocboost input bundle (currently unused; reserved).
parameter: save_data = False
# QtlDataset study label and covariate handling (legacy CLI retained).
parameter: study = name
parameter: transpose_covariates = False
parameter: genotype_covariates = path('.')
# Remove a variant if it has more than imiss fraction missing genotypes.
parameter: imiss = 1.0
# MAF and variance-of-X cutoffs.
parameter: maf = 0.0025
parameter: xvar_cutoff = 0.0
# MAC cutoff, on top of the MAF cutoff.
parameter: mac = 5
# Keep indels when True.
parameter: indel = True
# Only keep a subset of samples / variants (whitespace/newline-separated IDs).
parameter: keep_samples = path()
parameter: keep_variants = path()
# Per-context single-effect skip cutoff passed to colocboost.R
# (--pip-cutoff-to-skip): a scalar applied to every context, or one or more
# context=value items (negative -> data-driven 3/ncol(X); empty -> 0).
parameter: skip_analysis_pip_cutoff = []
# Summary-stats single-effect skip cutoff passed to summaryStatsQc via
# gwas_sumstats_construct.R.
parameter: skip_sumstats_analysis_pip_cutoff = -1.0
# ColocBoost variants to run.
parameter: xqtl_coloc = True
parameter: joint_gwas = False
parameter: separate_gwas = True
# Summary-stats LD-mismatch QC method (none | slalom | dentist).
parameter: qc_method = "none"
# Analysis environment settings.
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
# For cluster jobs, number of commands to run per job.
parameter: job_size = 200
parameter: walltime = "1h"
parameter: mem = "20G"
parameter: numThreads = 1
[colocboost_1]
# Build one pecotmr::QtlDataset from the phenotype manifest + shared genotype
# and serialize to RDS (mirrors mnm_regression's qtl_dataset_construct). No
# fan-out: the per-gene [colocboost_3] step loads this single RDS and selects
# the focal gene/contexts at analysis time. Per-context covariates come from
# the manifest cov_path column; a uniform --covFile (or --genotype-covariates)
# applies across all contexts. ColocBoost runs on unscaled residuals
# (--no-scale-residuals), matching the legacy pipeline.
cov_arg = covFile[0] if (len(covFile) > 0 and covFile[0].is_file()) else genotype_covariates
input: None
output: f"{cwd:a}/qtl_dataset/{study}.qtl_dataset.rds"
task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f"{step_name}_{_output:bn}"
bash: expand = '${ }', stderr = f"{_output}.stderr", stdout = f"{_output}.stdout", container = container, entrypoint = entrypoint
    Rscript ${modular_script_dir}/pecotmr_integration/qtl_dataset_construct.R \
        --study ${study} \
        --genotype-prefix ${genoFile:n} \
        --phenotype-manifest ${phenoFile[0]} \
        --genotype-covariates ${cov_arg if cov_arg.is_file() else '""'} \
        ${'--transpose-covariates' if transpose_covariates else ''} \
        --maf-cutoff ${maf} \
        --xvar-cutoff ${xvar_cutoff} \
        --mac-cutoff ${mac} \
        --imiss-cutoff ${imiss} \
        ${('--keep-samples ' + str(keep_samples)) if keep_samples.is_file() else ''} \
        ${('--keep-variants ' + str(keep_variants)) if keep_variants.is_file() else ''} \
        ${'' if indel else '--drop-indel'} \
        --no-scale-residuals \
        --output ${_output}
[colocboost_2]
# Resolve per-gene analysis units (gene -> ld_block + grouped GWAS sources)
# into one manifest TSV via colocboost_manifest.R, so the next step can fan out
# over its rows with inline csv.DictReader (no notebook-local Python parsing).
input: None
output: f"{cwd:a}/colocboost/{name}.colocboost_manifest.tsv"
task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f"{step_name}_{_output:bn}"
bash: expand = '${ }', stderr = f"{_output}.stderr", stdout = f"{_output}.stdout", container = container, entrypoint = entrypoint
    Rscript ${modular_script_dir}/pecotmr_integration/colocboost_manifest.R \
        --pheno-manifest ${phenoFile[0]} \
        --cis-window ${cis_window} \
        ${('--customized-association-windows ' + str(customized_association_windows)) if customized_association_windows.is_file() else ''} \
        ${('--region-name ' + ','.join(region_name)) if len(region_name) > 0 else ''} \
        ${('--region-list ' + str(region_list)) if region_list.is_file() else ''} \
        ${('--gwas-meta ' + str(gwas_meta_data)) if gwas_meta_data.is_file() else ''} \
        ${('--ld-meta ' + str(ld_meta_data)) if ld_meta_data.is_file() else ''} \
        --output ${_output}

ColocBoost analysis#

[colocboost_3]
# Per-gene ColocBoost over the pre-built QtlDataset. Fans out over the manifest
# rows; per gene (1) builds a region GwasSumStats via gwas_sumstats_construct.R
# when the gene has GWAS studies, then (2) runs colocboost.R (xQTL /
# joint-GWAS / separate-GWAS variants) into ONE combined RDS. Replaces the
# legacy load_multitask_regional_data + colocboost_analysis_pipeline path.
import csv
manifest = f"{cwd:a}/colocboost/{name}.colocboost_manifest.tsv"
jobs = list(csv.DictReader(open(manifest), delimiter='\t'))
stop_if(len(jobs) == 0, "colocboost: empty manifest; check --region-name / phenotype manifest.")
input: f"{cwd:a}/qtl_dataset/{name}.qtl_dataset.rds", for_each = "jobs"
output: f"{cwd:a}/colocboost/{name}.{_jobs['region_id']}.colocboost.rds"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f"{step_name}_{_output:bn}"
bash: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
    set -e
    gene=${_jobs['gene_id']}
    studies="${_jobs['studies']}"
    gss_arg=""
    if [ -n "$studies" ]; then
        gss="${_output:nn}.gwas_sumstats.rds"
        Rscript ${modular_script_dir}/pecotmr_integration/gwas_sumstats_construct.R \
            --study "$studies" \
            --gwas-tsv "${_jobs['gwas_tsvs']}" \
            --ld-block "${_jobs['ld_block']}" \
            --ld-meta "${_jobs['ld_meta']}" \
            --n-case "${_jobs['n_cases']}" \
            --n-control "${_jobs['n_controls']}" \
            --pip-cutoff-to-skip ${skip_sumstats_analysis_pip_cutoff} \
            --qc-method ${qc_method} \
            ${('--column-mapping "' + _jobs['column_mappings'] + '"') if _jobs['column_mappings'] else ''} \
            --output "$gss"
        gss_arg="--gwas-sumstats $gss"
    fi
    Rscript ${modular_script_dir}/pecotmr_integration/colocboost.R \
        --qtl-dataset ${_input} \
        --gene-id $gene \
        --cis-window ${cis_window} \
        $gss_arg \
        ${'' if xqtl_coloc else '--no-xqtl-coloc'} \
        ${'--joint-gwas' if joint_gwas else ''} \
        ${'--separate-gwas' if separate_gwas else ''} \
        --pip-cutoff-to-skip "${",".join([str(x) for x in skip_analysis_pip_cutoff])}" \
        --output ${_output}

Troubleshooting#

Step

Substep

Problem

Possible Reason

Solution

Anticipated Results#

For the toy chr22 dataset, produces protocol_example.colocboost.rds with ColocBoost posteriors. Expect a convergence message and sparse result for the single-region toy example.