Multi-trait colocalization using ColocBoost#
Description#
This notebook performs colocalization for multiple xQTL data-sets in a given region, with or without GWAS summary statistics. The current protocol (February 2025) supports individual-level xQTL data from the same cohort (multiple phenotypes, same genotype), along with GWAS summary statistics and companion LD reference data.
Input#
1. A list of regions to be analyzed (optional); the last column of this file should be the region name.
2. Either a list of per-chromosome genotype files, or one file for genotype data of the entire genome. Genotype data has to be in PLINK `bed` format.
3. Vector of lists of phenotype files per region to be analyzed, in UCSC `bed.gz` format with index in `bed.gz.tbi` format.
4. Vector of covariate files corresponding to the lists above.
5. Customized association windows file for variants (cis or trans). If it is not provided, a fixed-size window will be used around each region (a cis-window).
6. Optionally, a vector of names of the phenotypic conditions in the form of `cond1 cond2 cond3`, separated by whitespace.
7. Optionally, a summary statistics meta-data file and an LD reference meta-data file.
Inputs 2 and 3 should be outputs from the `genotype_per_region` and `annotate_coord` modules in previous preprocessing steps. Input 4 should be the output of the `covariate_preprocessing` pipeline, which contains genotype PCs, phenotypic hidden confounders and fixed covariates.
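Only the last column of the region list is treated as region names, as in this minimal sketch mirroring the parsing in the [global] section below (the file name region_list.txt and its contents are illustrative assumptions):
import pandas as pd

# Whitespace-delimited region list; the LAST column is taken as the region IDs.
region_list_df = pd.read_csv("region_list.txt", sep=r"\s+", header=None, comment="#")
region_ids = region_list_df.iloc[:, -1].unique()
print(region_ids)  # e.g. ['ENSG00000241973_P42356' ...]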
Example genotype, phenotype and association analysis windows#
See this page for examples of these inputs.
Example summary statistics and LD reference#
See this page for examples of these inputs.
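As a rough sketch of what the GWAS meta-data file must contain (the path below comes from the example commands; the required columns come from the workflow implementation, and the optional columns are read when present):
import pandas as pd

# Required columns checked by the workflow; column_mapping_file, n_sample,
# n_case and n_control are optional and used when present.
required_gwas_columns = ['study_id', 'chrom', 'file_path']
gwas_df = pd.read_csv("input/GWAS_sumstat_meta.tsv", sep="\t")
missing = [col for col in required_gwas_columns if col not in gwas_df.columns]
assert not missing, f"Missing required columns: {', '.join(missing)}"
# A chrom value of 0 means the summary statistics file covers the whole genome
# and is expanded internally to chromosomes 1-22.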
About indels#
The option `--no-indel` will remove indels from the analysis. FIXME: Gao needs to provide more guidelines on how to deal with indels in practice.
Output#
For each analysis region, the outputs are various fitted ColocBoost models saved in RDS format.
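The exact set of files depends on the analysis flags; here is a minimal sketch of the naming scheme used by the workflow below (the region and study values are illustrative assumptions):
# Output naming per analysis region: {name}.{chr}_{region_id}.<suffix>.rds
name, chr_info, gene_id = "protocol_example_protein", "chr21", "ENSG00000241973_P42356"
base_filename = f"{name}.{chr_info}_{gene_id}"
expected_outputs = [
    f"output/colocboost/{base_filename}.cb_xqtl.rds",             # --xqtl-coloc
    f"output/colocboost/{base_filename}.cb_xqtl_joint_gwas.rds",  # --joint-gwas
    # one file per GWAS study when --separate-gwas is on, e.g.
    # f"output/colocboost/{base_filename}.cb_xqtl_<study_id>.rds",
    f"output/data/{base_filename}.cb_data.rds",                   # --save-data
]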
Minimal Working Example Steps#
Timing [FIXME]
xQTL-only analysis#
Below we duplicate the phenotype and covariate examples to demonstrate that, when there are multiple phenotypes for the same genotype, this pipeline can analyze all of them (more than two phenotypes are accepted as well).
Here, using `--region-name`, we focus the analysis on 3 genes. In practice, if this parameter is dropped, the union of all regions in all phenotype region lists will be analyzed. Some regions may have no genotype data, in which case the pipeline will output RDS files with a warning message indicating the lack of genotype data to analyze.
Note: the suggested output naming convention is cohort_modality, e.g. ROSMAP_snRNA_pseudobulk.
sos run pipeline/colocboost.ipynb colocboost \
--name protocol_example_protein \
--genoFile input/xqtl_association/protocol_example.genotype.chr21_22.bed \
--phenoFile output/phenotype/protocol_example.protein.region_list.txt \
output/phenotype/protocol_example.protein.region_list.txt \
--covFile output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
--customized-association-windows input/xqtl_association/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
--no-separate-gwas --xqtl-coloc \
--region-name ENSG00000241973_P42356 ENSG00000160209_O00764 ENSG00000100412_Q99798 \
--phenotype-names trait_A trait_B
It is also possible to analyze a selected list of regions using the option `--region-list`. The last column of this file will be used as the list of regions to analyze. Here, for example, we use the same list of regions as the customized association windows:
sos run pipeline/colocboost.ipynb colocboost \
--name protocol_example_protein \
--genoFile input/xqtl_association/protocol_example.genotype.chr21_22.bed \
--phenoFile output/phenotype/protocol_example.protein.region_list.txt \
output/phenotype/protocol_example.protein.region_list.txt \
--covFile output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
--customized-association-windows input/xqtl_association/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
--no-separate-gwas --xqtl-coloc \
--region-list xqtl_association/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
--phenotype-names trait_A trait_B
Note: when both `--region-name` and `--region-list` are used, the union of regions from these parameters will be analyzed.
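Internally this is a plain set union over region IDs, along the lines of this sketch (the file name is an assumption; the logic mirrors the [global] section below):
import pandas as pd

# Union of the last column of --region-list with the --region-name values.
region_ids = pd.read_csv("region_list.txt", sep=r"\s+", header=None, comment="#").iloc[:, -1].unique()
region_name = ["ENSG00000241973_P42356", "ENSG00000160209_O00764"]
region_ids = list(set(region_ids).union(set(region_name)))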
GWAS-xQTL integration#
sos run pipeline/colocboost.ipynb colocboost \
--name protocol_example_protein \
--genoFile input/xqtl_association/protocol_example.genotype.chr21_22.bed \
--phenoFile output/phenotype/protocol_example.protein.region_list.txt \
output/phenotype/protocol_example.protein.region_list.txt \
--covFile output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
--customized-association-windows input/xqtl_association/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
--ld-meta-data input/LD_matrix/ld_meta_file.tsv \
--gwas-meta-data input/GWAS_sumstat_meta.tsv \
--separate-gwas --xqtl-coloc \
--region-name ENSG00000241973_P42356 ENSG00000160209_O00764 ENSG00000100412_Q99798 \
--phenotype-names trait_A trait_B
Troubleshooting#
| Step | Substep | Problem | Possible Reason | Solution |
|---|---|---|---|---|
Command interface#
sos run colocboost.ipynb -h
Workflow implementation#
[global]
# It is required to input the name of the analysis
parameter: name = str
parameter: cwd = path("output")
# A list of file paths for genotype data, or the genotype data itself.
parameter: genoFile = path
# One or multiple lists of file paths for phenotype data.
parameter: phenoFile = paths
# One or multiple lists of file paths for phenotype ID mapping file. The first column should be the original ID, the 2nd column should be the ID to be mapped to.
parameter: phenoIDFile = paths()
# Covariate file path
parameter: covFile = paths
# Summary statistics interface, see `rss_analysis.ipynb` for details
parameter: gwas_meta_data = path()
parameter: ld_meta_data = path()
parameter: gwas_name = []
parameter: gwas_data = []
parameter: column_mapping = []
# Optional: if a region list is provided, the analysis will be focused on the listed regions.
# The LAST column of this file should contain the IDs of the regions to analyze.
# Otherwise, all regions with both genotype and phenotype files will be analyzed.
parameter: region_list = path()
# Optional: if region names are provided,
# the analysis will be focused on the union of the provided region list and region names.
parameter: region_name = []
# Only focus on a subset of samples
parameter: keep_samples = path()
# Only focus on a subset of variants
parameter: keep_variants = path()
# An optional list documenting the custom association window for each region to analyze, with four columns: chr, start, end, region ID (e.g. gene ID).
# If this list is not provided, the default `window` parameter (see below) will be used.
parameter: customized_association_windows = path()
# Specify the cis window, i.e. the radius up- and downstream of the region of interest to analyze, in units of bp.
# When this is set to a negative value, customized_association_windows will be used instead.
parameter: cis_window = -1
# save data object or not
parameter: save_data = False
# Name of phenotypes
parameter: phenotype_names = [f'{x:bn}' for x in phenoFile]
# An indicator of whether this is a trans analysis, i.e., not using phenotypic coordinate information
parameter: trans_analysis = False
parameter: seed = 999
# Remove a variant if more than `imiss` of its individual-level data is missing
parameter: imiss = 1.0
# MAF and variance of X cutoff
parameter: maf = 0.0025
parameter: xvar_cutoff = 0.0
# MAC cutoff, on top of MAF cutoff
parameter: mac = 5
# Remove indels if indel = False
parameter: indel = True
parameter: event_filter_rules = path()
# If this value is not 0, an initial single-effect analysis will be performed
# to determine whether follow-up analysis should continue or simply return NULL.
# If this is negative, a default cutoff is used that is conservative but still useful.
parameter: skip_analysis_pip_cutoff = []
parameter: skip_sumstats_analysis_pip_cutoff = -1.0
# Perform xQTL colocalization
parameter: xqtl_coloc = True
# Perform joint colocalization with many traits
parameter: joint_gwas = False
# Perform separate GWAS-targeted analyses, one GWAS at a time
parameter: separate_gwas = True
# Whether to impute summary statistics for SNPs that are in the LD reference but not in the summary statistics.
parameter: impute = False
parameter: rcond = 0.01
parameter: lamb = 0.01
parameter: R2_threshold = 0.6
parameter: minimum_ld = 5
# summary stats QC methods: rss_qc, dentist, slalom
parameter: qc_method = "NULL"
parameter: extract_sumstats_region_name = "NULL"
# Either an index number in str format or "NULL" as a string
parameter: sumstats_region_name_col = "NULL"
parameter: comment_string = "NULL"
parameter: ld_data_name = []
# Analysis environment settings
parameter: container = ""
parameter: entrypoint= ""
# For cluster jobs, the number of commands to run per job
parameter: job_size = 200
# Wall clock time expected
parameter: walltime = "1h"
# Memory expected
parameter: mem = "20G"
# Number of threads
parameter: numThreads = 1
if len(phenoFile) != len(covFile):
raise ValueError("Number of input phenotypes files must match that of covariates files")
if len(phenoFile) != len(phenotype_names):
raise ValueError("Number of input phenotypes files must match the number of phenotype names")
if len(phenoIDFile) > 0 and len(phenoFile) != len(phenoIDFile):
raise ValueError("Number of input phenotypes files must match the number of phenotype ID mapping files")
if len(skip_analysis_pip_cutoff) == 0:
skip_analysis_pip_cutoff = [0.0] * len(phenoFile)
if len(skip_analysis_pip_cutoff) == 1:
skip_analysis_pip_cutoff = skip_analysis_pip_cutoff * len(phenoFile)
if len(skip_analysis_pip_cutoff) != len(phenoFile):
raise ValueError(f"``skip_analysis_pip_cutoff`` should have either length 1 or length the same as phenotype files ({len(phenoFile)} in this case)")
# make it into an R List string
skip_analysis_pip_cutoff = [f"'{y}'={x}" for x,y in zip(skip_analysis_pip_cutoff, phenotype_names)]
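# For example, with phenotype_names = ['trait_A', 'trait_B'] and cutoffs [0.0, 0.0], this
# yields ["'trait_A'=0.0", "'trait_B'=0.0"], which the R code below joins into
# list('trait_A'=0.0,'trait_B'=0.0).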
def group_by_region(lst, partition):
# Each element of `partition` (i.e. regional_data["data"]) is already a complete
# per-region group of files, so the flattened input `lst` needs no re-grouping.
return partition
import os
import pandas as pd
def adapt_file_path(file_path, reference_file):
"""
Adapt a single file path based on its existence and a reference file's path.
Args:
- file_path (str): The file path to adapt.
- reference_file (str): File path to use as a reference for adaptation.
Returns:
- str: Adapted file path.
Raises:
- FileNotFoundError: If no valid file path is found.
"""
reference_path = os.path.dirname(reference_file)
# Check if the file exists
if os.path.isfile(file_path):
return file_path
# Check file name without path
file_name = os.path.basename(file_path)
if os.path.isfile(file_name):
return file_name
# Check file name in reference file's directory
file_in_ref_dir = os.path.join(reference_path, file_name)
if os.path.isfile(file_in_ref_dir):
return file_in_ref_dir
# Check original file path prefixed with reference file's directory
file_prefixed = os.path.join(reference_path, file_path)
if os.path.isfile(file_prefixed):
return file_prefixed
# If all checks fail, raise an error
raise FileNotFoundError(f"No valid path found for file: {file_path}")
def adapt_file_path_all(df, column_name, reference_file):
return df[column_name].apply(lambda x: adapt_file_path(x, reference_file))
[get_analysis_regions: shared = "regional_data"]
# Input is genoFile, phenoFile, covFile and optionally region_list. If region_list is present, then we only analyze the regions contained in the list.
# regional_data should be a dictionary like:
#{'data': [("genotype_1.bed", "phenotype_1.bed.gz", "covariate_1.gz"), ("genotype_2.bed", "phenotype_1.bed.gz", "phenotype_2.bed.gz", "covariate_1.gz", "covariate_2.gz") ... ],
# 'meta_info': [("chr12:752578-752579","chr12:752577-752580", "gene_1", "trait_1"), ("chr13:852580-852581","chr13:852579-852580", "gene_2", "trait_1", "trait_2") ... ]}
import numpy as np
def preload_id_map(id_map_files):
id_maps = {}
missing_files = []
for id_map_file in id_map_files:
if id_map_file is not None:
if os.path.isfile(id_map_file):
df = pd.read_csv(id_map_file, sep=r"\s+", header=None, comment='#', names=['old_ID', 'new_ID'])
id_maps[id_map_file] = df.set_index('old_ID')['new_ID'].to_dict()
else:
missing_files.append(id_map_file)
# If there are missing files, print a summary
# if missing_files:
# print(f"Warning: {len(missing_files)} ID map file(s) specified but not found:")
# for file in missing_files:
# print(f" - {file}")
return id_maps
def load_and_apply_id_map(pheno_path, id_map_path, preloaded_id_maps):
pheno_df = pd.read_csv(pheno_path, sep=r"\s+", header=0)
pheno_df['Original_ID'] = pheno_df['ID']
# Check if id_map_path is specified
if id_map_path is not None:
# Check if the id_map_path exists in preloaded maps
if id_map_path in preloaded_id_maps:
id_map = preloaded_id_maps[id_map_path]
pheno_df['ID'] = pheno_df['ID'].map(id_map).fillna(pheno_df['ID'])
else:
# ID map file was specified but doesn't exist or wasn't loaded
print(f"Warning: ID map file '{id_map_path}' was specified but does not exist. Using original IDs.")
# No mapping is done here, so ID remains the same as Original_ID
return pheno_df
def filter_by_region_ids(data, region_ids):
if region_ids is not None and len(region_ids) > 0:
data = data[data['ID'].isin(region_ids)]
return data
def custom_join(series):
# Initialize an empty list to hold the processed items
result = []
for item in series:
if ',' in item:
# If the item contains commas, split by comma and convert to tuple
result.append(tuple(item.split(',')))
else:
# If the item does not contain commas, add it directly
result.append(item)
# Convert the list of items to a tuple and return
return tuple(result)
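# Example: custom_join(pd.Series(["P42356", "Q9H4A3,P62873"])) returns
# ("P42356", ("Q9H4A3", "P62873")): comma-separated entries become nested tuples.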
def aggregate_phenotype_data(accumulated_pheno_df):
if not accumulated_pheno_df.empty:
accumulated_pheno_df = accumulated_pheno_df.groupby(['#chr','ID','cond','path','cov_path'], as_index=False).agg({
'#chr': lambda x: np.unique(x).astype(str)[0],
'ID': lambda x: np.unique(x).astype(str)[0],
'Original_ID': ','.join,
'start': 'min',
'end': 'max'
}).groupby(['#chr','ID'], as_index=False).agg({
'cond': ','.join,
'path': ','.join,
'Original_ID': custom_join,
'cov_path': ','.join,
'start': 'min',
'end': 'max'
})
return accumulated_pheno_df
def process_cis_files(pheno_files, cov_files, phenotype_names, pheno_id_files, region_ids, preloaded_id_maps):
'''
Example output:
#chr start end ID Original_ID path cov_path cond
chr12 752578 752579 ENSG00000060237 Q9H4A3,P62873 protocol_example.protein_1.bed.gz,protocol_example.protein_2.bed.gz covar_1.gz,covar_2.gz trait_A,trait_B
'''
accumulated_pheno_df = pd.DataFrame()
pheno_id_files = [None] * len(pheno_files) if len(pheno_id_files) == 0 else pheno_id_files
for pheno_path, cov_path, phenotype_name, id_map_path in zip(pheno_files, cov_files, phenotype_names, pheno_id_files):
if not os.path.isfile(cov_path):
raise FileNotFoundError(f"No valid path found for file: {cov_path}")
pheno_df = load_and_apply_id_map(pheno_path, id_map_path, preloaded_id_maps)
pheno_df = filter_by_region_ids(pheno_df, region_ids)
if not pheno_df.empty:
pheno_df.iloc[:, 4] = adapt_file_path_all(pheno_df, pheno_df.columns[4], f"{pheno_path:a}")
pheno_df = pheno_df.assign(cov_path=str(cov_path), cond=phenotype_name)
accumulated_pheno_df = pd.concat([accumulated_pheno_df, pheno_df], ignore_index=True)
accumulated_pheno_df = aggregate_phenotype_data(accumulated_pheno_df)
return accumulated_pheno_df
def process_trans_files(pheno_files, cov_files, phenotype_names, pheno_id_files, region_ids, customized_association_windows):
'''
Example output:
#chr start end ID Original_ID path cov_path cond
chr21 0 0 chr21_18133254_19330300 carnitine,benzoate,hippurate metabolon_1.bed.gz,metabolon_2.bed.gz covar_1.gz,covar_2.gz trait_A,trait_B
'''
if not os.path.isfile(customized_association_windows):
raise ValueError("Customized association analysis window must be specified for trans analysis.")
accumulated_pheno_df = pd.DataFrame()
pheno_id_files = [None] * len(pheno_files) if len(pheno_id_files) == 0 else pheno_id_files
genotype_windows = pd.read_csv(customized_association_windows, comment="#", header=None, names=["#chr","start","end","ID"], sep="\t")
genotype_windows = filter_by_region_ids(genotype_windows, region_ids)
if genotype_windows.empty:
return accumulated_pheno_df
for pheno_path, cov_path, phenotype_name, id_map_path in zip(pheno_files, cov_files, phenotype_names, pheno_id_files):
if not os.path.isfile(cov_path):
raise FileNotFoundError(f"No valid path found for file: {cov_path}")
pheno_df = pd.read_csv(pheno_path, sep=r"\s+", header=0, names=['Original_ID', 'path'])
if not pheno_df.empty:
pheno_df.iloc[:, -1] = adapt_file_path_all(pheno_df, pheno_df.columns[-1], f"{pheno_path:a}")
pheno_df = pheno_df.assign(cov_path=str(cov_path), cond=phenotype_name)
# Combine genotype_windows (which contains "#chr" and "ID") with pheno_df by creating a Cartesian product
pheno_df = pd.merge(genotype_windows.assign(key=1), pheno_df.assign(key=1), on='key').drop('key', axis=1)
# then set start and end columns to zero
pheno_df['start'] = 0
pheno_df['end'] = 0
if id_map_path is not None:
# Filter pheno_df by specific association-window and phenotype pairs
association_analysis_pair = pd.read_csv(id_map_path, sep=r"\s+", header=None, comment='#', names=['ID', 'Original_ID'])
pheno_df = pd.merge(pheno_df, association_analysis_pair, on=['ID', 'Original_ID'])
accumulated_pheno_df = pd.concat([accumulated_pheno_df, pheno_df], ignore_index=True)
accumulated_pheno_df = aggregate_phenotype_data(accumulated_pheno_df)
return accumulated_pheno_df
def load_regional_data(genoFile, phenoFile, covFile, phenotype_names, phenoIDFile,
region_list=None, region_name=None, trans_analysis=False,
customized_association_windows=None, cis_window=-1):
# Ensure region_name is a list
if region_name is None:
region_name = []
# Load genotype meta data
if f"{genoFile:x}" == ".bed":
geno_meta_data = pd.DataFrame([("chr"+str(x), f"{genoFile:a}") for x in range(1,23)] + [("chrX", f"{genoFile:a}")], columns=['#chr', 'geno_path'])
else:
geno_meta_data = pd.read_csv(f"{genoFile:a}", sep=r"\s+", header=0)
geno_meta_data.iloc[:, 1] = adapt_file_path_all(geno_meta_data, geno_meta_data.columns[1], f"{genoFile:a}")
geno_meta_data.columns = ['#chr', 'geno_path']
geno_meta_data['#chr'] = geno_meta_data['#chr'].apply(lambda x: str(x) if str(x).startswith('chr') else f'chr{x}')
# Checking the DataFrame
valid_chr_values = [f'chr{x}' for x in range(1, 23)] + ['chrX']
if not all(value in valid_chr_values for value in geno_meta_data['#chr']):
raise ValueError("Invalid chromosome values found. Allowed values are chr1 to chr22 and chrX.")
region_ids = []
# If region_list is provided, read the file and extract IDs
if region_list is not None and region_list.is_file():
region_list_df = pd.read_csv(region_list, sep=r"\s+", header=None, comment="#")
region_ids = region_list_df.iloc[:, -1].unique() # Extracting the last column for IDs
# If region_name is provided, include those IDs as well
# --region-name A B C will result in a list of ["A", "B", "C"] here
if len(region_name) > 0:
region_ids = list(set(region_ids).union(set(region_name)))
if trans_analysis:
meta_data = process_trans_files(phenoFile, covFile, phenotype_names, phenoIDFile, region_ids, customized_association_windows)
else:
meta_data = process_cis_files(phenoFile, covFile, phenotype_names, phenoIDFile, region_ids, preload_id_map(phenoIDFile))
if not meta_data.empty:
meta_data = meta_data.merge(geno_meta_data, on='#chr', how='inner')
# Adjust association-window
if customized_association_windows is not None and os.path.isfile(customized_association_windows):
print(f"Loading customized association analysis window from {customized_association_windows}")
association_windows_list = pd.read_csv(customized_association_windows, comment="#", header=None,
names=["#chr","start","end","ID"], sep="\t")
meta_data = pd.merge(meta_data, association_windows_list, on=['#chr', 'ID'], how='left', suffixes=('', '_association'))
mismatches = meta_data[meta_data['start_association'].isna()]
if not mismatches.empty:
print("First 5 mismatches:")
print(mismatches[['ID']].head())
raise ValueError(f"{len(mismatches)} regions to analyze cannot be found in ``{customized_association_windows}``. "
f"Please check your ``{customized_association_windows}`` database to make sure it contains all "
f"association-window definitions. ")
else:
if cis_window < 0:
raise ValueError("Please either input valid path to association-window file via ``--customized-association-windows``, "
"or set ``--cis-window`` to a non-negative integer.")
if cis_window == 0:
print("Warning: only variants within the range of start and end of molecular phenotype will be considered "
"since cis_window is set to zero and no customized association window file was found. "
"Please make sure this is by design.")
meta_data['start_association'] = meta_data['start'].apply(lambda x: max(x - cis_window, 0))
meta_data['end_association'] = meta_data['end'] + cis_window
# Example meta_data:
# #chr start end start_association end_association ID Original_ID path cov_path cond coordinate geno_path
# 0 chr12 752578 752579 652578 852579 ENSG00000060237 Q9H4A3,P62873 protocol_example.protein_1.bed.gz,protocol_example.protein_2.bed.gz covar_1.gz,covar_2.gz trait_A,trait_B chr12:752578-752579 protocol_example.genotype.chr21_22.bed
# Create the final dictionary
regional_data = {
'data': [(row['geno_path'], *row['path'].split(','), *row['cov_path'].split(',')) for _, row in meta_data.iterrows()],
'meta_info': [(f"{row['#chr']}:{row['start']}-{row['end']}", # this is the phenotypic region to extract data from
f"{row['#chr']}:{row['start_association']}-{row['end_association']}", # this is the association window region
row['ID'], row['Original_ID'], *row['cond'].split(',')) for _, row in meta_data.iterrows()]
}
else:
regional_data = {'data': [], 'meta_info': []}
return regional_data
# Load regional individual-level data, using the same region parameters as the RSS step below
regional_data = load_regional_data(
genoFile=genoFile,
phenoFile=phenoFile,
covFile=covFile,
phenotype_names=phenotype_names,
phenoIDFile=phenoIDFile,
region_list=region_list,
region_name=region_name,
trans_analysis=trans_analysis,
customized_association_windows=customized_association_windows,
cis_window=cis_window
)
[get_rss_analysis_regions: shared = "regional_rss_data"]
from collections import OrderedDict
def file_exists(file_path, relative_path=None):
"""Check if a file exists at the given path or relative to a specified path."""
if os.path.exists(file_path) and os.path.isfile(file_path):
return True
elif relative_path:
relative_file_path = os.path.join(relative_path, file_path)
return os.path.exists(relative_file_path) and os.path.isfile(relative_file_path)
return False
def check_required_columns(df, required_columns):
"""Check if the required columns are present in the dataframe."""
missing_columns = [col for col in required_columns if col not in df.columns]
if missing_columns:
raise ValueError(f"Missing required columns: {', '.join(missing_columns)}")
def parse_region(region):
"""
Parse a region string in 'chr:start-end' format into a list [chr, start, end].
Returns a list containing [chromosome_with_chr_prefix, start_position, end_position].
"""
# Handle both numeric and 'chr' prefixed chromosome formats
if ':' in region:
chrom, rest = region.split(':')
# Store original chrom value with 'chr' prefix preserved
original_chrom = chrom
# Strip 'chr' prefix if present (for internal processing)
if chrom.startswith('chr'):
chrom = chrom[3:] # Remove 'chr' prefix if present
# Handle both hyphen and underscore separators for start-end
if '-' in rest:
start, end = rest.split('-')
elif '_' in rest:
start, end = rest.split('_')
else:
raise ValueError(f"Invalid region format: {region}. Expected format: chr:start-end or chr:start_end")
# Return with the original chromosome format that includes 'chr' prefix
return [original_chrom, int(start), int(end)]
else:
raise ValueError(f"Invalid region format: {region}. Expected format: chr:start-end or chr:start_end")
def resolve_region_names(region_name, customized_association_windows=None, region_list=None):
"""
Resolve region names by looking them up in either customized_association_windows or region_list.
Returns a dictionary mapping region keys to parsed regions.
"""
region_dict = {}
# If no region names provided, return empty dict
if not region_name:
return region_dict
# Check if either file exists
if not ((customized_association_windows and os.path.isfile(customized_association_windows)) or
(region_list and os.path.isfile(region_list))):
raise ValueError("Either customized association window file or region list file must exist when region names are provided")
# Determine which file to use (prioritize customized_association_windows)
region_file = None
if customized_association_windows and os.path.isfile(customized_association_windows):
region_file = customized_association_windows
print(f"Using customized association window file: {region_file}")
elif region_list and os.path.isfile(region_list):
region_file = region_list
print(f"Using region list file: {region_file}")
# Load region definitions from the file
name_to_region_map = {}
with open(region_file, 'r') as file:
for line in file:
# Skip empty lines and comments
if not line.strip() or line.startswith("#"):
continue
parts = line.strip().split()
if len(parts) >= 4: # Ensure there's at least 4 columns
# Extract ID column (4th column)
region_id = parts[3]
# Extract chromosome, start, and end positions
chrom = parts[0].replace("chr", "") # Remove 'chr' prefix if present
try:
# Create the region info
parsed_region = [int(chrom) if chrom.isdigit() else chrom,
int(parts[1]),
int(parts[2])]
# Include the ID
parsed_region.append(parts[3])
# Add to the mapping
name_to_region_map[region_id] = parsed_region
except ValueError:
# Skip lines with non-integer positions
print(f"Warning: Skipping invalid region definition: {line.strip()}")
# Process each region name
for region in region_name:
# Check if this region name exists in our mapping
if region in name_to_region_map:
parsed_region = name_to_region_map[region]
else:
# Try to parse it as a coordinate string as a last resort
try:
parsed_region = parse_region(region)
print(f"Warning: Region '{region}' not found in file, parsing as coordinate string")
except ValueError:
raise ValueError(f"Could not parse region '{region}' and it was not found in the region file")
# Format the region key, avoiding a double 'chr' prefix when the parsed
# chromosome already carries one
chrom = parsed_region[0] if str(parsed_region[0]).startswith('chr') else f"chr{parsed_region[0]}"
region_key = f"{chrom}:{parsed_region[1]}-{parsed_region[2]}"
if region_key not in region_dict:
region_dict[region_key] = parsed_region
return region_dict
def load_regional_rss_data(gwas_meta_data, gwas_name, gwas_data, column_mapping, ld_data_name, region_name=None, region_list=None):
"""
Extracts data from GWAS metadata files and additional GWAS data provided.
Uses the same region information as load_regional_data to ensure consistent
region identification across both data structures.
IMPORTANT: This function processes the same regions as load_regional_data and
formats the region keys in regional_rss_data["regions"] to exactly match the
2nd element of regional_data["meta_info"] tuples (chr:start-end format).
Args:
- gwas_meta_data (str): File path to the GWAS metadata file.
- gwas_name (list): Vector of GWAS study names.
- gwas_data (list): Vector of GWAS data.
- column_mapping (list, optional): Vector of column mapping files.
- ld_data_name (list): Vector of LD reference data names used to prefix study IDs.
- region_name (list, optional): List of region names (same as in load_regional_data).
- region_list (str, optional): File path to a file containing region mapping between names and coordinates (same file used in load_regional_data).
Returns:
- GWAS Dictionary: Maps study IDs to a list containing chromosome number,
GWAS file path, and optional column mapping file path.
- Region Dictionary: Maps region names to lists [chr, start, end] using the same
key format as regional_data["meta_info"][1] (chr:start-end).
Raises:
- FileNotFoundError: If any specified file path does not exist.
- ValueError: If required columns are missing in the input files or vector lengths mismatch.
"""
# Check vector lengths
if len(gwas_name) != len(gwas_data):
raise ValueError("gwas_name and gwas_data must be of equal length")
if len(column_mapping) > 0 and len(column_mapping) != len(gwas_name):
raise ValueError("If column_mapping is provided, it must be of the same length as gwas_name and gwas_data")
# Required columns for GWAS file type
required_gwas_columns = ['study_id', 'chrom', 'file_path']
# Base directory of the metadata files
gwas_base_dir = os.path.dirname(gwas_meta_data)
# Reading the GWAS metadata file
gwas_df = pd.read_csv(gwas_meta_data, sep="\t")
check_required_columns(gwas_df, required_gwas_columns)
gwas_dict = OrderedDict()
# Handle ld_data_name for study_id modifications
if len(ld_data_name) == 1:
# If one LD name provided, apply to all studies
gwas_df['study_id'] = ld_data_name[0] + '_' + gwas_df['study_id']
elif len(ld_data_name) > 1:
# If multiple LD names provided, ensure they match the number of studies
if len(ld_data_name) != gwas_df.shape[0]:
raise ValueError(f"ld_data_name length ({len(ld_data_name)}) must match the number of studies ({gwas_df.shape[0]})")
# Apply specific LD name to each study using vectorized operations
gwas_df['study_id'] = [f"{ld}_{sid}" for ld, sid in zip(ld_data_name, gwas_df['study_id'])]
###### FIXME: if multiple LD names are provided, ensure they match the number of studies and also match the ld_meta_data.
###### FIXME: this also needs a match file; ld_meta_data should be unique to save computation when loading LD.
# Process additional GWAS data from vectors
for name, data, mapping in zip(gwas_name, gwas_data, column_mapping or [None]*len(gwas_name)):
gwas_dict[name] = {0: [data, mapping]}
for _, row in gwas_df.iterrows():
file_path = row['file_path']
mapping_file = row.get('column_mapping_file')
n_sample = row.get('n_sample')
n_case = row.get('n_case')
n_control = row.get('n_control')
# Check if the file and optional mapping file exist
if not file_exists(file_path, gwas_base_dir) or (mapping_file and not file_exists(mapping_file, gwas_base_dir)):
raise FileNotFoundError(f"File {file_path} not found for {row['study_id']}")
# Adjust paths if necessary
file_path = file_path if file_exists(file_path) else os.path.join(gwas_base_dir, file_path)
if mapping_file:
mapping_file = mapping_file if file_exists(mapping_file) else os.path.join(gwas_base_dir, mapping_file)
# Create or update the entry for the study_id
if row['study_id'] not in gwas_dict:
gwas_dict[row['study_id']] = {}
# Expand chrom 0 to chrom 1-22 or use the specified chrom
chrom_range = range(1, 23) if row['chrom'] == 0 else [row['chrom']]
for chrom in chrom_range:
if chrom in gwas_dict[row['study_id']]:
existing_entry = gwas_dict[row['study_id']][chrom]
raise ValueError(f"Duplicate chromosome specification for study_id {row['study_id']}, chrom {chrom}. "
f"Conflicting entries: {existing_entry} and {[file_path, mapping_file]}")
gwas_dict[row['study_id']][chrom] = [file_path, mapping_file, n_sample, n_case, n_control]
# Process region_list and region_name
region_dict = dict()
if region_name:
for region in region_name:
try:
# First try parsing the region as a coordinate
parsed_region = parse_region(region)
# parse_region preserves an existing 'chr' prefix, so avoid double-prefixing
chrom = parsed_region[0] if str(parsed_region[0]).startswith('chr') else f"chr{parsed_region[0]}"
region_key = f"{chrom}:{parsed_region[1]}-{parsed_region[2]}"
if region_key not in region_dict:
region_dict[region_key] = parsed_region
except ValueError:
# If that fails, it means it's a name that needs to be resolved
resolved_regions = resolve_region_names([region], customized_association_windows, region_list)
region_dict.update(resolved_regions)
else:
# Analyze all regions found in region list
with open(region_list, 'r') as file:
for line in file:
# Skip empty lines and comments
if not line.strip() or line.startswith("#"):
continue
parts = line.strip().split()
if len(parts) == 1:
# Format: "chr:start-end"
region = parse_region(parts[0])
elif len(parts) == 3:
# Format: chr start end
chrom = parts[0]
# Ensure 'chr' prefix is preserved
if not chrom.startswith('chr'):
chrom = f"chr{chrom}"
region = [chrom, int(parts[1]), int(parts[2])]
elif len(parts) >= 4:
# Format for eQTL: chr start end gene_id gene_name ...
chrom = parts[0]
# Ensure 'chr' prefix is preserved
if not chrom.startswith('chr'):
chrom = f"chr{chrom}"
region = [chrom, int(parts[1]), int(parts[2]), parts[3]]
else:
raise ValueError(f"Invalid region format in region_list: {line.strip()}")
# Create key in the same format as regional_data["meta_info"][1]; region[0] may
# already carry the 'chr' prefix, so avoid double-prefixing
chrom = region[0] if str(region[0]).startswith('chr') else f"chr{region[0]}"
region_key = f"{chrom}:{region[1]}-{region[2]}"
region_dict[region_key] = region
return gwas_dict, region_dict
if gwas_meta_data.is_file():
# Load GWAS data with region keys formatted to match regional_data["meta_info"]
# Using the same region_list and region_name as load_regional_data ensures consistency
gwas_dict, region_dict = load_regional_rss_data(gwas_meta_data, gwas_name, gwas_data, column_mapping, ld_data_name, region_name, region_list)
regional_rss_data = dict([("GWAS", gwas_dict), ("regions", region_dict)])
# The keys in regional_rss_data["regions"] are now formatted as "chr:start-end"
# to match the second element of each tuple in regional_data["meta_info"],
# with both formats ensuring the chromosome includes the 'chr' prefix.
# Since we use the same region_list and region_name parameters for both
# load_regional_data and load_regional_rss_data, we guarantee consistent
# region identification across both data structures.
# The expected format for shared keys is: "chr1:12345-67890"
else:
regional_rss_data = dict()
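# Resulting structure (values illustrative):
# regional_rss_data = {
#     "GWAS": {study_id: {chrom: [file_path, mapping_file, n_sample, n_case, n_control]}},
#     "regions": {"chr21:5030458-5030588": ["chr21", 5030458, 5030588]}
# }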
ColocBoost analysis#
[colocboost]
depends: sos_variable("regional_data"), sos_variable("regional_rss_data")
# Stop if no regional data is available to analyze
stop_if(len(regional_data['data']) == 0, f'Either genotype or phenotype data are not available for region {", ".join(region_name)}.')
meta_info = regional_data['meta_info']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta_info"
base_dir = f'{cwd:a}'
chr_info = _meta_info[0].split(":")[0]
gene_id = _meta_info[2]
base_filename = f'{name}.{chr_info}_{gene_id}'
# Initialize output_files list
output_files = []
# Default case: if all analysis flags are False, save data
if not xqtl_coloc and not joint_gwas and not separate_gwas:
save_data = True
if save_data:
output_files.append(f'{base_dir}/data/{base_filename}.cb_data.rds')
if xqtl_coloc:
output_files.append(f'{base_dir}/colocboost/{base_filename}.cb_xqtl.rds')
if joint_gwas:
output_files.append(f'{base_dir}/colocboost/{base_filename}.cb_xqtl_joint_gwas.rds')
if separate_gwas and "GWAS" in regional_rss_data and regional_rss_data["GWAS"]:
for study_name in regional_rss_data["GWAS"].keys():
output_files.append(f'{base_dir}/colocboost/{base_filename}.cb_xqtl_{study_name}.rds')
output: output_files
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'
R: expand = '${ }', stdout = f"{_output[0]:nn}.colocboost.stdout", stderr = f"{_output[0]:nn}.colocboost.stderr", container = container, entrypoint = entrypoint
options(warn=1)
library(colocboost)
library(pecotmr)
##### ======= set up all input parameters ====== #####
# individual level data
region = ${("'%s'" % _meta_info[0]) if int(_meta_info[0].split('-')[-1])>0 else 'NULL'} # if the end position is zero return NULL
genotype_list = ${_input[0]:anr}
phenotype_list = c(${",".join(['"%s"' % x.absolute() for x in _input[1:len(_input)//2+1]])})
covariate_list = c(${",".join(['"%s"' % x.absolute() for x in _input[len(_input)//2+1:]])})
conditions_list_individual = c(${",".join(['"%s"' % x for x in _meta_info[4:]])})
maf_cutoff = ${maf}
mac_cutoff = ${mac}
imiss_cutoff = ${imiss}
association_window = "${_meta_info[1]}"
extract_region_name = list(${",".join([("c('"+x+"')") if isinstance(x, str) else ("c"+ str(x)) for x in _meta_info[3]])})
region_name_col = ${"4" if int(_meta_info[0].split('-')[-1])>0 else "1"}
keep_indel = ${"TRUE" if indel else "FALSE"}
# extract subset of samples
keep_samples = NULL
if (${"TRUE" if keep_samples.is_file() else "FALSE"}) {
keep_samples = unlist(strsplit(readLines(${keep_samples:ar}), "\\s+"))
message(paste(length(keep_samples), "samples are selected to be loaded for analysis"))
}
phenotype_header = ${"4" if int(_meta_info[0].split('-')[-1])>0 else "1"}
scale_residuals = FALSE
keep_variants = ${'"%s"' % keep_variants if not keep_variants.is_dir() else "NULL"} # use "not keep_variants.is_dir()" in case the user provides a wrong filename without realizing it
# - summary statistics
sumstat_path_list = c(${paths([regional_rss_data['GWAS'][x][regional_rss_data['regions'][_meta_info[1]][0]][0] for x in regional_rss_data["GWAS"].keys()] if "GWAS" in regional_rss_data else []):r,})
column_file_path_list = c(${paths([regional_rss_data['GWAS'][x][regional_rss_data['regions'][_meta_info[1]][0]][1] for x in regional_rss_data["GWAS"].keys()] if "GWAS" in regional_rss_data else []):r,})
LD_meta_file_path_list = "${ld_meta_data}"
# _meta_info[1] is the association window region
n_samples = c(${paths([regional_rss_data['GWAS'][x][regional_rss_data['regions'][_meta_info[1]][0]][2] for x in regional_rss_data["GWAS"].keys()] if "GWAS" in regional_rss_data else []):r,}) %>% as.numeric
n_cases = c(${paths([regional_rss_data['GWAS'][x][regional_rss_data['regions'][_meta_info[1]][0]][3] for x in regional_rss_data["GWAS"].keys()] if "GWAS" in regional_rss_data else []):r,}) %>% as.numeric
n_controls = c(${paths([regional_rss_data['GWAS'][x][regional_rss_data['regions'][_meta_info[1]][0]][4] for x in regional_rss_data["GWAS"].keys()] if "GWAS" in regional_rss_data else []):r,}) %>% as.numeric
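# n_samples, n_cases and n_controls come from the optional n_sample, n_case and
# n_control columns of the GWAS meta-data file; entries are missing (NA) when a
# column is absent.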
if("${extract_sumstats_region_name}" == "NULL"){
extract_sumstats_region_name = NULL
} else {
extract_sumstats_region_name = "${extract_sumstats_region_name}"
}
sumstats_region_name_col = ${sumstats_region_name_col}
comment_string = ${"NULL" if comment_string == "NULL" else f"'{comment_string}'"}
qc_method = ${"NULL" if qc_method == "NULL" else "'%s'" % qc_method}
impute = ${"TRUE" if impute else "FALSE"}
impute_opts = list(rcond = ${rcond}, R2_threshold = ${R2_threshold}, minimum_ld = ${minimum_ld}, lamb=${lamb})
### FIXME: there is no remove_indels in the RSS analysis pipeline; should we ignore this?
remove_indels = FALSE
# Summary stats data - Using the 2nd element (association window) from _meta_info directly
studies = c(${', '.join(['"{}"'.format(item) for item in regional_rss_data["GWAS"].keys()] if "GWAS" in regional_rss_data else [])})
ld_data_name <- c(${', '.join(['"{}"'.format(item) for item in ld_data_name] if "GWAS" in regional_rss_data else [])})
### FIXME when we have multiple LD reference in the future we need to revisit this
match_LD_sumstat <- list()
# If we have any LD data names
if (length(ld_data_name) > 0) {
if (length(ld_data_name) == 1) {
match_LD_sumstat[[ld_data_name[1]]] <- studies
} else {
# Otherwise, match studies to their respective LD references
# Assumes 1:1 ordered matching, which is guaranteed by the Python preprocessing above
for (ld in unique(ld_data_name)) {
# Find which positions have this LD name
matches <- which(ld_data_name == ld)
# Assign matching studies to this LD
match_LD_sumstat[[ld]] <- studies[matches]
}
}
}
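# Illustrative result (hypothetical names): with ld_data_name = c("LD_ref") and studies
# c("AD_GWAS_1", "AD_GWAS_2"), match_LD_sumstat becomes list(LD_ref = c("AD_GWAS_1", "AD_GWAS_2")).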
# About analysis
xqtl_coloc = ${"TRUE" if xqtl_coloc else "FALSE"}
joint_gwas = ${"TRUE" if joint_gwas else "FALSE"}
separate_gwas = ${"TRUE" if separate_gwas else "FALSE"}
# About QC parameters
pip_cutoff_to_skip_ind = unlist(list(${",".join(skip_analysis_pip_cutoff)})[conditions_list_individual])
pip_cutoff_to_skip_sumstat = rep(${skip_sumstats_analysis_pip_cutoff}, length(studies)) %>% setNames(studies)
event_filter_rules = ${"NULL" if not event_filter_rules.is_file() else "'%s'" % event_filter_rules}
if (!is.null(event_filter_rules)) {
event_filters <- read.table("${event_filter_rules}")
event_filters <- lapply(1:nrow(event_filters), function(ii) as.list(event_filters[ii,] %>% unlist))
} else { event_filters <- NULL }
# Add computational time
computing_time = list()
# Load regional multitask data
t1 <- Sys.time()
tryCatch({
fdat = load_multitask_regional_data(region = region,
# individual
genotype_list = genotype_list,
phenotype_list = phenotype_list,
covariate_list = covariate_list,
conditions_list_individual = conditions_list_individual,
maf_cutoff = maf_cutoff, mac_cutoff = mac_cutoff,
imiss_cutoff = imiss_cutoff,
association_window = association_window,
extract_region_name = extract_region_name,
region_name_col = region_name_col,
keep_indel = keep_indel, keep_samples = keep_samples,
phenotype_header = phenotype_header, scale_residuals = scale_residuals,
keep_variants = keep_variants,
# summary stats
sumstat_path_list = sumstat_path_list,
column_file_path_list = column_file_path_list,
LD_meta_file_path_list = LD_meta_file_path_list,
match_LD_sumstat = match_LD_sumstat,
conditions_list_sumstat = studies,
n_samples = n_samples, n_cases = n_cases, n_controls = n_controls,
extract_sumstats_region_name = extract_sumstats_region_name, sumstats_region_name_col = sumstats_region_name_col,
comment_string = comment_string)
}, NoSNPsError = function(e) {
message("Error: ", paste(e$message, "${_meta_info[2] + '@' + _meta_info[1]}"))
quit(save="no")
})
t2 <- Sys.time()
computing_time$Loading = t2 - t1
# save data-set
if (${"TRUE" if save_data else "FALSE"}) {
saveRDS(list(${_meta_info[2]} = fdat), "${_output[0]:ann}.cb_data.rds", compress='xz')
}
if ("${_meta_info[2]}" != "${_meta_info[3]}") {
region_name = c("${_meta_info[2]}", c(${",".join([("c('"+x+"')") if isinstance(x, str) else ("c"+ str(x)) for x in _meta_info[3]])}))
} else {
region_name = "${_meta_info[2]}"
}
# - analysis parameters
region_info = list(region_coord=parse_region("${_meta_info[0]}"), grange=parse_region("${_meta_info[1]}"), region_name=region_name)
res <- colocboost_analysis_pipeline(region_data = fdat,
event_filters = event_filters,
# - analysis
xqtl_coloc = xqtl_coloc,
joint_gwas = joint_gwas,
separate_gwas = separate_gwas,
# - individual QC
maf_cutoff = maf_cutoff,
pip_cutoff_to_skip_ind = pip_cutoff_to_skip_ind,
# - sumstat QC
pip_cutoff_to_skip_sumstat = pip_cutoff_to_skip_sumstat,
qc_method = qc_method,
impute = impute,
impute_opts = impute_opts)
# Reorganize outputs
computing_time <- c(computing_time, res$computing_time)
if (xqtl_coloc) {
res$xqtl_coloc$region_info <- region_info
res$xqtl_coloc$computing_time <- computing_time
saveRDS(list("${_meta_info[2]}" = res$xqtl_coloc), "${_output[0]:ann}.cb_xqtl.rds", compress='xz')
}
if (joint_gwas) {
res$joint_gwas$region_info <- region_info
res$joint_gwas$computing_time <- computing_time
saveRDS(list("${_meta_info[2]}" = res$joint_gwas), "${_output[0]:ann}.cb_xqtl_joint_gwas.rds", compress='xz')
}
if (separate_gwas) {
for (name in names(res$separate_gwas)) {
res$separate_gwas[[name]]$region_info <- region_info
res$separate_gwas[[name]]$computing_time <- computing_time
saveRDS(list("${_meta_info[2]}" = res$separate_gwas[[name]]), paste0("${_output[0]:ann}.cb_xqtl_", name, ".rds"), compress='xz')
}
}