Multi-trait colocalization using ColocBoost#
Description#
This notebook performs colocalization for multiple xQTL data-sets in a given region, with or without GWAS summary statistics. The current protocol (February 2025) supports individual-level xQTL data from the same cohort (multiple phenotypes, same genotype), along with GWAS summary statistics and companion LD reference data.
Input#
1. A list of regions to be analyzed (optional); the last column of this file should be the region name.
2. Either a list of per-chromosome genotype files, or one file for genotype data of the entire genome. Genotype data has to be in PLINK `bed` format.
3. Vector of lists of phenotype files per region to be analyzed, in UCSC `bed.gz` format with index in `bed.gz.tbi` format.
4. Vector of covariate files corresponding to the lists above.
5. Customized association windows file for variants (cis or trans). If it is not provided, a fixed-size window will be used around each region (a cis-window).
6. Optionally, a vector of names of the phenotypic conditions in the form of `cond1 cond2 cond3`, separated by whitespace.
7. Optionally, a summary statistics meta-data file and an LD reference meta-data file.
Inputs 2 and 3 should be outputs from the `genotype_per_region` and `annotate_coord` modules in previous preprocessing steps. Input 4 should be the output of the `covariate_preprocessing` pipeline, which contains genotype PCs, phenotypic hidden confounders and fixed covariates.
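Only the last column of the region list is treated as region names, as in this minimal sketch mirroring the parsing in the [global] section below (the file name region_list.txt and its contents are illustrative assumptions):
import pandas as pd

# Whitespace-delimited region list; the LAST column is taken as the region IDs.
region_list_df = pd.read_csv("region_list.txt", sep=r"\s+", header=None, comment="#")
region_ids = region_list_df.iloc[:, -1].unique()
print(region_ids)  # e.g. ['ENSG00000241973_P42356' ...]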
Example genotype, phenotype and association analysis windows#
See this page for examples of these inputs.
Example summary statistics and LD reference#
See this page for examples of these inputs.
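As a rough sketch of what the GWAS meta-data file must contain (the path below comes from the example commands; the required columns come from the workflow implementation, and the optional columns are read when present):
import pandas as pd

# Required columns checked by the workflow; column_mapping_file, n_sample,
# n_case and n_control are optional and used when present.
required_gwas_columns = ['study_id', 'chrom', 'file_path']
gwas_df = pd.read_csv("input/GWAS_sumstat_meta.tsv", sep="\t")
missing = [col for col in required_gwas_columns if col not in gwas_df.columns]
assert not missing, f"Missing required columns: {', '.join(missing)}"
# A chrom value of 0 means the summary statistics file covers the whole genome
# and is expanded internally to chromosomes 1-22.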
About indels#
The option `--no-indel` will remove indels from the analysis. FIXME: Gao needs to provide more guidelines on how to deal with indels in practice.
Output#
For each analysis region, the outputs are various fitted ColocBoost models saved in RDS format.
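The exact set of files depends on the analysis flags; here is a minimal sketch of the naming scheme used by the workflow below (the region and study values are illustrative assumptions):
# Output naming per analysis region: {name}.{chr}_{region_id}.<suffix>.rds
name, chr_info, gene_id = "protocol_example_protein", "chr21", "ENSG00000241973_P42356"
base_filename = f"{name}.{chr_info}_{gene_id}"
expected_outputs = [
    f"output/colocboost/{base_filename}.cb_xqtl.rds",             # --xqtl-coloc
    f"output/colocboost/{base_filename}.cb_xqtl_joint_gwas.rds",  # --joint-gwas
    # one file per GWAS study when --separate-gwas is on, e.g.
    # f"output/colocboost/{base_filename}.cb_xqtl_<study_id>.rds",
    f"output/data/{base_filename}.cb_data.rds",                   # --save-data
]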
Minimal Working Example Steps#
Timing [FIXME]
xQTL-only analysis#
Below we duplicate the phenotype and covariate examples to demonstrate that, when there are multiple phenotypes for the same genotype, this pipeline can analyze all of them (more than two phenotypes are accepted as well).
Here, using `--region-name`, we focus the analysis on 3 genes. In practice, if this parameter is dropped, the union of all regions in all phenotype region lists will be analyzed. Some regions may have no genotype data, in which case the pipeline will output RDS files with a warning message indicating the lack of genotype data to analyze.
Note: the suggested output naming convention is cohort_modality, e.g. ROSMAP_snRNA_pseudobulk.
sos run pipeline/colocboost.ipynb colocboost \
--name protocol_example_protein \
--genoFile input/xqtl_association/protocol_example.genotype.chr21_22.bed \
--phenoFile output/phenotype/protocol_example.protein.region_list.txt \
output/phenotype/protocol_example.protein.region_list.txt \
--covFile output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
--customized-association-windows input/xqtl_association/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
--no-separate-gwas --xqtl-coloc \
--region-name ENSG00000241973_P42356 ENSG00000160209_O00764 ENSG00000100412_Q99798 \
--phenotype-names trait_A trait_B
It is also possible to analyze a selected list of regions using the option `--region-list`. The last column of this file will be used as the list of regions to analyze. Here, for example, we use the same list of regions as the customized association windows:
sos run pipeline/colocboost.ipynb colocboost \
--name protocol_example_protein \
--genoFile input/xqtl_association/protocol_example.genotype.chr21_22.bed \
--phenoFile output/phenotype/protocol_example.protein.region_list.txt \
output/phenotype/protocol_example.protein.region_list.txt \
--covFile output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
--customized-association-windows input/xqtl_association/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
--no-separate-gwas --xqtl-coloc \
--region-list xqtl_association/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
--phenotype-names trait_A trait_B
Note: when both `--region-name` and `--region-list` are used, the union of regions from these parameters will be analyzed.
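Internally this is a plain set union over region IDs, along the lines of this sketch (the file name is an assumption; the logic mirrors the [global] section below):
import pandas as pd

# Union of the last column of --region-list with the --region-name values.
region_ids = pd.read_csv("region_list.txt", sep=r"\s+", header=None, comment="#").iloc[:, -1].unique()
region_name = ["ENSG00000241973_P42356", "ENSG00000160209_O00764"]
region_ids = list(set(region_ids).union(set(region_name)))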
GWAS-xQTL integration#
sos run pipeline/colocboost.ipynb colocboost \
--name protocol_example_protein \
--genoFile input/xqtl_association/protocol_example.genotype.chr21_22.bed \
--phenoFile output/phenotype/protocol_example.protein.region_list.txt \
output/phenotype/protocol_example.protein.region_list.txt \
--covFile output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
--customized-association-windows input/xqtl_association/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
--ld-meta-data input/LD_matrix/ld_meta_file.tsv \
--gwas-meta-data input/GWAS_sumstat_meta.tsv \
--separate-gwas --xqtl-coloc \
--region-name ENSG00000241973_P42356 ENSG00000160209_O00764 ENSG00000100412_Q99798 \
--phenotype-names trait_A trait_B
Troubleshooting#
| Step | Substep | Problem | Possible Reason | Solution |
|---|---|---|---|---|
Command interface#
sos run colocboost.ipynb -h
Workflow implementation#
[global]
# It is required to input the name of the analysis
parameter: name = str
parameter: cwd = path("output")
# A list of file paths for genotype data, or the genotype data itself.
parameter: genoFile = path
# One or multiple lists of file paths for phenotype data.
parameter: phenoFile = paths
# One or multiple lists of file paths for phenotype ID mapping file. The first column should be the original ID, the 2nd column should be the ID to be mapped to.
parameter: phenoIDFile = paths()
# Covariate file path
parameter: covFile = paths
# Summary statistics interface, see `rss_analysis.ipynb` for details
parameter: gwas_meta_data = path()
parameter: ld_meta_data = path()
parameter: gwas_name = []
parameter: gwas_data = []
parameter: column_mapping = []
# Optional: if a region list is provided, the analysis will be focused on the listed regions.
# The LAST column of this file should contain the IDs of the regions to analyze.
# Otherwise, all regions with both genotype and phenotype files will be analyzed.
parameter: region_list = path()
# Optional: if region names are provided,
# the analysis will be focused on the union of the provided region list and region names.
parameter: region_name = []
# Only focus on a subset of samples
parameter: keep_samples = path()
# Only focus on a subset of variants
parameter: keep_variants = path()
# An optional list documenting the custom association window for each region to analyze, with four columns: chr, start, end, region ID (e.g. gene ID).
# If this list is not provided, the default `window` parameter (see below) will be used.
parameter: customized_association_windows = path()
# Specify the cis window, i.e. the radius up- and downstream of the region of interest to analyze, in units of bp.
# When this is set to a negative value, customized_association_windows will be used instead.
parameter: cis_window = -1
# save data object or not
parameter: save_data = False
# Name of phenotypes
parameter: phenotype_names = [f'{x:bn}' for x in phenoFile]
# An indicator of whether this is a trans analysis, i.e., not using phenotypic coordinate information
parameter: trans_analysis = False
parameter: seed = 999
# Remove a variant if more than `imiss` of its individual-level data is missing
parameter: imiss = 1.0
# MAF and variance of X cutoff
parameter: maf = 0.0025
parameter: xvar_cutoff = 0.0
# MAC cutoff, on top of MAF cutoff
parameter: mac = 5
# Remove indels if indel = False
parameter: indel = True
parameter: event_filter_rules = path()
# If this value is not 0, an initial single-effect analysis will be performed
# to determine whether follow-up analysis should continue or simply return NULL.
# If this is negative, a default cutoff is used that is conservative but still useful.
parameter: skip_analysis_pip_cutoff = []
parameter: skip_sumstats_analysis_pip_cutoff = -1.0
# Perform xQTL colocalization
parameter: xqtl_coloc = True
# Perform joint colocalization with many traits
parameter: joint_gwas = False
# Perform separate GWAS-targeted analyses, one GWAS at a time
parameter: separate_gwas = True
# Whether to impute summary statistics for SNPs that are in the LD reference but not in the summary statistics.
parameter: impute = False
parameter: rcond = 0.01
parameter: lamb = 0.01
parameter: R2_threshold = 0.6
parameter: minimum_ld = 5
# summary stats QC methods: rss_qc, dentist, slalom
parameter: qc_method = "NULL"
parameter: extract_sumstats_region_name = "NULL"
# Either an index number in str format or "NULL" as a string
parameter: sumstats_region_name_col = "NULL"
parameter: comment_string = "NULL"
parameter: ld_data_name = []
# Analysis environment settings
parameter: container = ""
parameter: entrypoint= ""
# For cluster jobs, the number of commands to run per job
parameter: job_size = 200
# Wall clock time expected
parameter: walltime = "1h"
# Memory expected
parameter: mem = "20G"
# Number of threads
parameter: numThreads = 1
if len(phenoFile) != len(covFile):
raise ValueError("Number of input phenotypes files must match that of covariates files")
if len(phenoFile) != len(phenotype_names):
raise ValueError("Number of input phenotypes files must match the number of phenotype names")
if len(phenoIDFile) > 0 and len(phenoFile) != len(phenoIDFile):
raise ValueError("Number of input phenotypes files must match the number of phenotype ID mapping files")
if len(skip_analysis_pip_cutoff) == 0:
skip_analysis_pip_cutoff = [0.0] * len(phenoFile)
if len(skip_analysis_pip_cutoff) == 1:
skip_analysis_pip_cutoff = skip_analysis_pip_cutoff * len(phenoFile)
if len(skip_analysis_pip_cutoff) != len(phenoFile):
raise ValueError(f"``skip_analysis_pip_cutoff`` should have either length 1 or length the same as phenotype files ({len(phenoFile)} in this case)")
# make it into an R List string
skip_analysis_pip_cutoff = [f"'{y}'={x}" for x,y in zip(skip_analysis_pip_cutoff, phenotype_names)]
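# For example, with phenotype_names = ['trait_A', 'trait_B'] and cutoffs [0.0, 0.0], this
# yields ["'trait_A'=0.0", "'trait_B'=0.0"], which the R code below joins into
# list('trait_A'=0.0,'trait_B'=0.0).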
def group_by_region(lst, partition):
# Each element of `partition` (i.e. regional_data["data"]) is already a complete
# per-region group of files, so the flattened input `lst` needs no re-grouping.
return partition
import os
import pandas as pd
def adapt_file_path(file_path, reference_file):
"""
Adapt a single file path based on its existence and a reference file's path.
Args:
- file_path (str): The file path to adapt.
- reference_file (str): File path to use as a reference for adaptation.
Returns:
- str: Adapted file path.
Raises:
- FileNotFoundError: If no valid file path is found.
"""
reference_path = os.path.dirname(reference_file)
# Check if the file exists
if os.path.isfile(file_path):
return file_path
# Check file name without path
file_name = os.path.basename(file_path)
if os.path.isfile(file_name):
return file_name
# Check file name in reference file's directory
file_in_ref_dir = os.path.join(reference_path, file_name)
if os.path.isfile(file_in_ref_dir):
return file_in_ref_dir
# Check original file path prefixed with reference file's directory
file_prefixed = os.path.join(reference_path, file_path)
if os.path.isfile(file_prefixed):
return file_prefixed
# If all checks fail, raise an error
raise FileNotFoundError(f"No valid path found for file: {file_path}")
def adapt_file_path_all(df, column_name, reference_file):
return df[column_name].apply(lambda x: adapt_file_path(x, reference_file))
[get_analysis_regions: shared = "regional_data"]
# Input is genoFile, phenoFile, covFile and optionally region_list. If region_list is present, then we only analyze the regions contained in the list.
# regional_data should be a dictionary like:
#{'data': [("genotype_1.bed", "phenotype_1.bed.gz", "covariate_1.gz"), ("genotype_2.bed", "phenotype_1.bed.gz", "phenotype_2.bed.gz", "covariate_1.gz", "covariate_2.gz") ... ],
# 'meta_info': [("chr12:752578-752579","chr12:752577-752580", "gene_1", "trait_1"), ("chr13:852580-852581","chr13:852579-852580", "gene_2", "trait_1", "trait_2") ... ]}
import numpy as np
def preload_id_map(id_map_files):
id_maps = {}
missing_files = []
for id_map_file in id_map_files:
if id_map_file is not None:
if os.path.isfile(id_map_file):
df = pd.read_csv(id_map_file, sep=r"\s+", header=None, comment='#', names=['old_ID', 'new_ID'])
id_maps[id_map_file] = df.set_index('old_ID')['new_ID'].to_dict()
else:
missing_files.append(id_map_file)
# If there are missing files, print a summary
# if missing_files:
# print(f"Warning: {len(missing_files)} ID map file(s) specified but not found:")
# for file in missing_files:
# print(f" - {file}")
return id_maps
def load_and_apply_id_map(pheno_path, id_map_path, preloaded_id_maps):
pheno_df = pd.read_csv(pheno_path, sep=r"\s+", header=0)
pheno_df['Original_ID'] = pheno_df['ID']
# Check if id_map_path is specified
if id_map_path is not None:
# Check if the id_map_path exists in preloaded maps
if id_map_path in preloaded_id_maps:
id_map = preloaded_id_maps[id_map_path]
pheno_df['ID'] = pheno_df['ID'].map(id_map).fillna(pheno_df['ID'])
else:
# ID map file was specified but doesn't exist or wasn't loaded
print(f"Warning: ID map file '{id_map_path}' was specified but does not exist. Using original IDs.")
# No mapping is done here, so ID remains the same as Original_ID
return pheno_df
def filter_by_region_ids(data, region_ids):
if region_ids is not None and len(region_ids) > 0:
data = data[data['ID'].isin(region_ids)]
return data
def custom_join(series):
# Initialize an empty list to hold the processed items
result = []
for item in series:
if ',' in item:
# If the item contains commas, split by comma and convert to tuple
result.append(tuple(item.split(',')))
else:
# If the item does not contain commas, add it directly
result.append(item)
# Convert the list of items to a tuple and return
return tuple(result)
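# Example: custom_join(pd.Series(["P42356", "Q9H4A3,P62873"])) returns
# ("P42356", ("Q9H4A3", "P62873")): comma-separated entries become nested tuples.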
def aggregate_phenotype_data(accumulated_pheno_df):
if not accumulated_pheno_df.empty:
accumulated_pheno_df = accumulated_pheno_df.groupby(['#chr','ID','cond','path','cov_path'], as_index=False).agg({
'#chr': lambda x: np.unique(x).astype(str)[0],
'ID': lambda x: np.unique(x).astype(str)[0],
'Original_ID': ','.join,
'start': 'min',
'end': 'max'
}).groupby(['#chr','ID'], as_index=False).agg({
'cond': ','.join,
'path': ','.join,
'Original_ID': custom_join,
'cov_path': ','.join,
'start': 'min',
'end': 'max'
})
return accumulated_pheno_df
def process_cis_files(pheno_files, cov_files, phenotype_names, pheno_id_files, region_ids, preloaded_id_maps):
'''
Example output:
#chr start end ID Original_ID path cov_path cond
chr12 752578 752579 ENSG00000060237 Q9H4A3,P62873 protocol_example.protein_1.bed.gz,protocol_example.protein_2.bed.gz covar_1.gz,covar_2.gz trait_A,trait_B
'''
accumulated_pheno_df = pd.DataFrame()
pheno_id_files = [None] * len(pheno_files) if len(pheno_id_files) == 0 else pheno_id_files
for pheno_path, cov_path, phenotype_name, id_map_path in zip(pheno_files, cov_files, phenotype_names, pheno_id_files):
if not os.path.isfile(cov_path):
raise FileNotFoundError(f"No valid path found for file: {cov_path}")
pheno_df = load_and_apply_id_map(pheno_path, id_map_path, preloaded_id_maps)
pheno_df = filter_by_region_ids(pheno_df, region_ids)
if not pheno_df.empty:
pheno_df.iloc[:, 4] = adapt_file_path_all(pheno_df, pheno_df.columns[4], f"{pheno_path:a}")
pheno_df = pheno_df.assign(cov_path=str(cov_path), cond=phenotype_name)
accumulated_pheno_df = pd.concat([accumulated_pheno_df, pheno_df], ignore_index=True)
accumulated_pheno_df = aggregate_phenotype_data(accumulated_pheno_df)
return accumulated_pheno_df
def process_trans_files(pheno_files, cov_files, phenotype_names, pheno_id_files, region_ids, customized_association_windows):
'''
Example output:
#chr start end ID Original_ID path cov_path cond
chr21 0 0 chr21_18133254_19330300 carnitine,benzoate,hippurate metabolon_1.bed.gz,metabolon_2.bed.gz covar_1.gz,covar_2.gz trait_A,trait_B
'''
if not os.path.isfile(customized_association_windows):
raise ValueError("Customized association analysis window must be specified for trans analysis.")
accumulated_pheno_df = pd.DataFrame()
pheno_id_files = [None] * len(pheno_files) if len(pheno_id_files) == 0 else pheno_id_files
genotype_windows = pd.read_csv(customized_association_windows, comment="#", header=None, names=["#chr","start","end","ID"], sep="\t")
genotype_windows = filter_by_region_ids(genotype_windows, region_ids)
if genotype_windows.empty:
return accumulated_pheno_df
for pheno_path, cov_path, phenotype_name, id_map_path in zip(pheno_files, cov_files, phenotype_names, pheno_id_files):
if not os.path.isfile(cov_path):
raise FileNotFoundError(f"No valid path found for file: {cov_path}")
pheno_df = pd.read_csv(pheno_path, sep=r"\s+", header=0, names=['Original_ID', 'path'])
if not pheno_df.empty:
pheno_df.iloc[:, -1] = adapt_file_path_all(pheno_df, pheno_df.columns[-1], f"{pheno_path:a}")
pheno_df = pheno_df.assign(cov_path=str(cov_path), cond=phenotype_name)
# Combine genotype_windows (which contains "#chr" and "ID") with pheno_df by creating a Cartesian product
pheno_df = pd.merge(genotype_windows.assign(key=1), pheno_df.assign(key=1), on='key').drop('key', axis=1)
# then set start and end columns to zero
pheno_df['start'] = 0
pheno_df['end'] = 0
if id_map_path is not None:
# Filter pheno_df by specific association-window and phenotype pairs
association_analysis_pair = pd.read_csv(id_map_path, sep=r"\s+", header=None, comment='#', names=['ID', 'Original_ID'])
pheno_df = pd.merge(pheno_df, association_analysis_pair, on=['ID', 'Original_ID'])
accumulated_pheno_df = pd.concat([accumulated_pheno_df, pheno_df], ignore_index=True)
accumulated_pheno_df = aggregate_phenotype_data(accumulated_pheno_df)
return accumulated_pheno_df
def load_regional_data(genoFile, phenoFile, covFile, phenotype_names, phenoIDFile,
region_list=None, region_name=None, trans_analysis=False,
customized_association_windows=None, cis_window=-1):
# Ensure region_name is a list
if region_name is None:
region_name = []
# Load genotype meta data
if f"{genoFile:x}" == ".bed":
geno_meta_data = pd.DataFrame([("chr"+str(x), f"{genoFile:a}") for x in range(1,23)] + [("chrX", f"{genoFile:a}")], columns=['#chr', 'geno_path'])
else:
geno_meta_data = pd.read_csv(f"{genoFile:a}", sep=r"\s+", header=0)
geno_meta_data.iloc[:, 1] = adapt_file_path_all(geno_meta_data, geno_meta_data.columns[1], f"{genoFile:a}")
geno_meta_data.columns = ['#chr', 'geno_path']
geno_meta_data['#chr'] = geno_meta_data['#chr'].apply(lambda x: str(x) if str(x).startswith('chr') else f'chr{x}')
# Checking the DataFrame
valid_chr_values = [f'chr{x}' for x in range(1, 23)] + ['chrX']
if not all(value in valid_chr_values for value in geno_meta_data['#chr']):
raise ValueError("Invalid chromosome values found. Allowed values are chr1 to chr22 and chrX.")
region_ids = []
# If region_list is provided, read the file and extract IDs
if region_list is not None and region_list.is_file():
region_list_df = pd.read_csv(region_list, sep=r"\s+", header=None, comment="#")
region_ids = region_list_df.iloc[:, -1].unique() # Extracting the last column for IDs
# If region_name is provided, include those IDs as well
# --region-name A B C will result in a list of ["A", "B", "C"] here
if len(region_name) > 0:
region_ids = list(set(region_ids).union(set(region_name)))
if trans_analysis:
meta_data = process_trans_files(phenoFile, covFile, phenotype_names, phenoIDFile, region_ids, customized_association_windows)
else:
meta_data = process_cis_files(phenoFile, covFile, phenotype_names, phenoIDFile, region_ids, preload_id_map(phenoIDFile))
if not meta_data.empty:
meta_data = meta_data.merge(geno_meta_data, on='#chr', how='inner')
# Adjust association-window
if customized_association_windows is not None and os.path.isfile(customized_association_windows):
print(f"Loading customized association analysis window from {customized_association_windows}")
association_windows_list = pd.read_csv(customized_association_windows, comment="#", header=None,
names=["#chr","start","end","ID"], sep="\t")
meta_data = pd.merge(meta_data, association_windows_list, on=['#chr', 'ID'], how='left', suffixes=('', '_association'))
mismatches = meta_data[meta_data['start_association'].isna()]
if not mismatches.empty:
print("First 5 mismatches:")
print(mismatches[['ID']].head())
raise ValueError(f"{len(mismatches)} regions to analyze cannot be found in ``{customized_association_windows}``. "
f"Please check your ``{customized_association_windows}`` database to make sure it contains all "
f"association-window definitions. ")
else:
if cis_window < 0:
raise ValueError("Please either input valid path to association-window file via ``--customized-association-windows``, "
"or set ``--cis-window`` to a non-negative integer.")
if cis_window == 0:
print("Warning: only variants within the range of start and end of molecular phenotype will be considered "
"since cis_window is set to zero and no customized association window file was found. "
"Please make sure this is by design.")
meta_data['start_association'] = meta_data['start'].apply(lambda x: max(x - cis_window, 0))
meta_data['end_association'] = meta_data['end'] + cis_window
# Example meta_data:
# #chr start end start_association end_association ID Original_ID path cov_path cond coordinate geno_path
# 0 chr12 752578 752579 652578 852579 ENSG00000060237 Q9H4A3,P62873 protocol_example.protein_1.bed.gz,protocol_example.protein_2.bed.gz covar_1.gz,covar_2.gz trait_A,trait_B chr12:752578-752579 protocol_example.genotype.chr21_22.bed
# Create the final dictionary
regional_data = {
'data': [(row['geno_path'], *row['path'].split(','), *row['cov_path'].split(',')) for _, row in meta_data.iterrows()],
'meta_info': [(f"{row['#chr']}:{row['start']}-{row['end']}", # this is the phenotypic region to extract data from
f"{row['#chr']}:{row['start_association']}-{row['end_association']}", # this is the association window region
row['ID'], row['Original_ID'], *row['cond'].split(',')) for _, row in meta_data.iterrows()]
}
else:
regional_data = {'data': [], 'meta_info': []}
return regional_data
# Load regional individual-level data, using the same region parameters as the RSS step below
regional_data = load_regional_data(
genoFile=genoFile,
phenoFile=phenoFile,
covFile=covFile,
phenotype_names=phenotype_names,
phenoIDFile=phenoIDFile,
region_list=region_list,
region_name=region_name,
trans_analysis=trans_analysis,
customized_association_windows=customized_association_windows,
cis_window=cis_window
)
[get_rss_analysis_regions: shared = "regional_rss_data"]
from collections import OrderedDict
def file_exists(file_path, relative_path=None):
"""Check if a file exists at the given path or relative to a specified path."""
if os.path.exists(file_path) and os.path.isfile(file_path):
return True
elif relative_path:
relative_file_path = os.path.join(relative_path, file_path)
return os.path.exists(relative_file_path) and os.path.isfile(relative_file_path)
return False
def check_required_columns(df, required_columns):
"""Check if the required columns are present in the dataframe."""
missing_columns = [col for col in required_columns if col not in df.columns]
if missing_columns:
raise ValueError(f"Missing required columns: {', '.join(missing_columns)}")
def parse_region(region):
"""
Parse a region string in 'chr:start-end' format into a list [chr, start, end].
Returns a list containing [chromosome_with_chr_prefix, start_position, end_position].
"""
# Handle both numeric and 'chr' prefixed chromosome formats
if ':' in region:
chrom, rest = region.split(':')
# Store original chrom value with 'chr' prefix preserved
original_chrom = chrom
# Strip 'chr' prefix if present (for internal processing)
if chrom.startswith('chr'):
chrom = chrom[3:] # Remove 'chr' prefix if present
# Handle both hyphen and underscore separators for start-end
if '-' in rest:
start, end = rest.split('-')
elif '_' in rest:
start, end = rest.split('_')
else:
raise ValueError(f"Invalid region format: {region}. Expected format: chr:start-end or chr:start_end")
# Return with the original chromosome format that includes 'chr' prefix
return [original_chrom, int(start), int(end)]
else:
raise ValueError(f"Invalid region format: {region}. Expected format: chr:start-end or chr:start_end")
def resolve_region_names(region_name, customized_association_windows=None, region_list=None):
"""
Resolve region names by looking them up in either customized_association_windows or region_list.
Returns a dictionary mapping region keys to parsed regions.
"""
region_dict = {}
# If no region names provided, return empty dict
if not region_name:
return region_dict
# Check if either file exists
if not ((customized_association_windows and os.path.isfile(customized_association_windows)) or
(region_list and os.path.isfile(region_list))):
raise ValueError("Either customized association window file or region list file must exist when region names are provided")
# Determine which file to use (prioritize customized_association_windows)
region_file = None
if customized_association_windows and os.path.isfile(customized_association_windows):
region_file = customized_association_windows
print(f"Using customized association window file: {region_file}")
elif region_list and os.path.isfile(region_list):
region_file = region_list
print(f"Using region list file: {region_file}")
# Load region definitions from the file
name_to_region_map = {}
with open(region_file, 'r') as file:
for line in file:
# Skip empty lines and comments
if not line.strip() or line.startswith("#"):
continue
parts = line.strip().split()
if len(parts) >= 4: # Ensure there's at least 4 columns
# Extract ID column (4th column)
region_id = parts[3]
# Extract chromosome, start, and end positions
chrom = parts[0].replace("chr", "") # Remove 'chr' prefix if present
try:
# Create the region info
parsed_region = [int(chrom) if chrom.isdigit() else chrom,
int(parts[1]),
int(parts[2])]
# Include the ID
parsed_region.append(parts[3])
# Add to the mapping
name_to_region_map[region_id] = parsed_region
except ValueError:
# Skip lines with non-integer positions
print(f"Warning: Skipping invalid region definition: {line.strip()}")
# Process each region name
for region in region_name:
# Check if this region name exists in our mapping
if region in name_to_region_map:
parsed_region = name_to_region_map[region]
else:
# Try to parse it as a coordinate string as a last resort
try:
parsed_region = parse_region(region)
print(f"Warning: Region '{region}' not found in file, parsing as coordinate string")
except ValueError:
raise ValueError(f"Could not parse region '{region}' and it was not found in the region file")
# Format the region key, avoiding a double 'chr' prefix when the parsed
# chromosome already carries one
chrom = parsed_region[0] if str(parsed_region[0]).startswith('chr') else f"chr{parsed_region[0]}"
region_key = f"{chrom}:{parsed_region[1]}-{parsed_region[2]}"
if region_key not in region_dict:
region_dict[region_key] = parsed_region
return region_dict
def load_regional_rss_data(gwas_meta_data, gwas_name, gwas_data, column_mapping, ld_data_name, region_name=None, region_list=None):
"""
Extracts data from GWAS metadata files and additional GWAS data provided.
Uses the same region information as load_regional_data to ensure consistent
region identification across both data structures.
IMPORTANT: This function processes the same regions as load_regional_data and
formats the region keys in regional_rss_data["regions"] to exactly match the
2nd element of regional_data["meta_info"] tuples (chr:start-end format).
Args:
- gwas_meta_data (str): File path to the GWAS metadata file.
- gwas_name (list): Vector of GWAS study names.
- gwas_data (list): Vector of GWAS data.
- column_mapping (list, optional): Vector of column mapping files.
- ld_data_name (list): Vector of LD reference data names used to prefix study IDs.
- region_name (list, optional): List of region names (same as in load_regional_data).
- region_list (str, optional): File path to a file containing region mapping between names and coordinates (same file used in load_regional_data).
Returns:
- GWAS Dictionary: Maps study IDs to a list containing chromosome number,
GWAS file path, and optional column mapping file path.
- Region Dictionary: Maps region names to lists [chr, start, end] using the same
key format as regional_data["meta_info"][1] (chr:start-end).
Raises:
- FileNotFoundError: If any specified file path does not exist.
- ValueError: If required columns are missing in the input files or vector lengths mismatch.
"""
# Check vector lengths
if len(gwas_name) != len(gwas_data):
raise ValueError("gwas_name and gwas_data must be of equal length")
if len(column_mapping) > 0 and len(column_mapping) != len(gwas_name):
raise ValueError("If column_mapping is provided, it must be of the same length as gwas_name and gwas_data")
# Required columns for GWAS file type
required_gwas_columns = ['study_id', 'chrom', 'file_path']
# Base directory of the metadata files
gwas_base_dir = os.path.dirname(gwas_meta_data)
# Reading the GWAS metadata file
gwas_df = pd.read_csv(gwas_meta_data, sep="\t")
check_required_columns(gwas_df, required_gwas_columns)
gwas_dict = OrderedDict()
# Handle ld_data_name for study_id modifications
if len(ld_data_name) == 1:
# If one LD name provided, apply to all studies
gwas_df['study_id'] = ld_data_name[0] + '_' + gwas_df['study_id']
elif len(ld_data_name) > 1:
# If multiple LD names provided, ensure they match the number of studies
if len(ld_data_name) != gwas_df.shape[0]:
raise ValueError(f"ld_data_name length ({len(ld_data_name)}) must match the number of studies ({gwas_df.shape[0]})")
# Apply specific LD name to each study using vectorized operations
gwas_df['study_id'] = [f"{ld}_{sid}" for ld, sid in zip(ld_data_name, gwas_df['study_id'])]
###### FIXME: if multiple LD names are provided, ensure they match the number of studies and also match the ld_meta_data.
###### FIXME: this also needs a match file; ld_meta_data should be unique to save computation when loading LD.
# Process additional GWAS data from vectors
for name, data, mapping in zip(gwas_name, gwas_data, column_mapping or [None]*len(gwas_name)):
gwas_dict[name] = {0: [data, mapping]}
for _, row in gwas_df.iterrows():
file_path = row['file_path']
mapping_file = row.get('column_mapping_file')
n_sample = row.get('n_sample')
n_case = row.get('n_case')
n_control = row.get('n_control')
# Check if the file and optional mapping file exist
if not file_exists(file_path, gwas_base_dir) or (mapping_file and not file_exists(mapping_file, gwas_base_dir)):
raise FileNotFoundError(f"File {file_path} not found for {row['study_id']}")
# Adjust paths if necessary
file_path = file_path if file_exists(file_path) else os.path.join(gwas_base_dir, file_path)
if mapping_file:
mapping_file = mapping_file if file_exists(mapping_file) else os.path.join(gwas_base_dir, mapping_file)
# Create or update the entry for the study_id
if row['study_id'] not in gwas_dict:
gwas_dict[row['study_id']] = {}
# Expand chrom 0 to chrom 1-22 or use the specified chrom
chrom_range = range(1, 23) if row['chrom'] == 0 else [row['chrom']]
for chrom in chrom_range:
if chrom in gwas_dict[row['study_id']]:
existing_entry = gwas_dict[row['study_id']][chrom]
raise ValueError(f"Duplicate chromosome specification for study_id {row['study_id']}, chrom {chrom}. "
f"Conflicting entries: {existing_entry} and {[file_path, mapping_file]}")
gwas_dict[row['study_id']][chrom] = [file_path, mapping_file, n_sample, n_case, n_control]
# Process region_list and region_name
region_dict = dict()
if region_name:
for region in region_name:
try:
# First try parsing the region as a coordinate
parsed_region = parse_region(region)
# parse_region preserves an existing 'chr' prefix, so avoid double-prefixing
chrom = parsed_region[0] if str(parsed_region[0]).startswith('chr') else f"chr{parsed_region[0]}"
region_key = f"{chrom}:{parsed_region[1]}-{parsed_region[2]}"
if region_key not in region_dict:
region_dict[region_key] = parsed_region
except ValueError:
# If that fails, it means it's a name that needs to be resolved
resolved_regions = resolve_region_names([region], customized_association_windows, region_list)
region_dict.update(resolved_regions)
else:
# Analyze all regions found in region list
with open(region_list, 'r') as file:
for line in file:
# Skip empty lines and comments
if not line.strip() or line.startswith("#"):
continue
parts = line.strip().split()
if len(parts) == 1:
# Format: "chr:start-end"
region = parse_region(parts[0])
elif len(parts) == 3:
# Format: chr start end
chrom = parts[0]
# Ensure 'chr' prefix is preserved
if not chrom.startswith('chr'):
chrom = f"chr{chrom}"
region = [chrom, int(parts[1]), int(parts[2])]
elif len(parts) >= 4:
# Format for eQTL: chr start end gene_id gene_name ...
chrom = parts[0]
# Ensure 'chr' prefix is preserved
if not chrom.startswith('chr'):
chrom = f"chr{chrom}"
region = [chrom, int(parts[1]), int(parts[2]), parts[3]]
else:
raise ValueError(f"Invalid region format in region_list: {line.strip()}")
# Create key in the same format as regional_data["meta_info"][1]; region[0] may
# already carry the 'chr' prefix, so avoid double-prefixing
chrom = region[0] if str(region[0]).startswith('chr') else f"chr{region[0]}"
region_key = f"{chrom}:{region[1]}-{region[2]}"
region_dict[region_key] = region
return gwas_dict, region_dict
if gwas_meta_data.is_file():
# Load GWAS data with region keys formatted to match regional_data["meta_info"]
# Using the same region_list and region_name as load_regional_data ensures consistency
gwas_dict, region_dict = load_regional_rss_data(gwas_meta_data, gwas_name, gwas_data, column_mapping, ld_data_name, region_name, region_list)
regional_rss_data = dict([("GWAS", gwas_dict), ("regions", region_dict)])
# The keys in regional_rss_data["regions"] are now formatted as "chr:start-end"
# to match the second element of each tuple in regional_data["meta_info"],
# with both formats ensuring the chromosome includes the 'chr' prefix.
# Since we use the same region_list and region_name parameters for both
# load_regional_data and load_regional_rss_data, we guarantee consistent
# region identification across both data structures.
# The expected format for shared keys is: "chr1:12345-67890"
else:
regional_rss_data = dict()
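# Resulting structure (values illustrative):
# regional_rss_data = {
#     "GWAS": {study_id: {chrom: [file_path, mapping_file, n_sample, n_case, n_control]}},
#     "regions": {"chr21:5030458-5030588": ["chr21", 5030458, 5030588]}
# }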
ColocBoost analysis#
[colocboost]
depends: sos_variable("regional_data"), sos_variable("regional_rss_data")
# Stop if no regional data is available to analyze
stop_if(len(regional_data['data']) == 0, f'Either genotype or phenotype data are not available for region {", ".join(region_name)}.')
meta_info = regional_data['meta_info']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta_info"
base_dir = f'{cwd:a}'
chr_info = _meta_info[0].split(":")[0]
gene_id = _meta_info[2]
base_filename = f'{name}.{chr_info}_{gene_id}'
# Initialize output_files list
output_files = []
# Default case: if all analysis flags are False, save data
if not xqtl_coloc and not joint_gwas and not separate_gwas:
save_data = True
if save_data:
output_files.append(f'{base_dir}/data/{base_filename}.cb_data.rds')
if xqtl_coloc:
output_files.append(f'{base_dir}/colocboost/{base_filename}.cb_xqtl.rds')
if joint_gwas:
output_files.append(f'{base_dir}/colocboost/{base_filename}.cb_xqtl_joint_gwas.rds')
if separate_gwas and "GWAS" in regional_rss_data and regional_rss_data["GWAS"]:
for study_name in regional_rss_data["GWAS"].keys():
output_files.append(f'{base_dir}/colocboost/{base_filename}.cb_xqtl_{study_name}.rds')
output: output_files
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'
R: expand = '${ }', stdout = f"{_output[0]:nn}.colocboost.stdout", stderr = f"{_output[0]:nn}.colocboost.stderr", container = container, entrypoint = entrypoint
options(warn=1)
library(colocboost)
library(pecotmr)
##### ======= set up all input parameters ====== #####
# individual level data
region = ${("'%s'" % _meta_info[0]) if int(_meta_info[0].split('-')[-1])>0 else 'NULL'} # if the end position is zero return NULL
genotype_list = ${_input[0]:anr}
phenotype_list = c(${",".join(['"%s"' % x.absolute() for x in _input[1:len(_input)//2+1]])})
covariate_list = c(${",".join(['"%s"' % x.absolute() for x in _input[len(_input)//2+1:]])})
conditions_list_individual = c(${",".join(['"%s"' % x for x in _meta_info[4:]])})
maf_cutoff = ${maf}
mac_cutoff = ${mac}
imiss_cutoff = ${imiss}
association_window = "${_meta_info[1]}"
extract_region_name = list(${",".join([("c('"+x+"')") if isinstance(x, str) else ("c"+ str(x)) for x in _meta_info[3]])})
region_name_col = ${"4" if int(_meta_info[0].split('-')[-1])>0 else "1"}
keep_indel = ${"TRUE" if indel else "FALSE"}
# extract subset of samples
keep_samples = NULL
if (${"TRUE" if keep_samples.is_file() else "FALSE"}) {
keep_samples = unlist(strsplit(readLines(${keep_samples:ar}), "\\s+"))
message(paste(length(keep_samples), "samples are selected to be loaded for analysis"))
}
phenotype_header = ${"4" if int(_meta_info[0].split('-')[-1])>0 else "1"}
scale_residuals = FALSE
keep_variants = ${'"%s"' % keep_variants if not keep_variants.is_dir() else "NULL"} # use "not keep_variants.is_dir()" in case the user provides a wrong filename without realizing it
# - summary statistics
sumstat_path_list = c(${paths([regional_rss_data['GWAS'][x][regional_rss_data['regions'][_meta_info[1]][0]][0] for x in regional_rss_data["GWAS"].keys()] if "GWAS" in regional_rss_data else []):r,})
column_file_path_list = c(${paths([regional_rss_data['GWAS'][x][regional_rss_data['regions'][_meta_info[1]][0]][1] for x in regional_rss_data["GWAS"].keys()] if "GWAS" in regional_rss_data else []):r,})
LD_meta_file_path_list = "${ld_meta_data}"
# _meta_info[1] is the association window region
n_samples = c(${paths([regional_rss_data['GWAS'][x][regional_rss_data['regions'][_meta_info[1]][0]][2] for x in regional_rss_data["GWAS"].keys()] if "GWAS" in regional_rss_data else []):r,}) %>% as.numeric
n_cases = c(${paths([regional_rss_data['GWAS'][x][regional_rss_data['regions'][_meta_info[1]][0]][3] for x in regional_rss_data["GWAS"].keys()] if "GWAS" in regional_rss_data else []):r,}) %>% as.numeric
n_controls = c(${paths([regional_rss_data['GWAS'][x][regional_rss_data['regions'][_meta_info[1]][0]][4] for x in regional_rss_data["GWAS"].keys()] if "GWAS" in regional_rss_data else []):r,}) %>% as.numeric
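# n_samples, n_cases and n_controls come from the optional n_sample, n_case and
# n_control columns of the GWAS meta-data file; entries are missing (NA) when a
# column is absent.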
if("${extract_sumstats_region_name}" == "NULL"){
extract_sumstats_region_name = NULL
} else {
extract_sumstats_region_name = "${extract_sumstats_region_name}"
}
sumstats_region_name_col = ${sumstats_region_name_col}
comment_string = ${"NULL" if comment_string == "NULL" else f"'{comment_string}'"}
qc_method = ${"NULL" if qc_method == "NULL" else "'%s'" % qc_method}
impute = ${"TRUE" if impute else "FALSE"}
impute_opts = list(rcond = ${rcond}, R2_threshold = ${R2_threshold}, minimum_ld = ${minimum_ld}, lamb=${lamb})
### FIXME: there is no remove_indels in the RSS analysis pipeline; should we ignore this?
remove_indels = FALSE
# Summary stats data - Using the 2nd element (association window) from _meta_info directly
studies = c(${', '.join(['"{}"'.format(item) for item in regional_rss_data["GWAS"].keys()] if "GWAS" in regional_rss_data else [])})
ld_data_name <- c(${', '.join(['"{}"'.format(item) for item in ld_data_name] if "GWAS" in regional_rss_data else [])})
### FIXME when we have multiple LD reference in the future we need to revisit this
match_LD_sumstat <- list()
# If we have any LD data names
if (length(ld_data_name) > 0) {
if (length(ld_data_name) == 1) {
match_LD_sumstat[[ld_data_name[1]]] <- studies
} else {
# Otherwise, match studies to their respective LD references
# Assumes 1:1 ordered matching, which is guaranteed by the Python preprocessing above
for (ld in unique(ld_data_name)) {
# Find which positions have this LD name
matches <- which(ld_data_name == ld)
# Assign matching studies to this LD
match_LD_sumstat[[ld]] <- studies[matches]
}
}
}
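# Illustrative result (hypothetical names): with ld_data_name = c("LD_ref") and studies
# c("AD_GWAS_1", "AD_GWAS_2"), match_LD_sumstat becomes list(LD_ref = c("AD_GWAS_1", "AD_GWAS_2")).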
# About analysis
xqtl_coloc = ${"TRUE" if xqtl_coloc else "FALSE"}
joint_gwas = ${"TRUE" if joint_gwas else "FALSE"}
separate_gwas = ${"TRUE" if separate_gwas else "FALSE"}
# About QC parameters
pip_cutoff_to_skip_ind = unlist(list(${",".join(skip_analysis_pip_cutoff)})[conditions_list_individual])
pip_cutoff_to_skip_sumstat = rep(${skip_sumstats_analysis_pip_cutoff}, length(studies)) %>% setNames(studies)
event_filter_rules = ${"NULL" if not event_filter_rules.is_file() else "'%s'" % event_filter_rules}
if (!is.null(event_filter_rules)) {
event_filters <- read.table("${event_filter_rules}")
event_filters <- lapply(1:nrow(event_filters), function(ii) as.list(event_filters[ii,] %>% unlist))
} else { event_filters <- NULL }
# Add computational time
computing_time = list()
# Load regional multitask data
t1 <- Sys.time()
tryCatch({
fdat = load_multitask_regional_data(region = region,
# individual
genotype_list = genotype_list,
phenotype_list = phenotype_list,
covariate_list = covariate_list,
conditions_list_individual = conditions_list_individual,
maf_cutoff = maf_cutoff, mac_cutoff = mac_cutoff,
imiss_cutoff = imiss_cutoff,
association_window = association_window,
extract_region_name = extract_region_name,
region_name_col = region_name_col,
keep_indel = keep_indel, keep_samples = keep_samples,
phenotype_header = phenotype_header, scale_residuals = scale_residuals,
keep_variants = keep_variants,
# summary stats
sumstat_path_list = sumstat_path_list,
column_file_path_list = column_file_path_list,
LD_meta_file_path_list = LD_meta_file_path_list,
match_LD_sumstat = match_LD_sumstat,
conditions_list_sumstat = studies,
n_samples = n_samples, n_cases = n_cases, n_controls = n_controls,
extract_sumstats_region_name = extract_sumstats_region_name, sumstats_region_name_col = sumstats_region_name_col,
comment_string = comment_string)
}, NoSNPsError = function(e) {
message("Error: ", paste(e$message, "${_meta_info[2] + '@' + _meta_info[1]}"))
quit(save="no")
})
t2 <- Sys.time()
computing_time$Loading = t2 - t1
# save data-set
if (${"TRUE" if save_data else "FALSE"}) {
saveRDS(list(${_meta_info[2]} = fdat), "${_output[0]:ann}.cb_data.rds", compress='xz')
}
if ("${_meta_info[2]}" != "${_meta_info[3]}") {
region_name = c("${_meta_info[2]}", c(${",".join([("c('"+x+"')") if isinstance(x, str) else ("c"+ str(x)) for x in _meta_info[3]])}))
} else {
region_name = "${_meta_info[2]}"
}
# - analysis parameters
region_info = list(region_coord=parse_region("${_meta_info[0]}"), grange=parse_region("${_meta_info[1]}"), region_name=region_name)
res <- colocboost_analysis_pipeline(region_data = fdat,
event_filters = event_filters,
# - analysis
xqtl_coloc = xqtl_coloc,
joint_gwas = joint_gwas,
separate_gwas = separate_gwas,
# - individual QC
maf_cutoff = maf_cutoff,
pip_cutoff_to_skip_ind = pip_cutoff_to_skip_ind,
# - sumstat QC
pip_cutoff_to_skip_sumstat = pip_cutoff_to_skip_sumstat,
qc_method = qc_method,
impute = impute,
impute_opts = impute_opts)
# Reorganize outputs
computing_time <- c(computing_time, res$computing_time)
if (xqtl_coloc) {
res$xqtl_coloc$region_info <- region_info
res$xqtl_coloc$computing_time <- computing_time
saveRDS(list("${_meta_info[2]}" = res$xqtl_coloc), "${_output[0]:ann}.cb_xqtl.rds", compress='xz')
}
if (joint_gwas) {
res$joint_gwas$region_info <- region_info
res$joint_gwas$computing_time <- computing_time
saveRDS(list("${_meta_info[2]}" = res$joint_gwas), "${_output[0]:ann}.cb_xqtl_joint_gwas.rds", compress='xz')
}
if (separate_gwas) {
for (name in names(res$separate_gwas)) {
res$separate_gwas[[name]]$region_info <- region_info
res$separate_gwas[[name]]$computing_time <- computing_time
saveRDS(list("${_meta_info[2]}" = res$separate_gwas[[name]]), paste0("${_output[0]:ann}.cb_xqtl_", name, ".rds"), compress='xz')
}
}