xQTL-GWAS pairwise enrichment and colocalization#
Description [FIXME]#
This workflow processes fine-mapping results for xQTL, generated by susie_twas
in the mnm_regression.ipynb
notebook for cis xQTL, and GWAS fine-mapping results produced by susie_rss
in the rss_analysis.ipynb
notebook. It is designed to perform enrichment and colocalization analysis, particularly when fine-mapping results originate from different regions in the case of cis-xQTL and GWAS. The pipeline is capable to integrate and analyze data across these distinct regions. Originally tailored for cis-xQTL and GWAS integration, this pipeline can be applied to other pairwise integrations. An example of such application is in trans analysis, where the fine-mapped regions might be identical between trans-xQTL and GWAS, representing a special case of this broader implementation.
Input#
Lists of SuSiE fine-mapping output objects, in RDS format, of class(susie)
in R.
For xQTL the list is meta-data of format:
chr
,start
,end
,original_data
,conditions_top_loci
,block_top_loci
whereoriginal_data
is an RDS file,conditions_top_loci
is showing which contexts have top loci table (potential signals)block_top_loci
is the blocks have overlapped top loci variant with xQTL data. That file could be output fromfine_mapping_post_processing/overlap_qtl_gwas
.For GWAS the list is meta-data of format:
chr
,start
,end
,original_data
,block_top_loci...
whereoriginal_data
is an RDS file. That file could be output fromfine_mapping_post_processing/gwas_results_export
.Context meta is a metafile that shows the analysis_names and the contained contexts. It can be used to identify the corresponding raw data for each context.
--gwas_meta_data
output fromfine_mapping_post_processing/gwas_results_export
#chr start end region_id TSS original_data combined_data combined_data_sumstats conditions conditions_top_loci
chr1 101384274 104443097 1_101384274-104443097 NA 1_101384274-104443097.susie_rss.rds gwas.1_101384274-104443097.cis_results_db.export.rds gwas.1_101384274-104443097.cis_results_db.export_sumstats.rds NA NA
chr19 44935906 46842901 19_44935906-46842901 NA 19_44935906-46842901.susie_rss.rds gwas.19_44935906-46842901.cis_results_db.export.rds gwas.19_44935906-46842901.cis_results_db.export_sumstats.rds AD_Bellenguez_2022,AD_Jansen_2021,AD_Kunkle_Stage1_2019,AD_Wightman_Full_2021,AD_Wightman_Excluding23andMe_2021,AD_Wightman_ExcludingUKBand23andME_2021 AD_Wightman_ExcludingUKBand23andME_2021.qc_impute,AD_Wightman_ExcludingUKBand23andME_2021.qc_only AD_Wightman_ExcludingUKBand23andME_2021.qc_impute
--xqtl_meta_data
for enrichment: output from fine_mapping_post_processing/cis_results_export
; for coloc: output from fine_mapping_post_processing/overlap_qtl_gwas
, the difference between them is without or with last 2 columns: block_top_loci
, final_combined_data
to document overlapped block info at variant level
#chr start end region_id TSS original_data combined_data combined_data_sumstats conditions conditions_top_loci block_top_loci final_combined_data
chr1 167600000 171480000 ENSG00000000457 169894266 MSBB_eQTL.ENSG00000000457.univariate_susie_twas_weights.rds, ROSMAP_Kellis_eQTL.ENSG00000000457.univariate_susie_twas_weights.rds Fungen_xQTL.ENSG00000000457.cis_results_db.export.rds Fungen_xQTL.ENSG00000000457.cis_results_db.export_sumstats.rds BM_10_MSBB_eQTL,BM_22_MSBB_eQTL,BM_36_MSBB_eQTL,BM_44_MSBB_eQTL,Ast_Kellis_eQTL BM_22_MSBB_eQTL,BM_44_MSBB_eQTL
chr19 41840000 47960000 ENSG00000130203 44905790 ROSMAP_Kellis_eQTL.ENSG00000130203.univariate_susie_twas_weights.rds, ROSMAP_Bennett_Klein_pQTL.ENSG00000130203.univariate_susie_twas_weights.rds Fungen_xQTL.ENSG00000130203.cis_results_db.export.rds Fungen_xQTL.ENSG00000130203.cis_results_db.export_sumstats.rds Mic_Kellis_eQTL,DLPFC_Bennett_pQTL Mic_Kellis_eQTL 19_44935906-46842901 Fungen_xQTL.ENSG00000130203.cis_results_db.export.overlapped.gwas.rds
chr20 49934867 53560000 ENSG00000000419 50958554 MSBB_eQTL.ENSG00000000419.univariate_susie_twas_weights.rds Fungen_xQTL.ENSG00000000419.cis_results_db.export.rds Fungen_xQTL.ENSG00000000419.cis_results_db.export_sumstats.rds BM_10_MSBB_eQTL,BM_22_MSBB_eQTL BM_22_MSBB_eQTL Fungen_xQTL.ENSG00000000419.cis_results_db.export.overlapped.gwas.rds
--context_meta
should be updated as more analysis is done, we used analysis_name and context in this analysis
analysis_name cohort context
KNIGHT_pQTL KNIGHT Knight_pQTL_brain,Knight_eQTL_brain
MSBB_eQTL MSBB BM_10_MSBB_eQTL,BM_22_MSBB_eQTL,BM_36_MSBB_eQTL,BM_44_MSBB_eQTL
MIGA_eQTL MIGA MiGA_GTS_eQTL,MiGA_SVZ_eQTL,MiGA_THA_eQTL
Analytical Logic#
Enrichment
Identify GWAS Blocks: Select GWAS blocks with top loci from Bellenguez data and using a single variant regression method. Locate the original filenames (
original_data
) for these blocks and map the analysis regions to identify overlapping gene analysis regions and corresponding QTL files with top loci table (condition_top_loci
).Contexts in xQTL Meta-data: Find contexts within the xQTL meta-data that include top loci results for each gene. Retrieve the corresponding original xQTL files (
original_data
) by referencing the context meta-data.Execute xQTL GWAS Enrichment: Perform
xqtl_gwas_enrichment
analysis to generate enrichment parameters (a0, a1) and calculate priors (p1, p2, p12).
Coloc
xQTL Meta-data Contexts: Within the xQTL meta-data, identify contexts that include top loci results for each gene, indicated by
condition_top_loci
. Retrieve the corresponding original xQTL files (original_data
) by referring back to the context meta-data. (Optional) Subset the QTL table with specifies gene list.Original GWAS Files: Identify GWAS blocks that contain overlapping top loci variants for each gene (
block_top_loci
) in above file. Utilize the GWAS list to locate the original filenames (original_data
) for the identified GWAS block candidates.Enrichment and Coloc Analysis: Automatically execute
xqtl_gwas_enrichment
to enable enrichment analysis, generating priors for subsequent coloc analysis through logic defined inenrichment
. Then, applysusie_coloc
for coloc analysis on the selected xQTL and GWAS original files, analyzing each gene under each identified condition.
Output#
Enrichment analysis results — this is a global enrichment estimate that combines all input data for each context.
Colocalization results for regions of interest — if
ld_meta_file_path
is provided, would also report coloc cs by coloc postprocessing.
Minimal Working Example Steps#
Enrichment#
sos run pipeline/SuSiE_enloc.ipynb xqtl_gwas_enrichment \
--gwas-meta-data data/susie_enloc_data/demo_gwas.block_results_db.tsv \
--xqtl-meta-data data/susie_enloc_data/demo_overlap.overlapped.gwas.tsv \
--xqtl_finemapping_obj preset_variants_result susie_result_trimmed \
--xqtl_varname_obj preset_variants_result variant_names \
--gwas_finemapping_obj AD_Bellenguez_2022 RSS_QC_RAISS_imputed susie_result_trimmed \
--gwas_varname_obj AD_Bellenguez_2022 RSS_QC_RAISS_imputed variant_names \
--xqtl_region_obj region_info grange \
--qtl-path data/susie_enloc_data \
--gwas_path data/susie_enloc_data \
--context_meta data/susie_enloc_data/context_meta.tsv \
--cwd output/xqtl_gwas_enrichment
INFO: Running get_analysis_regions: get overlapped regions by gene and block region for enrichment analysis
INFO: get_analysis_regions is completed.
INFO: get_analysis_regions output: regional_data
INFO: Running xqtl_gwas_enrichment:
INFO: xqtl_gwas_enrichment is completed.
INFO: xqtl_gwas_enrichment output: /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/xqtl_gwas_enrichment/demo_overlap.overlapped.demo_gwas.Knight_eQTL_brain.enrichment.rds
INFO: Workflow xqtl_gwas_enrichment (ID=w163a9b2b63d2f227) is executed successfully with 2 completed steps.
Coloc (would trigger enrichment automatically)#
Include the --skip-enrich
to skip the enrichment part. Otherwise, the *enrichment.rds
file from the previous step must be in the location of the --cwd
directory.
sos run pipeline/SuSiE_enloc.ipynb susie_coloc \
--gwas-meta-data data/susie_enloc_data/demo_gwas.block_results_db.tsv \
--xqtl-meta-data data/susie_enloc_data/demo_overlap.overlapped.gwas.tsv \
--xqtl_finemapping_obj preset_variants_result susie_result_trimmed \
--xqtl_varname_obj preset_variants_result variant_names \
--gwas_finemapping_obj AD_Bellenguez_2022 RSS_QC_RAISS_imputed susie_result_trimmed \
--gwas_varname_obj AD_Bellenguez_2022 RSS_QC_RAISS_imputed variant_names \
--xqtl_region_obj region_info grange \
--qtl-path data/susie_enloc_data \
--gwas_path data/susie_enloc_data \
--context_meta data/susie_enloc_data/context_meta.tsv \
--ld_meta_file_path /restricted/projectnb/xqtl/xqtl_protocol/scripts/pixi_scripts/ld_meta_file.tsv \
--cwd output/susie_coloc
INFO: Running get_overlapped_analysis_regions: get overlapped regions from flatten table, to get the regions with overlapped variants and select those regions for coloc analysis
INFO: get_overlapped_analysis_regions is completed.
INFO: get_overlapped_analysis_regions output: overlapped_regional_data
INFO: Running susie_coloc:
INFO: susie_coloc is completed.
INFO: susie_coloc output: /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/susie_coloc/susie_coloc/demo_overlap.overlapped.demo_gwas.Knight_eQTL_brain@ENSG00000142798.coloc.rds
INFO: Workflow susie_coloc (ID=w1fbe3c73c7bfc0b7) is executed successfully with 2 completed steps.
Command interface#
sos run SuSiE_enloc.ipynb -h
usage: sos run /restricted/projectnb/xqtl/xqtl_protocol/xqtl-protocol/code/pecotmr_integration/SuSiE_enloc.ipynb
[workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
get_analysis_regions
get_overlapped_analysis_regions
xqtl_gwas_enrichment
susie_coloc
Global Workflow Options:
--cwd output (as path)
Workdir
--gwas-meta-data VAL (as path, required)
A list of file paths for fine-mapped GWAS results.
--xqtl-meta-data VAL (as path, required)
A list of file paths for fine-mapped xQTL results.
--region-list . (as path)
Optional: if a region list is provide the analysis will be focused on provided region. The LAST column of this list
will contain the ID of regions to focus on Otherwise, all regions with both genotype and phenotype files will be
analyzed
--region-name (as list)
Optional: if a region name is provided the analysis would be focused on the union of provides region list and region
names
--name f"{xqtl_meta_data:bnn}.{gwas_meta_data:bnn}"
It is required to input the name of the analysis
--container ''
--entrypoint ''
--job-size 200 (as int)
For cluster jobs, number commands to run per job
--walltime 5m
Wall clock time expected
--mem 16G
Memory expected: quite large for enrichment analysis but small for xQTL colocalization
--numThreads 1 (as int)
Number of threads
--xqtl-finemapping-obj (as list)
Optional table name in xQTL RDS files to get fine mapping results (eg 'susie_result_trimmed ').
--xqtl-varname-obj (as list)
Optional table name in xQTL RDS files to get variable names (eg ' variant_names ').
--gwas-finemapping-obj (as list)
Optional table name in GWAS RDS files to get fine mapping results (eg 'AD_Bellenguez_2022 single_effect_regression
susie_result_trimmed ').
--gwas-varname-obj (as list)
Optional table name in GWAS RDS files to get variable names(eg 'AD_Bellenguez_2022 single_effect_regression
variant_names ').
--xqtl-region-obj (as list)
Optional table name in xQTL RDS files to get region info (eg 'susie_result_trimmed region_info grange ').
--gwas-region-obj (as list)
Optional table name in GWAS RDS files to get region info (eg 'AD_Bellenguez_2022 single_effect_regression region_info
grange ').
--gwas-path ''
Directory path for GWAS orignal finemapping results
--qtl-path ''
Directory path for xQTL orignal finemapping results
--context-meta . (as path)
a meta file showing the context and corresponding analysis_name
--[no-]minp (default to False)
use conditions_top_loci_minp column or not, which can reduce a lot computing resource by pick top 1 isoform
Sections
get_analysis_regions: get overlapped regions by gene and block region for enrichment analysis
get_overlapped_analysis_regions: get overlapped regions from flatten table, to get the regions with overlapped variants and select those
regions for coloc analysis
xqtl_gwas_enrichment:
susie_coloc:
Workflow Options:
--[no-]skip-enrich (default to False)
depends: sos_step("xqtl_gwas_enrichment") #changed skip enrichment or not
--[no-]filter-lbf-cs (default to False)
filter lbf by cs as default in susie coloc. default is False means filtering by V > prior_tol
--ld-meta-file-path . (as path)
ld reference path for postprocessing
Workflow implementation#
[global]
# Workdir
parameter: cwd = path("output")
# A list of file paths for fine-mapped GWAS results.
parameter: gwas_meta_data = path
# A list of file paths for fine-mapped xQTL results.
parameter: xqtl_meta_data = path
# Optional: if a region list is provide the enrichment analysis will be focused on provided region.
# The LAST column of this list will contain the ID of regions to focus on
parameter: region_list = path()
# Optional: if a region name is provided
# the analysis would be focused on the union of provides region list and region names
parameter: region_name = []
# It is required to input the name of the analysis
parameter: name = f"{xqtl_meta_data:bnn}.{gwas_meta_data:bnn}"
parameter: container = ""
import re
parameter: entrypoint= ""
# For cluster jobs, number commands to run per job
parameter: job_size = 200
# Wall clock time expected
parameter: walltime = "5m"
# Memory expected: quite large for enrichment analysis but small for xQTL colocalization
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 1
#Optional table name in xQTL RDS files to get fine mapping results (eg 'susie_result_trimmed ').
parameter: xqtl_finemapping_obj = []
#Optional table name in xQTL RDS files to get variable names (eg ' variant_names ').
parameter: xqtl_varname_obj = []
#Optional table name in GWAS RDS files to get fine mapping results (eg 'AD_Bellenguez_2022 single_effect_regression susie_result_trimmed ').
parameter: gwas_finemapping_obj = []
#Optional table name in GWAS RDS files to get variable names(eg 'AD_Bellenguez_2022 single_effect_regression variant_names ').
parameter: gwas_varname_obj = []
#Optional table name in xQTL RDS files to get region info (eg 'susie_result_trimmed region_info grange ').
parameter: xqtl_region_obj = []
#Optional table name in GWAS RDS files to get region info (eg 'AD_Bellenguez_2022 single_effect_regression region_info grange ').
parameter: gwas_region_obj = []
#Directory path for GWAS orignal finemapping results
parameter: gwas_path = ''
#Directory path for xQTL orignal finemapping results
parameter: qtl_path = ''
# a meta file showing the context and corresponding analysis_name
parameter: context_meta = path()
# Optional: if a region list is provide the analysis will be focused on provided region.
# The LAST column of this list will contain the ID of regions to focus on
# Otherwise, all regions with both genotype and phenotype files will be analyzed
parameter: region_list = path()
# Optional: if a region name is provided
# the analysis would be focused on the union of provides region list and region names
parameter: region_name = []
# use conditions_top_loci_minp column or not, which can reduce a lot computing resource by pick top 1 isoform
parameter: minp = False
import os
import pandas as pd
import re
def group_by_region(lst, partition):
# from itertools import accumulate
# partition = [len(x) for x in partition]
# Compute the cumulative sums once
# cumsum_vector = list(accumulate(partition))
# Use slicing based on the cumulative sums
# return [lst[(cumsum_vector[i-1] if i > 0 else 0):cumsum_vector[i]] for i in range(len(partition))]
return partition
def make_unique_data(data):
return ','.join(set(data.split(',')))
def generate_meta_dataframe(meta_data_path):
"""Generate a new long metadata by converting to context:analysis_name (1:1)."""
meta = pd.read_csv(meta_data_path, sep='\t')
new_meta = pd.DataFrame()
contexts = pd.unique(meta['context'].str.split(',', expand=True).stack().str.strip())
for context in contexts:
mask = meta['context'].str.contains(context)
tmp = pd.DataFrame({
'context': [context],
'analysis_name': [','.join(meta.loc[mask, 'analysis_name'])]
})
new_meta = pd.concat([new_meta, tmp], ignore_index=True)
return new_meta
def generate_condition_based_dataframe(xqtl_df, minp):
"""Generate a new long table by converting to condition and region:original_data (1:1)."""
# only consider the QTL contexts that have top loci data
condition_column = 'conditions_top_loci_minp' if minp else 'conditions_top_loci'
conditions = pd.unique(xqtl_df[condition_column].str.split(',', expand=True).stack().str.strip())
new_df = pd.DataFrame()
for condition in conditions:
escaped_condition = re.escape(condition)
mask = xqtl_df[condition_column].str.contains(escaped_condition, case=False, regex=True)
# Iterate through each unique region in the filtered data
unique_regions = pd.unique(xqtl_df.loc[mask, 'region_id'])
for region_id in unique_regions:
region_mask = (xqtl_df['region_id'] == region_id) & mask
tmp = pd.DataFrame({
'condition': [condition],
'region_id': [region_id],
'QTL_original_data': [','.join(xqtl_df.loc[region_mask, 'original_data'])],
'GWAS_original_data': [','.join(xqtl_df.loc[region_mask, 'block_data'])]
})
new_df = pd.concat([new_df, tmp], ignore_index=True)
return new_df
def merge_and_filter_dfs(new_df, new_meta):
"""Merge and filter the dataframes based on condition/context where context is a substring of condition, and analysis_name to pick the corresponding original files."""
# Explode the QTL data into separate rows
new_df = new_df.set_index(['condition', 'region_id', 'GWAS_original_data'])['QTL_original_data'].str.split(',', expand=True).stack().reset_index(name='QTL_original_data').drop('level_3', axis=1)
new_df['analysis_name_prefix'] = new_df['QTL_original_data'].apply(lambda x: x.split('.')[0])
# Create a custom merge logic to match context as a substring of condition
def custom_merge(row, df_meta):
# Filter meta dataframe to find any context that is part of the condition string
sub_df = df_meta[df_meta['context'].apply(lambda x: x in row['condition'])]
if not sub_df.empty:
return sub_df
else:
return pd.DataFrame(columns=df_meta.columns) # Return an empty DataFrame with the same columns if no match found
# Apply custom merge function row-wise and concatenate results
results = [custom_merge(row, new_meta) for index, row in new_df.iterrows()]
merged_df = pd.concat(results, keys=new_df.index).reset_index(level=1, drop=True).join(new_df, how='outer')
# Filter rows where analysis_name_prefix matches analysis_name exactly
filtered_df = merged_df[merged_df['analysis_name_prefix'].str.strip() == merged_df['analysis_name'].str.strip()]
# Drop unnecessary columns and duplicates
filtered_df = filtered_df.drop(columns=['analysis_name_prefix', 'context']).drop_duplicates()
filtered_df['GWAS_original_data'] = filtered_df['GWAS_original_data'].apply(make_unique_data)
return filtered_df
def prepare_final_paths(filtered_df, qtl_path, gwas_path):
"""Prepare final paths for QTL and GWAS original data."""
filtered_df['QTL_original_data'] = filtered_df['QTL_original_data'].apply(
lambda x: ','.join(f"{qtl_path}/{file_name.strip()}" for file_name in x.split(','))
)
filtered_df['GWAS_original_data'] = filtered_df['GWAS_original_data'].apply(
lambda x: ','.join(f"{gwas_path}/{file_name.strip()}" for file_name in x.split(','))
)
return filtered_df
# get overlapped regions by gene and block region for enrichment analysis
[get_analysis_regions: shared = "regional_data"]
import pandas as pd
def process_dataframes_with_toploci(xqtl_meta_data, gwas_meta_data):
"""Process the XQTL and GWAS dataframes."""
xqtl_df = pd.read_csv(xqtl_meta_data, sep="\t")
# filter xQTL data with the ones have overlapped top loci variants with GWAS data
xqtl_df = xqtl_df[xqtl_df['conditions_top_loci_minp' if minp else 'conditions_top_loci'].notna()]
gwas_df = pd.read_csv(gwas_meta_data, sep="\t")
gwas_df = gwas_df[gwas_df['conditions_top_loci'].notna()]
# e.g. gwas_finemapping_obj = ['AD_Bellenguez_2022', 'single_effect_regression', 'susie_result_trimmed']
# filter gwas data with find variants in specified cohort 'AD_Bellenguez_2022' and method 'single_effect_regression'
gwas_finemapping_obj_filtered = [obj for obj in gwas_finemapping_obj if obj != 'susie_result_trimmed']
if len(gwas_finemapping_obj_filtered) > 0:
pattern = '|'.join(gwas_finemapping_obj_filtered)
gwas_df = gwas_df[gwas_df['conditions_top_loci'].str.contains(pattern)]
xqtl_df = overlapped_analysis_region(xqtl_df, gwas_df)
return xqtl_df, gwas_df
def overlapped_analysis_region(xqtl_df, gwas_df):
# Create an empty dataframe to store overlapping xQTL records
overlapped_xqtl_df = pd.DataFrame()
# Iterate over each row in the GWAS dataframe
for index, gwas_row in gwas_df.iterrows():
gwas_chr = gwas_row['#chr']
gwas_start = gwas_row['start']
gwas_end = gwas_row['end']
gwas_block_id = gwas_row['original_data'] # Assuming each GWAS row has a unique block ID
# Filter xQTL dataframe for the same chromosome
same_chr_xqtl_df = xqtl_df[xqtl_df['#chr'] == gwas_chr]
# Find overlapping xQTL regions with the current GWAS region
overlapping_xqtl = same_chr_xqtl_df[((same_chr_xqtl_df['start'] <= gwas_end) &
(same_chr_xqtl_df['end'] >= gwas_start))].copy() # Add .copy() to avoid SettingWithCopyWarning
# Check if there are any overlapping xQTLs
if not overlapping_xqtl.empty:
# Use .loc to safely assign 'block_data' to avoid the SettingWithCopyWarning
overlapping_xqtl.loc[:, 'block_data'] = gwas_block_id
# Append the overlapping xQTLs to the accumulated dataframe
overlapped_xqtl_df = pd.concat([overlapped_xqtl_df, overlapping_xqtl], ignore_index=True)
# Group by xQTL region and concatenate 'block_data'
overlapped_xqtl_df['block_data'] = overlapped_xqtl_df.groupby(['#chr', 'start', 'end'])['block_data'].transform(lambda x: ','.join(x))
overlapped_xqtl_df = overlapped_xqtl_df.drop_duplicates().reset_index(drop=True)
return overlapped_xqtl_df
xqtl_df, gwas_df = process_dataframes_with_toploci(xqtl_meta_data, gwas_meta_data)
region_ids=[]
# If region_list is provided, read the file and extract IDs
if not region_list.is_dir():
if region_list.is_file():
region_list_df = pd.read_csv(region_list, sep='\t', header=None, comment = "#")
region_ids = region_list_df.iloc[:, -1].unique() # Extracting the last column for IDs
else:
raise ValueError("The region_list path provided is not a file.")
if len(region_name) > 0:
region_ids = list(set(region_ids).union(set(region_name)))
if len(region_ids) > 0:
xqtl_df = xqtl_df[xqtl_df['region_id'].isin(region_ids)]
new_df = generate_condition_based_dataframe(xqtl_df, minp)
# if there is only one original data file, we don't have to map to context meta
if (new_df['QTL_original_data'].str.split(',').apply(lambda x: len(x) in [0, 1]).all() and not context_meta.is_file()):
filtered_df = new_df.copy()
filtered_df['GWAS_original_data'] = filtered_df['GWAS_original_data'].apply(make_unique_data)
else:
filtered_df = merge_and_filter_dfs(new_df, generate_meta_dataframe(context_meta))
filtered_df = prepare_final_paths(filtered_df, qtl_path, gwas_path)
filtered_df = filtered_df.groupby('condition').agg({
'GWAS_original_data': lambda x: ','.join(x),
'QTL_original_data': lambda x: ','.join(x)
}).reset_index()
regional_data = {
'data': [row['QTL_original_data'].split(',') for _, row in filtered_df.iterrows()],
'conditions': [(f"{row['condition']}", *row['GWAS_original_data'].split(',')) for _, row in filtered_df.iterrows()]
}
# get overlapped regions from flatten table, to get the regions with overlapped variants and select those regions for coloc analysis
[get_overlapped_analysis_regions: shared = "overlapped_regional_data"]
import pandas as pd
def map_blocks_to_data(blocks, mapping_dict):
"""Map blocks to data using a provided mapping dictionary."""
mapped_data = ','.join(mapping_dict.get(block.strip(), 'NA') for block in blocks.split(','))
return mapped_data
def process_dataframes(xqtl_meta_data, gwas_meta_data):
"""Process the XQTL and GWAS dataframes."""
xqtl_df = pd.read_csv(xqtl_meta_data, sep="\t")
# filter xQTL data with the ones have overlapped top loci variants with GWAS data
xqtl_df = xqtl_df[xqtl_df['block_top_loci'].notna()]
gwas_df = pd.read_csv(gwas_meta_data, sep="\t")
gwas_df = gwas_df[gwas_df['conditions_top_loci'].notna()]
# e.g. gwas_finemapping_obj = ['AD_Bellenguez_2022', 'single_effect_regression', 'RSS_QC_RAISS_imputed']
# filter gwas data with find variants in specified cohort 'AD_Bellenguez_2022' and method 'RSS_QC_RAISS_imputed'
gwas_finemapping_obj_filtered = [obj for obj in gwas_finemapping_obj if obj != 'susie_result_trimmed' and obj != 'RSS_QC_RAISS_imputed']
if len(gwas_finemapping_obj_filtered) > 0:
pattern = '|'.join(gwas_finemapping_obj_filtered)
gwas_df = gwas_df[gwas_df['conditions_top_loci'].str.contains(pattern)]
gwas_df = gwas_df[gwas_df['region_id'].isin(xqtl_df['block_top_loci'].str.split(',').explode().unique())]
region_to_combined_data = dict(zip(gwas_df['region_id'], gwas_df['original_data']))
# and only consider the overlapped GWAS blocks
xqtl_df['block_data'] = xqtl_df['block_top_loci'].apply(map_blocks_to_data, args=(region_to_combined_data,))
xqtl_df['block_data'] = xqtl_df['block_data'].str.replace(',NA', '').str.replace('NA,', '')
return xqtl_df, gwas_df
xqtl_df, gwas_df = process_dataframes(xqtl_meta_data, gwas_meta_data)
region_ids=[]
# If region_list is provided, read the file and extract IDs
if not region_list.is_dir():
if region_list.is_file():
region_list_df = pd.read_csv(region_list, sep='\t', header=None, comment = "#")
region_ids = region_list_df.iloc[:, -1].unique() # Extracting the last column for IDs
else:
raise ValueError("The region_list path provided is not a file.")
if len(region_name) > 0:
region_ids = list(set(region_ids).union(set(region_name)))
if len(region_ids) > 0:
xqtl_df = xqtl_df[xqtl_df['region_id'].isin(region_ids)]
new_df = generate_condition_based_dataframe(xqtl_df, minp)
# if there is only one original data file, we don't have to map to context meta
if (new_df['QTL_original_data'].str.split(',').apply(lambda x: len(x) in [0, 1]).all() and not context_meta.is_file()):
filtered_df = new_df.copy()
filtered_df['GWAS_original_data'] = filtered_df['GWAS_original_data'].apply(make_unique_data)
else:
filtered_df = merge_and_filter_dfs(new_df, generate_meta_dataframe(context_meta))
filtered_df = prepare_final_paths(filtered_df, qtl_path, gwas_path)
overlapped_regional_data = {
'data': [row['QTL_original_data'].split(',') for _, row in filtered_df.iterrows()],
'conditions': [(f"{row['condition']}@{row['region_id']}", *row['GWAS_original_data'].split(',')) for _, row in filtered_df.iterrows()]
}
[xqtl_gwas_enrichment]
depends: sos_variable("regional_data")
stop_if(len(regional_data['data']) == 0, f'No files left for analysis')
meta = regional_data['conditions']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta"
output: f'{cwd:a}/{name}.{_meta[0]}.enrichment.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
library(tidyverse)
library(pecotmr)
# RDS files for GWAS data
gwas_finemapped_data = c(${paths([x for x in _meta[1:len(_meta)]]):r,})
# RDS files for xQTL data
xqtl_finemapped_data = c(${paths([x for x in _input]):r,})
enrich_result = xqtl_enrichment_wrapper(gwas_files = gwas_finemapped_data, xqtl_files = xqtl_finemapped_data,
xqtl_finemapping_obj = c("${_meta[0]}",${",".join(['"%s"' % x for x in xqtl_finemapping_obj]) if len(xqtl_finemapping_obj) != 0 else "NULL"}),
xqtl_varname_obj = c("${_meta[0]}",${",".join(['"%s"' % x for x in xqtl_varname_obj]) if len(xqtl_varname_obj) != 0 else "NULL"}),
gwas_finemapping_obj = c(${",".join(['"%s"' % x for x in gwas_finemapping_obj]) if len(gwas_finemapping_obj) != 0 else "NULL"}),
gwas_varname_obj = c(${",".join(['"%s"' % x for x in gwas_varname_obj]) if len(gwas_varname_obj) != 0 else "NULL"}))
saveRDS(enrich_result, ${_output:ar})
[susie_coloc]
depends: sos_variable("overlapped_regional_data")
# depends: sos_step("xqtl_gwas_enrichment") #changed
stop_if(len(overlapped_regional_data['data']) == 0, f'No files left for analysis')
# skip enrichment or not
parameter: skip_enrich=False
# filter lbf by cs as default in susie coloc. default is False means filtering by V > prior_tol
parameter: filter_lbf_cs = False
# ld reference path for postprocessing
parameter: ld_meta_file_path = path()
meta = overlapped_regional_data['conditions']
input: overlapped_regional_data["data"], group_by = lambda x: group_by_region(x, overlapped_regional_data["data"]), group_with = "meta"
# output: f'{cwd:a}/{step_name[:-2]}/{name}.{_meta[0]}.coloc.rds'
output: f'{cwd:a}/{step_name}/{name}.{_meta[0]}.coloc.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
library(tidyverse)
library(pecotmr)
library(coloc)
# RDS files for xQTL data
# RDS files for xQTL data
xqtl_finemapped_datas = c(${paths([x for x in _input]):r,})
coloc_res <- list()
# are we doing something like this? that means results are by each context, and each one has
for(xqtl_finemapped_data in xqtl_finemapped_datas){
qtl_dat <- readRDS(xqtl_finemapped_data)
gene = names(qtl_dat)
context = "${_meta[0]}" %>% str_split(.,"@",simplify = T) %>% .[,1]
gene_region = pecotmr:::get_nested_element(qtl_dat[[1]], c(context,"region_info", "grange"))
# Step 1: find relevant GWAS regions that overlap each the xQTL region of interest
gwas_finemapped_datas <- c(${ ",".join(['"%s"' % x for x in _meta[1:]])})
gwas_finemapped_datas <- gwas_finemapped_datas[grep('rds$',gwas_finemapped_datas)]
gwas_regions <- gwas_finemapped_datas %>% basename %>% str_extract(., "chr\\d+_\\d+_\\d+")
overlap_index <- NULL
for (i in 1:length(gwas_regions)) {
region <- gwas_regions[i]
split_region <- unlist(strsplit(region, "_"))
block_chrom <- as.numeric(split_region[1] %>% gsub("chr","",.))
block_start <- as.numeric(split_region[2] %>% strsplit(., "-") %>% unlist %>% .[1])
block_end <- as.numeric(split_region[2] %>% strsplit(., "-") %>% unlist %>% .[2])
if (gene_region$chrom == block_chrom && (gene_region$start <= block_end | gene_region$end >= block_start)) {
overlap_index <- c(overlap_index, i)
}
}
if (!is.null(overlap_index)) {
gwas_finemapped_data <- gwas_finemapped_datas[overlap_index]
# Step 2: load enrichment analysis results
# Extract values for p1, p2, and p12
if(${"FALSE" if skip_enrich else "TRUE"}){
enrich_file = paste0('${cwd:a}','/', '${name}', '.' ,context,'.enrichment.rds')
p1 <- readRDS(enrich_file)[[1]]$`Alternative (coloc) p1`
p2 <- readRDS(enrich_file)[[1]]$`Alternative (coloc) p2`
p12 <- readRDS(enrich_file)[[1]]$`Alternative (coloc) p12`
message("Priors are P1:", p1, "; p2: ", p2, "; p12: ", p12)
} else {
p1 = 1e-4
p2 = 1e-4
p12 = 5e-6
message("Priors are using default, P1: 1e-4, P2: 1e-4, P12: 5e-6" )
}
# Step 3: Apply colocalization analysis between each condition and GWAS
coloc_res[[gene]] <- coloc_wrapper(xqtl_file = xqtl_finemapped_data, gwas_files = gwas_finemapped_data,
xqtl_finemapping_obj = c(context,${",".join(['"%s"' % x for x in xqtl_finemapping_obj]) if len(xqtl_finemapping_obj) != 0 else "NULL"}),
xqtl_varname_obj = c(context,${",".join(['"%s"' % x for x in xqtl_varname_obj]) if len(xqtl_varname_obj) != 0 else "NULL"}),
gwas_finemapping_obj = c(${",".join(['"%s"' % x for x in gwas_finemapping_obj]) if len(gwas_finemapping_obj) != 0 else "NULL"}),
gwas_varname_obj = c(${",".join(['"%s"' % x for x in gwas_varname_obj]) if len(gwas_varname_obj) != 0 else "NULL"}),
xqtl_region_obj = c(context,${",".join(['"%s"' % x for x in xqtl_region_obj]) if len(xqtl_region_obj) != 0 else "NULL"}),
gwas_region_obj = c(${",".join(['"%s"' % x for x in gwas_region_obj]) if len(gwas_region_obj) != 0 else "NULL"}),
p1 = p1, p2 = p2, p12 = p12, filter_lbf_cs = ${"FALSE" if skip_enrich else "TRUE"})
if (${"TRUE" if ld_meta_file_path.is_file() else "FALSE"}) {
if(length(coloc_res[[gene]]) > 2) {
coloc_res[[gene]] <- coloc_post_processor(coloc_res[[gene]], LD_meta_file_path = ${ld_meta_file_path:r}, analysis_region = coloc_res[[gene]]$analysis_region)
if(!is.null(coloc_res[[gene]]$sets$cs)) writeLines(coloc_res[[gene]]$sets$cs %>% unlist, gsub("rds$","coloc_res","${_output}"))
} else { message("No coloc results were generated.") }
}
} else {
message("No overlap was found between GWAS blocks and QTL analysis region.")
coloc_res <- "No overlap was found between GWAS blocks and QTL analysis region."
}
}
saveRDS(coloc_res, ${_output:r})