xQTL-GWAS pairwise enrichment and colocalization#

Description#

This workflow takes fine-mapping results for xQTL, generated by susie_twas in the mnm_regression.ipynb notebook for cis-xQTL, together with GWAS fine-mapping results produced by susie_rss in the rss_analysis.ipynb notebook, and performs enrichment and colocalization analysis. It is designed for the situation where the fine-mapped regions differ between the two inputs, as is typical for cis-xQTL versus GWAS, and it integrates and analyzes the data across these distinct regions. Although originally tailored for cis-xQTL and GWAS integration, the pipeline can be applied to other pairwise integrations; for example, in trans analysis the fine-mapped regions may be identical between trans-xQTL and GWAS, which is simply a special case of this broader implementation.

Input#

Lists of SuSiE fine-mapping output objects in RDS format (objects of class susie in R).

  • For xQTL, the list is a metadata table with columns chr, start, end, original_data, conditions_top_loci, block_top_loci, where original_data is an RDS file, conditions_top_loci lists the contexts that have a top loci table (potential signals), and block_top_loci lists the GWAS blocks whose top loci variants overlap with the xQTL data. This file can be produced by fine_mapping_post_processing/overlap_qtl_gwas.

  • For GWAS, the list is a metadata table with columns chr, start, end, original_data, conditions_top_loci, ... where original_data is an RDS file. This file can be produced by fine_mapping_post_processing/gwas_results_export.

  • Context meta is a metadata file listing each analysis_name and the contexts it contains. It is used to identify the corresponding raw data file for each context. A short sanity-check sketch covering all three meta files follows the example tables below.

  • --gwas_meta_data
    output from fine_mapping_post_processing/gwas_results_export

#chr	start	end	region_id	TSS	original_data	combined_data	combined_data_sumstats	conditions	conditions_top_loci
chr1	101384274	104443097	1_101384274-104443097	NA	1_101384274-104443097.susie_rss.rds	gwas.1_101384274-104443097.cis_results_db.export.rds	gwas.1_101384274-104443097.cis_results_db.export_sumstats.rds	NA	NA
chr19	44935906	46842901	19_44935906-46842901	NA	19_44935906-46842901.susie_rss.rds	gwas.19_44935906-46842901.cis_results_db.export.rds	gwas.19_44935906-46842901.cis_results_db.export_sumstats.rds	AD_Bellenguez_2022,AD_Jansen_2021,AD_Kunkle_Stage1_2019,AD_Wightman_Full_2021,AD_Wightman_Excluding23andMe_2021,AD_Wightman_ExcludingUKBand23andME_2021	AD_Wightman_ExcludingUKBand23andME_2021.qc_impute,AD_Wightman_ExcludingUKBand23andME_2021.qc_only	AD_Wightman_ExcludingUKBand23andME_2021.qc_impute
  • --xqtl_meta_data

For enrichment: output from fine_mapping_post_processing/cis_results_export. For coloc: output from fine_mapping_post_processing/overlap_qtl_gwas. The difference between the two is the presence of the last two columns, block_top_loci and final_combined_data, which document the overlapping block information at the variant level.

#chr	start	end	region_id	TSS	original_data	combined_data	combined_data_sumstats	conditions	conditions_top_loci	block_top_loci	final_combined_data
chr1	167600000	171480000	ENSG00000000457	169894266	MSBB_eQTL.ENSG00000000457.univariate_susie_twas_weights.rds, ROSMAP_Kellis_eQTL.ENSG00000000457.univariate_susie_twas_weights.rds	Fungen_xQTL.ENSG00000000457.cis_results_db.export.rds	Fungen_xQTL.ENSG00000000457.cis_results_db.export_sumstats.rds	BM_10_MSBB_eQTL,BM_22_MSBB_eQTL,BM_36_MSBB_eQTL,BM_44_MSBB_eQTL,Ast_Kellis_eQTL	BM_22_MSBB_eQTL,BM_44_MSBB_eQTL		
chr19	41840000	47960000	ENSG00000130203	44905790	ROSMAP_Kellis_eQTL.ENSG00000130203.univariate_susie_twas_weights.rds, ROSMAP_Bennett_Klein_pQTL.ENSG00000130203.univariate_susie_twas_weights.rds	Fungen_xQTL.ENSG00000130203.cis_results_db.export.rds	Fungen_xQTL.ENSG00000130203.cis_results_db.export_sumstats.rds	Mic_Kellis_eQTL,DLPFC_Bennett_pQTL	Mic_Kellis_eQTL	19_44935906-46842901	Fungen_xQTL.ENSG00000130203.cis_results_db.export.overlapped.gwas.rds
chr20	49934867	53560000	ENSG00000000419	50958554	MSBB_eQTL.ENSG00000000419.univariate_susie_twas_weights.rds	Fungen_xQTL.ENSG00000000419.cis_results_db.export.rds	Fungen_xQTL.ENSG00000000419.cis_results_db.export_sumstats.rds	BM_10_MSBB_eQTL,BM_22_MSBB_eQTL	BM_22_MSBB_eQTL	Fungen_xQTL.ENSG00000000419.cis_results_db.export.overlapped.gwas.rds
  • --context_meta

This file should be updated as more analyses are completed; only the analysis_name and context columns are used in this analysis.

analysis_name	cohort	context
KNIGHT_pQTL	KNIGHT	Knight_pQTL_brain,Knight_eQTL_brain
MSBB_eQTL	MSBB	BM_10_MSBB_eQTL,BM_22_MSBB_eQTL,BM_36_MSBB_eQTL,BM_44_MSBB_eQTL
MIGA_eQTL	MIGA	MiGA_GTS_eQTL,MiGA_SVZ_eQTL,MiGA_THA_eQTL
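
Before running the workflow, it can help to confirm that the three meta files contain the columns the pipeline relies on. Below is a minimal sketch in R, assuming the demo file paths used in the Minimal Working Example; adjust the paths for your own data.

# Read the three meta files (check.names = FALSE keeps the "#chr" column name intact)
gwas_meta    <- read.delim("data/susie_enloc_data/demo_gwas.block_results_db.tsv", check.names = FALSE)
xqtl_meta    <- read.delim("data/susie_enloc_data/demo_overlap.overlapped.gwas.tsv", check.names = FALSE)
context_meta <- read.delim("data/susie_enloc_data/context_meta.tsv", check.names = FALSE)

# Columns used by the pipeline
stopifnot(all(c("#chr", "start", "end", "region_id", "original_data", "conditions_top_loci") %in% colnames(gwas_meta)))
stopifnot(all(c("#chr", "start", "end", "region_id", "original_data", "conditions_top_loci",
                "block_top_loci") %in% colnames(xqtl_meta)))  # block_top_loci is only needed for coloc
stopifnot(all(c("analysis_name", "context") %in% colnames(context_meta)))

# xQTL regions with a non-empty block_top_loci are the candidates for coloc
sum(!is.na(xqtl_meta$block_top_loci) & xqtl_meta$block_top_loci != "")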

Analytical Logic#

  • Enrichment

  1. Identify GWAS Blocks: Select GWAS blocks that have top loci from the Bellenguez data using a single-variant regression method. Locate the original file names (original_data) for these blocks, then map the analysis regions to identify overlapping gene analysis regions and the corresponding QTL files that have a top loci table (conditions_top_loci).

  2. Contexts in xQTL Meta-data: Find contexts within the xQTL meta-data that include top loci results for each gene. Retrieve the corresponding original xQTL files (original_data) by referencing the context meta-data.

  3. Execute xQTL GWAS Enrichment: Perform xqtl_gwas_enrichment analysis to generate enrichment parameters (a0, a1) and calculate priors (p1, p2, p12).

  • Coloc

  1. xQTL Meta-data Contexts: Within the xQTL meta-data, identify contexts that include top loci results for each gene, as indicated by conditions_top_loci. Retrieve the corresponding original xQTL files (original_data) by referring back to the context meta-data. (Optional) Subset the QTL table with a specified gene list.

  2. Original GWAS Files: Identify the GWAS blocks that contain overlapping top loci variants for each gene (block_top_loci) in the file above. Use the GWAS list to locate the original file names (original_data) for the identified GWAS block candidates.

  3. Enrichment and Coloc Analysis: Automatically execute xqtl_gwas_enrichment to perform the enrichment analysis, generating priors for the subsequent coloc analysis through the logic described under Enrichment. Then apply susie_coloc to the selected xQTL and GWAS original files, analyzing each gene under each identified condition. A minimal R sketch of this enrichment-to-coloc flow follows below.
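
The sketch below illustrates, at the R level, what these steps automate. The wrapper functions and their arguments are the ones called in the workflow implementation further down (xqtl_enrichment_wrapper and coloc_wrapper from pecotmr); the RDS file names, the Knight_eQTL_brain context, and the AD_Bellenguez_2022 GWAS labels are placeholders taken from the demo and should be replaced with entries from your own meta files.

library(pecotmr)

# Hypothetical placeholder file names; in the workflow these come from the original_data columns
xqtl_files <- c("data/susie_enloc_data/KNIGHT_pQTL.ENSG00000142798.univariate_susie_twas_weights.rds")
gwas_files <- c("data/susie_enloc_data/chr1_101384274_104443097.susie_rss.rds")

# Enrichment: estimate the enrichment parameters (a0, a1) across matched xQTL/GWAS pairs and derive coloc priors
enrich_result <- xqtl_enrichment_wrapper(
    gwas_files = gwas_files, xqtl_files = xqtl_files,
    xqtl_finemapping_obj = c("Knight_eQTL_brain", "preset_variants_result", "susie_result_trimmed"),
    xqtl_varname_obj     = c("Knight_eQTL_brain", "preset_variants_result", "variant_names"),
    gwas_finemapping_obj = c("AD_Bellenguez_2022", "RSS_QC_RAISS_imputed", "susie_result_trimmed"),
    gwas_varname_obj     = c("AD_Bellenguez_2022", "RSS_QC_RAISS_imputed", "variant_names"))
p1  <- enrich_result[[1]]$`Alternative (coloc) p1`
p2  <- enrich_result[[1]]$`Alternative (coloc) p2`
p12 <- enrich_result[[1]]$`Alternative (coloc) p12`

# Coloc: colocalize one gene/context against its overlapping GWAS block(s) using the enrichment-informed priors
coloc_res <- coloc_wrapper(
    xqtl_file = xqtl_files[1], gwas_files = gwas_files,
    xqtl_finemapping_obj = c("Knight_eQTL_brain", "preset_variants_result", "susie_result_trimmed"),
    xqtl_varname_obj     = c("Knight_eQTL_brain", "preset_variants_result", "variant_names"),
    gwas_finemapping_obj = c("AD_Bellenguez_2022", "RSS_QC_RAISS_imputed", "susie_result_trimmed"),
    gwas_varname_obj     = c("AD_Bellenguez_2022", "RSS_QC_RAISS_imputed", "variant_names"),
    p1 = p1, p2 = p2, p12 = p12)

When the enrichment step is skipped (--skip-enrich), the workflow falls back to the default priors p1 = 1e-4, p2 = 1e-4, p12 = 5e-6.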

Output#

  1. Enrichment analysis results — this is a global enrichment estimate that combines all input data for each context.

  2. Colocalization results for regions of interest. If ld_meta_file_path is provided, coloc credible sets (cs) are also reported by the coloc post-processing step. A short sketch for loading both outputs follows below.
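
Both outputs are standard RDS files. The sketch below shows one way to inspect them in R; the file paths correspond to the demo run in the Minimal Working Example below, and the field names match what this workflow writes.

# Enrichment result: the coloc priors used downstream by susie_coloc
enrich <- readRDS("output/xqtl_gwas_enrichment/demo_overlap.overlapped.demo_gwas.Knight_eQTL_brain.enrichment.rds")
enrich[[1]]$`Alternative (coloc) p1`
enrich[[1]]$`Alternative (coloc) p2`
enrich[[1]]$`Alternative (coloc) p12`

# Colocalization result: a list with one element per gene for the given context
coloc <- readRDS("output/susie_coloc/susie_coloc/demo_overlap.overlapped.demo_gwas.Knight_eQTL_brain@ENSG00000142798.coloc.rds")
names(coloc)                          # e.g. "ENSG00000142798"
str(coloc[["ENSG00000142798"]], max.level = 1)
coloc[["ENSG00000142798"]]$sets$cs    # coloc credible sets, populated only when --ld_meta_file_path was supplied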

Minimal Working Example Steps#

Enrichment#

sos run pipeline/SuSiE_enloc.ipynb xqtl_gwas_enrichment    \
    --gwas-meta-data data/susie_enloc_data/demo_gwas.block_results_db.tsv \
    --xqtl-meta-data data/susie_enloc_data/demo_overlap.overlapped.gwas.tsv \
    --xqtl_finemapping_obj preset_variants_result susie_result_trimmed  \
    --xqtl_varname_obj preset_variants_result variant_names  \
    --gwas_finemapping_obj AD_Bellenguez_2022 RSS_QC_RAISS_imputed susie_result_trimmed \
    --gwas_varname_obj  AD_Bellenguez_2022 RSS_QC_RAISS_imputed variant_names \
    --xqtl_region_obj  region_info   grange \
    --qtl-path data/susie_enloc_data \
    --gwas_path data/susie_enloc_data \
    --context_meta data/susie_enloc_data/context_meta.tsv \
    --cwd output/xqtl_gwas_enrichment
INFO: Running get_analysis_regions: get overlapped regions by gene and block region for enrichment analysis
INFO: get_analysis_regions is completed.
INFO: get_analysis_regions output:   regional_data
INFO: Running xqtl_gwas_enrichment: 
INFO: xqtl_gwas_enrichment is completed.
INFO: xqtl_gwas_enrichment output:   /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/xqtl_gwas_enrichment/demo_overlap.overlapped.demo_gwas.Knight_eQTL_brain.enrichment.rds
INFO: Workflow xqtl_gwas_enrichment (ID=w163a9b2b63d2f227) is executed successfully with 2 completed steps.

Coloc (triggers enrichment automatically)#

Include the --skip-enrich flag to skip the enrichment step. Otherwise, the *.enrichment.rds file from the previous step must be present in the --cwd directory.

sos run pipeline/SuSiE_enloc.ipynb susie_coloc \
    --gwas-meta-data data/susie_enloc_data/demo_gwas.block_results_db.tsv \
    --xqtl-meta-data data/susie_enloc_data/demo_overlap.overlapped.gwas.tsv \
    --xqtl_finemapping_obj preset_variants_result susie_result_trimmed \
    --xqtl_varname_obj preset_variants_result variant_names \
    --gwas_finemapping_obj AD_Bellenguez_2022 RSS_QC_RAISS_imputed susie_result_trimmed \
    --gwas_varname_obj  AD_Bellenguez_2022 RSS_QC_RAISS_imputed variant_names \
    --xqtl_region_obj  region_info grange \
    --qtl-path data/susie_enloc_data \
    --gwas_path data/susie_enloc_data \
    --context_meta data/susie_enloc_data/context_meta.tsv \
    --ld_meta_file_path /restricted/projectnb/xqtl/xqtl_protocol/scripts/pixi_scripts/ld_meta_file.tsv \
    --cwd output/susie_coloc
INFO: Running get_overlapped_analysis_regions: get overlapped regions from flatten table, to get the regions with overlapped variants and select those regions for coloc analysis
INFO: get_overlapped_analysis_regions is completed.
INFO: get_overlapped_analysis_regions output:   overlapped_regional_data
INFO: Running susie_coloc: 
INFO: susie_coloc is completed.
INFO: susie_coloc output:   /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/susie_coloc/susie_coloc/demo_overlap.overlapped.demo_gwas.Knight_eQTL_brain@ENSG00000142798.coloc.rds
INFO: Workflow susie_coloc (ID=w1fbe3c73c7bfc0b7) is executed successfully with 2 completed steps.

Command interface#

sos run SuSiE_enloc.ipynb -h
usage: sos run /restricted/projectnb/xqtl/xqtl_protocol/xqtl-protocol/code/pecotmr_integration/SuSiE_enloc.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  get_analysis_regions
  get_overlapped_analysis_regions
  xqtl_gwas_enrichment
  susie_coloc

Global Workflow Options:
  --cwd output (as path)
                        Workdir
  --gwas-meta-data VAL (as path, required)
                        A list of file paths for fine-mapped GWAS results.
  --xqtl-meta-data VAL (as path, required)
                        A list of file paths for fine-mapped xQTL results.
  --region-list . (as path)
                        Optional: if a region list is provided, the analysis will be focused on the provided regions. The LAST column of this
                        list contains the IDs of regions to focus on. Otherwise, all regions with both genotype and phenotype files will be
                        analyzed
  --region-name  (as list)
                        Optional: if region names are provided, the analysis will be focused on the union of the provided region list and
                        region names
  --name  f"{xqtl_meta_data:bnn}.{gwas_meta_data:bnn}"

                        Name of the analysis; defaults to a combination of the xQTL and GWAS meta file basenames
  --container ''
  --entrypoint ''
  --job-size 200 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5m
                        Wall clock time expected
  --mem 16G
                        Memory expected: quite large for enrichment analysis but small for xQTL colocalization
  --numThreads 1 (as int)
                        Number of threads
  --xqtl-finemapping-obj  (as list)
                        Optional table name in xQTL RDS files to get fine mapping results (eg 'susie_result_trimmed ').
  --xqtl-varname-obj  (as list)
                        Optional table name in xQTL RDS files to get variable names (eg ' variant_names ').
  --gwas-finemapping-obj  (as list)
                        Optional table name in GWAS RDS files to get fine mapping results (eg 'AD_Bellenguez_2022 single_effect_regression
                        susie_result_trimmed ').
  --gwas-varname-obj  (as list)
                        Optional table name in GWAS RDS files to get variable names(eg 'AD_Bellenguez_2022 single_effect_regression
                        variant_names ').
  --xqtl-region-obj  (as list)
                        Optional table name in xQTL RDS files to get region info (eg 'susie_result_trimmed region_info grange ').
  --gwas-region-obj  (as list)
                        Optional table name in GWAS RDS files to get region info (eg 'AD_Bellenguez_2022 single_effect_regression region_info
                        grange ').
  --gwas-path ''
                        Directory path for GWAS original finemapping results
  --qtl-path ''
                        Directory path for xQTL original finemapping results
  --context-meta . (as path)
                        A meta file showing each context and its corresponding analysis_name
  --[no-]minp (default to False)
                        Whether to use the conditions_top_loci_minp column; this can greatly reduce computing resources by picking only the
                        top isoform

Sections
  get_analysis_regions: get overlapped regions by gene and block region for enrichment analysis
  get_overlapped_analysis_regions: get overlapped regions from flatten table, to get the regions with overlapped variants and select those
                        regions for coloc analysis
  xqtl_gwas_enrichment:
  susie_coloc:
    Workflow Options:
      --[no-]skip-enrich (default to False)
                        Skip the enrichment step or not
      --[no-]filter-lbf-cs (default to False)
                        Filter lbf by credible sets (cs) as in default SuSiE coloc; when False (the default), filter by V > prior_tol instead
      --ld-meta-file-path . (as path)
                        ld reference path for postprocessing

Workflow implementation#

[global]
# Workdir
parameter: cwd = path("output")
# A list of file paths for fine-mapped GWAS results. 
parameter: gwas_meta_data = path
# A list of file paths for fine-mapped xQTL results. 
parameter: xqtl_meta_data = path
# Optional: if a region list is provided, the analysis will be focused on the provided regions.
# The LAST column of this list contains the IDs of regions to focus on.
# Otherwise, all regions with both genotype and phenotype files will be analyzed.
parameter: region_list = path()
# Optional: if region names are provided,
# the analysis will be focused on the union of the provided region list and region names
parameter: region_name = []
# Name of the analysis; defaults to a combination of the xQTL and GWAS meta file basenames
parameter: name = f"{xqtl_meta_data:bnn}.{gwas_meta_data:bnn}"
parameter: container = ""
import re
parameter: entrypoint= ""
# For cluster jobs, number commands to run per job
parameter: job_size = 200
# Wall clock time expected
parameter: walltime = "5m"
# Memory expected: quite large for enrichment analysis but small for xQTL colocalization
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 1
#Optional table name in xQTL RDS files to get fine mapping results (eg 'susie_result_trimmed ').
parameter: xqtl_finemapping_obj = []
#Optional table name in xQTL RDS files to get variable names (eg ' variant_names ').
parameter: xqtl_varname_obj = []
#Optional table name in GWAS RDS files to get fine mapping results (eg 'AD_Bellenguez_2022 single_effect_regression susie_result_trimmed ').
parameter: gwas_finemapping_obj = []
#Optional table name in GWAS RDS files to get variable names(eg 'AD_Bellenguez_2022 single_effect_regression variant_names ').
parameter: gwas_varname_obj = []
#Optional table name in xQTL RDS files to get region info (eg 'susie_result_trimmed region_info grange ').
parameter: xqtl_region_obj = []
#Optional table name in GWAS RDS files to get region info (eg 'AD_Bellenguez_2022 single_effect_regression region_info grange ').
parameter: gwas_region_obj = []
# Directory path for GWAS original finemapping results
parameter: gwas_path = ''
# Directory path for xQTL original finemapping results
parameter: qtl_path = ''
# A meta file showing each context and its corresponding analysis_name
parameter: context_meta = path()
# Whether to use the conditions_top_loci_minp column; this can greatly reduce computing resources by picking only the top isoform
parameter: minp = False
import os
import pandas as pd
import re
def group_by_region(lst, partition):
    """Return the pre-computed grouping of xQTL files (one group of files per condition) as-is."""
    return partition

def make_unique_data(data):
    return ','.join(set(data.split(',')))

def generate_meta_dataframe(meta_data_path):
    """Generate a new long metadata by converting to context:analysis_name (1:1)."""
    meta = pd.read_csv(meta_data_path, sep='\t')
    new_meta = pd.DataFrame()
    contexts = pd.unique(meta['context'].str.split(',', expand=True).stack().str.strip())

    for context in contexts:
        mask = meta['context'].str.contains(context)
        tmp = pd.DataFrame({
            'context': [context],
            'analysis_name': [','.join(meta.loc[mask, 'analysis_name'])]
        })
        new_meta = pd.concat([new_meta, tmp], ignore_index=True)

    return new_meta

def generate_condition_based_dataframe(xqtl_df, minp):
    """Generate a new long table by converting to condition and region:original_data (1:1)."""
    # only consider the QTL contexts that have top loci data
    condition_column = 'conditions_top_loci_minp' if minp else 'conditions_top_loci'
    conditions = pd.unique(xqtl_df[condition_column].str.split(',', expand=True).stack().str.strip())
    new_df = pd.DataFrame()

    for condition in conditions:
        escaped_condition = re.escape(condition)
        mask = xqtl_df[condition_column].str.contains(escaped_condition, case=False, regex=True)
        
        # Iterate through each unique region in the filtered data
        unique_regions = pd.unique(xqtl_df.loc[mask, 'region_id'])
        for region_id in unique_regions:
            region_mask = (xqtl_df['region_id'] == region_id) & mask
            tmp = pd.DataFrame({
                'condition': [condition],
                'region_id': [region_id],
                'QTL_original_data': [','.join(xqtl_df.loc[region_mask, 'original_data'])],
                'GWAS_original_data': [','.join(xqtl_df.loc[region_mask, 'block_data'])]
            })
            new_df = pd.concat([new_df, tmp], ignore_index=True)

    return new_df


def merge_and_filter_dfs(new_df, new_meta):
    """Merge and filter the dataframes based on condition/context where context is a substring of condition, and analysis_name to pick the corresponding original files."""
    # Explode the QTL data into separate rows
    new_df = new_df.set_index(['condition', 'region_id', 'GWAS_original_data'])['QTL_original_data'].str.split(',', expand=True).stack().reset_index(name='QTL_original_data').drop('level_3', axis=1)
    new_df['analysis_name_prefix'] = new_df['QTL_original_data'].apply(lambda x: x.split('.')[0])

    # Create a custom merge logic to match context as a substring of condition
    def custom_merge(row, df_meta):
        # Filter meta dataframe to find any context that is part of the condition string
        sub_df = df_meta[df_meta['context'].apply(lambda x: x in row['condition'])]
        if not sub_df.empty:
            return sub_df
        else:
            return pd.DataFrame(columns=df_meta.columns)  # Return an empty DataFrame with the same columns if no match found

    # Apply custom merge function row-wise and concatenate results
    results = [custom_merge(row, new_meta) for index, row in new_df.iterrows()]
    merged_df = pd.concat(results, keys=new_df.index).reset_index(level=1, drop=True).join(new_df, how='outer')

    # Filter rows where analysis_name_prefix matches analysis_name exactly
    filtered_df = merged_df[merged_df['analysis_name_prefix'].str.strip() == merged_df['analysis_name'].str.strip()]

    # Drop unnecessary columns and duplicates
    filtered_df = filtered_df.drop(columns=['analysis_name_prefix', 'context']).drop_duplicates()
    filtered_df['GWAS_original_data'] = filtered_df['GWAS_original_data'].apply(make_unique_data)

    return filtered_df

def prepare_final_paths(filtered_df, qtl_path, gwas_path):
    """Prepare final paths for QTL and GWAS original data."""
    filtered_df['QTL_original_data'] = filtered_df['QTL_original_data'].apply(
        lambda x: ','.join(f"{qtl_path}/{file_name.strip()}" for file_name in x.split(','))
    )
    filtered_df['GWAS_original_data'] = filtered_df['GWAS_original_data'].apply(
        lambda x: ','.join(f"{gwas_path}/{file_name.strip()}" for file_name in x.split(','))
    )
    return filtered_df
# get overlapped regions by gene and block region for enrichment analysis
[get_analysis_regions: shared = "regional_data"]
import pandas as pd

def process_dataframes_with_toploci(xqtl_meta_data, gwas_meta_data):
    """Process the XQTL and GWAS dataframes."""
    xqtl_df = pd.read_csv(xqtl_meta_data, sep="\t")
    # keep xQTL regions that have a top loci table in at least one context (conditions_top_loci)
    xqtl_df = xqtl_df[xqtl_df['conditions_top_loci_minp' if minp else 'conditions_top_loci'].notna()]

    gwas_df = pd.read_csv(gwas_meta_data, sep="\t")
    gwas_df = gwas_df[gwas_df['conditions_top_loci'].notna()]
    # e.g. gwas_finemapping_obj = ['AD_Bellenguez_2022', 'single_effect_regression', 'susie_result_trimmed']
    # keep GWAS blocks whose top loci were identified under the specified cohort (e.g. 'AD_Bellenguez_2022') and method (e.g. 'single_effect_regression')
    gwas_finemapping_obj_filtered = [obj for obj in gwas_finemapping_obj if obj != 'susie_result_trimmed']
    if len(gwas_finemapping_obj_filtered) > 0:
        pattern = '|'.join(gwas_finemapping_obj_filtered)
        gwas_df = gwas_df[gwas_df['conditions_top_loci'].str.contains(pattern)]

    xqtl_df = overlapped_analysis_region(xqtl_df, gwas_df)
    return xqtl_df, gwas_df



def overlapped_analysis_region(xqtl_df, gwas_df):
    # Create an empty dataframe to store overlapping xQTL records
    overlapped_xqtl_df = pd.DataFrame()
    
    # Iterate over each row in the GWAS dataframe
    for index, gwas_row in gwas_df.iterrows():
        gwas_chr = gwas_row['#chr']
        gwas_start = gwas_row['start']
        gwas_end = gwas_row['end']
        gwas_block_id = gwas_row['original_data']  # Assuming each GWAS row has a unique block ID
        
        # Filter xQTL dataframe for the same chromosome
        same_chr_xqtl_df = xqtl_df[xqtl_df['#chr'] == gwas_chr]
        
        # Find overlapping xQTL regions with the current GWAS region
        overlapping_xqtl = same_chr_xqtl_df[((same_chr_xqtl_df['start'] <= gwas_end) & 
                                             (same_chr_xqtl_df['end'] >= gwas_start))].copy() # Add .copy() to avoid SettingWithCopyWarning
        
        # Check if there are any overlapping xQTLs
        if not overlapping_xqtl.empty:
            # Use .loc to safely assign 'block_data' to avoid the SettingWithCopyWarning
            overlapping_xqtl.loc[:, 'block_data'] = gwas_block_id
            
            # Append the overlapping xQTLs to the accumulated dataframe
            overlapped_xqtl_df = pd.concat([overlapped_xqtl_df, overlapping_xqtl], ignore_index=True)
        
    # Group by xQTL region and concatenate 'block_data'
    overlapped_xqtl_df['block_data'] = overlapped_xqtl_df.groupby(['#chr', 'start', 'end'])['block_data'].transform(lambda x: ','.join(x))
    overlapped_xqtl_df = overlapped_xqtl_df.drop_duplicates().reset_index(drop=True)

    return overlapped_xqtl_df


xqtl_df, gwas_df = process_dataframes_with_toploci(xqtl_meta_data, gwas_meta_data)

region_ids=[]
# If region_list is provided, read the file and extract IDs
if not region_list.is_dir():
    if region_list.is_file():
        region_list_df = pd.read_csv(region_list, sep='\t', header=None, comment = "#")
        region_ids = region_list_df.iloc[:, -1].unique()  # Extracting the last column for IDs
    else:
        raise ValueError("The region_list path provided is not a file.")
        
if len(region_name) > 0:
    region_ids = list(set(region_ids).union(set(region_name)))
    
if len(region_ids) > 0:
    xqtl_df = xqtl_df[xqtl_df['region_id'].isin(region_ids)]
    
new_df = generate_condition_based_dataframe(xqtl_df, minp)

# if there is only one original data file, we don't have to map to context meta
if (new_df['QTL_original_data'].str.split(',').apply(lambda x: len(x) in [0, 1]).all() and not context_meta.is_file()):
    filtered_df = new_df.copy()
    filtered_df['GWAS_original_data'] = filtered_df['GWAS_original_data'].apply(make_unique_data)
else:
    filtered_df = merge_and_filter_dfs(new_df, generate_meta_dataframe(context_meta))

filtered_df = prepare_final_paths(filtered_df, qtl_path, gwas_path)
filtered_df = filtered_df.groupby('condition').agg({
    'GWAS_original_data': lambda x: ','.join(x),
    'QTL_original_data': lambda x: ','.join(x)
}).reset_index()
regional_data = {
    'data': [row['QTL_original_data'].split(',') for _, row in filtered_df.iterrows()],
    'conditions': [(f"{row['condition']}", *row['GWAS_original_data'].split(',')) for _, row in filtered_df.iterrows()]
}
# get overlapped regions from flatten table, to get the regions with overlapped variants and select those regions for coloc analysis
[get_overlapped_analysis_regions: shared = "overlapped_regional_data"]
import pandas as pd
def map_blocks_to_data(blocks, mapping_dict):
    """Map blocks to data using a provided mapping dictionary."""
    mapped_data = ','.join(mapping_dict.get(block.strip(), 'NA') for block in blocks.split(','))
    return mapped_data

def process_dataframes(xqtl_meta_data, gwas_meta_data):
    """Process the XQTL and GWAS dataframes."""
    xqtl_df = pd.read_csv(xqtl_meta_data, sep="\t")
    # keep xQTL regions whose top loci variants overlap with GWAS blocks (block_top_loci)
    xqtl_df = xqtl_df[xqtl_df['block_top_loci'].notna()]

    gwas_df = pd.read_csv(gwas_meta_data, sep="\t")    
    gwas_df = gwas_df[gwas_df['conditions_top_loci'].notna()]
    # e.g. gwas_finemapping_obj = ['AD_Bellenguez_2022', 'single_effect_regression', 'RSS_QC_RAISS_imputed']
    # keep GWAS blocks whose top loci were identified under the specified cohort (e.g. 'AD_Bellenguez_2022') and method (e.g. 'RSS_QC_RAISS_imputed')
    gwas_finemapping_obj_filtered = [obj for obj in gwas_finemapping_obj if obj != 'susie_result_trimmed' and obj != 'RSS_QC_RAISS_imputed']
    if len(gwas_finemapping_obj_filtered) > 0:
        pattern = '|'.join(gwas_finemapping_obj_filtered)
        gwas_df = gwas_df[gwas_df['conditions_top_loci'].str.contains(pattern)]

    gwas_df = gwas_df[gwas_df['region_id'].isin(xqtl_df['block_top_loci'].str.split(',').explode().unique())]

    region_to_combined_data = dict(zip(gwas_df['region_id'], gwas_df['original_data']))
    # and only consider the overlapped GWAS blocks
    xqtl_df['block_data'] = xqtl_df['block_top_loci'].apply(map_blocks_to_data, args=(region_to_combined_data,))
    xqtl_df['block_data'] = xqtl_df['block_data'].str.replace(',NA', '').str.replace('NA,', '')

    return xqtl_df, gwas_df

xqtl_df, gwas_df = process_dataframes(xqtl_meta_data, gwas_meta_data)

region_ids=[]
# If region_list is provided, read the file and extract IDs
if not region_list.is_dir():
    if region_list.is_file():
        region_list_df = pd.read_csv(region_list, sep='\t', header=None, comment = "#")
        region_ids = region_list_df.iloc[:, -1].unique()  # Extracting the last column for IDs
    else:
        raise ValueError("The region_list path provided is not a file.")
        
if len(region_name) > 0:
    region_ids = list(set(region_ids).union(set(region_name)))
    
if len(region_ids) > 0:
    xqtl_df = xqtl_df[xqtl_df['region_id'].isin(region_ids)]
    
    
new_df = generate_condition_based_dataframe(xqtl_df, minp)

# if there is only one original data file, we don't have to map to context meta
if (new_df['QTL_original_data'].str.split(',').apply(lambda x: len(x) in [0, 1]).all() and not context_meta.is_file()):
    filtered_df = new_df.copy()
    filtered_df['GWAS_original_data'] = filtered_df['GWAS_original_data'].apply(make_unique_data)
else:
    filtered_df = merge_and_filter_dfs(new_df, generate_meta_dataframe(context_meta))

filtered_df = prepare_final_paths(filtered_df, qtl_path, gwas_path)

overlapped_regional_data = {
    'data': [row['QTL_original_data'].split(',') for _, row in filtered_df.iterrows()],
    'conditions': [(f"{row['condition']}@{row['region_id']}", *row['GWAS_original_data'].split(',')) for _, row in filtered_df.iterrows()]
}
[xqtl_gwas_enrichment]
depends: sos_variable("regional_data")
stop_if(len(regional_data['data']) == 0, f'No files left for analysis')

meta = regional_data['conditions']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta"
output: f'{cwd:a}/{name}.{_meta[0]}.enrichment.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
    library(tidyverse)
    library(pecotmr)
    # RDS files for GWAS data
    gwas_finemapped_data = c(${paths([x for x in _meta[1:len(_meta)]]):r,})
    # RDS files for xQTL data
    xqtl_finemapped_data = c(${paths([x for x in _input]):r,})
    enrich_result = xqtl_enrichment_wrapper(gwas_files = gwas_finemapped_data, xqtl_files = xqtl_finemapped_data, 
                                                xqtl_finemapping_obj =  c("${_meta[0]}",${",".join(['"%s"' % x  for x in xqtl_finemapping_obj]) if len(xqtl_finemapping_obj) != 0 else "NULL"}), 
                                                xqtl_varname_obj =   c("${_meta[0]}",${",".join(['"%s"' % x  for x in xqtl_varname_obj]) if len(xqtl_varname_obj) != 0 else "NULL"}), 
                                                gwas_finemapping_obj =  c(${",".join(['"%s"' % x for x in gwas_finemapping_obj]) if len(gwas_finemapping_obj) != 0 else "NULL"}), 
                                                gwas_varname_obj =  c(${",".join(['"%s"' % x for x in gwas_varname_obj]) if len(gwas_varname_obj) != 0 else "NULL"}))
    saveRDS(enrich_result, ${_output:ar})
[susie_coloc]
depends: sos_variable("overlapped_regional_data")
# depends: sos_step("xqtl_gwas_enrichment")
stop_if(len(overlapped_regional_data['data']) == 0, f'No files left for analysis')
# skip enrichment or not
parameter: skip_enrich=False
# Filter lbf by credible sets (cs), as in default SuSiE coloc; when False (the default), filter by V > prior_tol instead
parameter: filter_lbf_cs = False
# LD reference path for coloc post-processing
parameter: ld_meta_file_path = path()
meta = overlapped_regional_data['conditions']
input: overlapped_regional_data["data"], group_by = lambda x: group_by_region(x, overlapped_regional_data["data"]), group_with = "meta"
# output: f'{cwd:a}/{step_name[:-2]}/{name}.{_meta[0]}.coloc.rds'
output: f'{cwd:a}/{step_name}/{name}.{_meta[0]}.coloc.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
    library(tidyverse)
    library(pecotmr)
    library(coloc) 
    # RDS files for xQTL data
    xqtl_finemapped_datas = c(${paths([x for x in _input]):r,})
    coloc_res <- list()
    # Results are organized per gene for this context: loop over each xQTL file (one per gene) and colocalize it with its overlapping GWAS blocks
    for(xqtl_finemapped_data in xqtl_finemapped_datas){
      qtl_dat <- readRDS(xqtl_finemapped_data)
      gene = names(qtl_dat)
      context = "${_meta[0]}" %>% str_split(.,"@",simplify = T) %>% .[,1] 
      gene_region = pecotmr:::get_nested_element(qtl_dat[[1]],  c(context,"region_info", "grange"))
  
      # Step 1: find relevant GWAS regions that overlap the xQTL region of interest
      gwas_finemapped_datas <- c(${ ",".join(['"%s"' % x for x in _meta[1:]])}) 
      gwas_finemapped_datas <- gwas_finemapped_datas[grep('rds$',gwas_finemapped_datas)]
 
      # Block IDs in GWAS file names may look like "chr1_101384274_104443097" or "1_101384274-104443097"
      gwas_regions <- gwas_finemapped_datas %>% basename %>% str_extract(., "(chr)?\\d+_\\d+[-_]\\d+")
      overlap_index <- NULL
      for (i in 1:length(gwas_regions)) {
        region <- gwas_regions[i]
        split_region <- unlist(strsplit(region, "[_-]"))
        block_chrom <- as.numeric(gsub("chr", "", split_region[1]))
        block_start <- as.numeric(split_region[2])
        block_end <- as.numeric(split_region[3])
        # A GWAS block overlaps the xQTL region only when both conditions hold (same logic as the Python overlap check above)
        if (gene_region$chrom == block_chrom && gene_region$start <= block_end && gene_region$end >= block_start) {
          overlap_index <- c(overlap_index, i)
        }
      }
  
     if (!is.null(overlap_index)) {
       gwas_finemapped_data <- gwas_finemapped_datas[overlap_index]
  
       # Step 2: load enrichment analysis results
       # Extract values for p1, p2, and p12
     if(${"FALSE" if skip_enrich else "TRUE"}){
       enrich_file = paste0('${cwd:a}','/', '${name}', '.' ,context,'.enrichment.rds')
       p1 <-  readRDS(enrich_file)[[1]]$`Alternative (coloc) p1`
       p2 <-  readRDS(enrich_file)[[1]]$`Alternative (coloc) p2`
       p12 <-  readRDS(enrich_file)[[1]]$`Alternative (coloc) p12`
      
       message("Priors are P1:", p1, "; p2: ", p2, "; p12: ", p12)
     } else {
       p1 = 1e-4
       p2 = 1e-4
       p12 = 5e-6
       message("Priors are using default, P1: 1e-4, P2: 1e-4, P12: 5e-6" )
     }
       # Step 3: Apply colocalization analysis between each condition and GWAS
       coloc_res[[gene]] <- coloc_wrapper(xqtl_file = xqtl_finemapped_data, gwas_files = gwas_finemapped_data, 
                                          xqtl_finemapping_obj =  c(context,${",".join(['"%s"' % x  for x in xqtl_finemapping_obj]) if len(xqtl_finemapping_obj) != 0 else "NULL"}), 
                                          xqtl_varname_obj =   c(context,${",".join(['"%s"' % x  for x in xqtl_varname_obj]) if len(xqtl_varname_obj) != 0 else "NULL"}), 
                                          gwas_finemapping_obj =  c(${",".join(['"%s"' % x for x in gwas_finemapping_obj]) if len(gwas_finemapping_obj) != 0 else "NULL"}), 
                                          gwas_varname_obj =  c(${",".join(['"%s"' % x for x in gwas_varname_obj]) if len(gwas_varname_obj) != 0 else "NULL"}),
                                          xqtl_region_obj =   c(context,${",".join(['"%s"' % x  for x in xqtl_region_obj]) if len(xqtl_region_obj) != 0 else "NULL"}), 
                                          gwas_region_obj =  c(${",".join(['"%s"' % x for x in gwas_region_obj]) if len(gwas_region_obj) != 0 else "NULL"}),
                                          p1 = p1, p2 = p2, p12 = p12, filter_lbf_cs = ${"TRUE" if filter_lbf_cs else "FALSE"})

        if (${"TRUE" if ld_meta_file_path.is_file() else "FALSE"}) {
          if(length(coloc_res[[gene]]) > 2) {
              coloc_res[[gene]] <- coloc_post_processor(coloc_res[[gene]], LD_meta_file_path = ${ld_meta_file_path:r}, analysis_region = coloc_res[[gene]]$analysis_region)
            if(!is.null(coloc_res[[gene]]$sets$cs))  writeLines(coloc_res[[gene]]$sets$cs %>% unlist, gsub("rds$","coloc_res","${_output}"))
          } else { message("No coloc results were generated.") }
        }
      
      } else {
        message("No overlap was found between GWAS blocks and QTL analysis region.")
        coloc_res <-  "No overlap was found between GWAS blocks and QTL analysis region."
      }
    }
    saveRDS(coloc_res, ${_output:r})