xQTL-GWAS pairwise enrichment and colocalization

xQTL-GWAS pairwise enrichment and colocalization#

Run pairwise enrichment and colocalization between xQTL and GWAS SuSiE fine-mapping results on the toy dataset. The workflow first estimates a global enrichment between the two sets of signals, then performs colocalization (coloc) on the overlapping regions.

Description#

This workflow processes fine-mapping results for xQTL, generated by susie_twas in the mnm_regression.ipynb notebook for cis xQTL, and GWAS fine-mapping results produced by susie_rss in the rss_analysis.ipynb notebook. It is designed to perform enrichment and colocalization analysis, particularly when fine-mapping results originate from different regions in the case of cis-xQTL and GWAS. The pipeline is capable to integrate and analyze data across these distinct regions. Originally tailored for cis-xQTL and GWAS integration, this pipeline can be applied to other pairwise integrations. An example of such application is in trans analysis, where the fine-mapped regions might be identical between trans-xQTL and GWAS, representing a special case of this broader implementation.

Input#

Lists of SuSiE fine-mapping output objects, in RDS format, of class(susie) in R.

For xQTL the list is meta-data of format: chr, start, end, original_data, conditions_top_loci, block_top_loci where original_data is an RDS file, conditions_top_loci is showing which contexts have top loci table (potential signals) block_top_loci is the blocks have overlapped top loci variant with xQTL data. That file could be output from fine_mapping_post_processing/overlap_qtl_gwas.
For GWAS the list is meta-data of format: chr, start, end, original_data, block_top_loci... where original_data is an RDS file. That file could be output from fine_mapping_post_processing/gwas_results_export.
Context meta is a metafile that shows the analysis_names and the contained contexts. It can be used to identify the corresponding raw data for each context.
--gwas_meta_data
output from fine_mapping_post_processing/gwas_results_export

#chr	start	end	region_id	TSS	original_data	combined_data	combined_data_sumstats	conditions	conditions_top_loci
chr1	101384274	104443097	1_101384274-104443097	NA	1_101384274-104443097.susie_rss.rds	gwas.1_101384274-104443097.cis_results_db.export.rds	gwas.1_101384274-104443097.cis_results_db.export_sumstats.rds	NA	NA
chr19	44935906	46842901	19_44935906-46842901	NA	19_44935906-46842901.susie_rss.rds	gwas.19_44935906-46842901.cis_results_db.export.rds	gwas.19_44935906-46842901.cis_results_db.export_sumstats.rds	AD_Bellenguez_2022,AD_Jansen_2021,AD_Kunkle_Stage1_2019,AD_Wightman_Full_2021,AD_Wightman_Excluding23andMe_2021,AD_Wightman_ExcludingUKBand23andME_2021	AD_Wightman_ExcludingUKBand23andME_2021.qc_impute,AD_Wightman_ExcludingUKBand23andME_2021.qc_only	AD_Wightman_ExcludingUKBand23andME_2021.qc_impute

--xqtl_meta_data

for enrichment: output from fine_mapping_post_processing/cis_results_export; for coloc: output from fine_mapping_post_processing/overlap_qtl_gwas, the difference between them is without or with last 2 columns: block_top_loci, final_combined_data to document overlapped block info at variant level

#chr	start	end	region_id	TSS	original_data	combined_data	combined_data_sumstats	conditions	conditions_top_loci	block_top_loci	final_combined_data
chr1	167600000	171480000	ENSG00000000457	169894266	MSBB_eQTL.ENSG00000000457.univariate_susie_twas_weights.rds, ROSMAP_Kellis_eQTL.ENSG00000000457.univariate_susie_twas_weights.rds	Fungen_xQTL.ENSG00000000457.cis_results_db.export.rds	Fungen_xQTL.ENSG00000000457.cis_results_db.export_sumstats.rds	BM_10_MSBB_eQTL,BM_22_MSBB_eQTL,BM_36_MSBB_eQTL,BM_44_MSBB_eQTL,Ast_Kellis_eQTL	BM_22_MSBB_eQTL,BM_44_MSBB_eQTL		
chr19	41840000	47960000	ENSG00000130203	44905790	ROSMAP_Kellis_eQTL.ENSG00000130203.univariate_susie_twas_weights.rds, ROSMAP_Bennett_Klein_pQTL.ENSG00000130203.univariate_susie_twas_weights.rds	Fungen_xQTL.ENSG00000130203.cis_results_db.export.rds	Fungen_xQTL.ENSG00000130203.cis_results_db.export_sumstats.rds	Mic_Kellis_eQTL,DLPFC_Bennett_pQTL	Mic_Kellis_eQTL	19_44935906-46842901	Fungen_xQTL.ENSG00000130203.cis_results_db.export.overlapped.gwas.rds
chr20	49934867	53560000	ENSG00000000419	50958554	MSBB_eQTL.ENSG00000000419.univariate_susie_twas_weights.rds	Fungen_xQTL.ENSG00000000419.cis_results_db.export.rds	Fungen_xQTL.ENSG00000000419.cis_results_db.export_sumstats.rds	BM_10_MSBB_eQTL,BM_22_MSBB_eQTL	BM_22_MSBB_eQTL	Fungen_xQTL.ENSG00000000419.cis_results_db.export.overlapped.gwas.rds

--context_meta

should be updated as more analysis is done, we used analysis_name and context in this analysis

analysis_name	cohort	context
KNIGHT_pQTL	KNIGHT	Knight_pQTL_brain,Knight_eQTL_brain
MSBB_eQTL	MSBB	BM_10_MSBB_eQTL,BM_22_MSBB_eQTL,BM_36_MSBB_eQTL,BM_44_MSBB_eQTL
MIGA_eQTL	MIGA	MiGA_GTS_eQTL,MiGA_SVZ_eQTL,MiGA_THA_eQTL

Analytical Logic#

Enrichment

Identify GWAS Blocks: Select GWAS blocks with top loci from Bellenguez data and using a single variant regression method. Locate the original filenames (original_data) for these blocks and map the analysis regions to identify overlapping gene analysis regions and corresponding QTL files with top loci table (condition_top_loci).
Contexts in xQTL Meta-data: Find contexts within the xQTL meta-data that include top loci results for each gene. Retrieve the corresponding original xQTL files (original_data) by referencing the context meta-data.
Execute xQTL GWAS Enrichment: Perform xqtl_gwas_enrichment analysis to generate enrichment parameters (a0, a1) and calculate priors (p1, p2, p12).

Coloc

xQTL Meta-data Contexts: Within the xQTL meta-data, identify contexts that include top loci results for each gene, indicated by condition_top_loci. Retrieve the corresponding original xQTL files (original_data) by referring back to the context meta-data. (Optional) Subset the QTL table with specifies gene list.
Original GWAS Files: Identify GWAS blocks that contain overlapping top loci variants for each gene (block_top_loci) in above file. Utilize the GWAS list to locate the original filenames (original_data) for the identified GWAS block candidates.
Enrichment and Coloc Analysis: Automatically execute xqtl_gwas_enrichment to enable enrichment analysis, generating priors for subsequent coloc analysis through logic defined in enrichment. Then, apply susie_coloc for coloc analysis on the selected xQTL and GWAS original files, analyzing each gene under each identified condition.

Steps#

Step 1. Enrichment. Estimate the global xQTL-GWAS enrichment. get_analysis_regions first collects the overlapped regions per gene/block, then xqtl_gwas_enrichment produces the enrichment object used as the prior for coloc.

Timing: ~varies on typical compute infrastructure.

sos run pipeline/SuSiE_enloc.ipynb xqtl_gwas_enrichment \
    --gwas-meta-data input/susie_enloc_data/protocol_example.enloc.gwas_meta.tsv \
    --xqtl-meta-data input/susie_enloc_data/protocol_example.enloc.xqtl_meta.tsv \
    --xqtl-finemapping-obj preset_variants_result susie_result_trimmed \
    --xqtl-varname-obj preset_variants_result variant_names \
    --gwas-finemapping-obj AD_Bellenguez_2022 RSS_QC_RAISS_imputed susie_result_trimmed \
    --gwas-varname-obj  AD_Bellenguez_2022 RSS_QC_RAISS_imputed variant_names \
    --xqtl-region-obj   region_info grange \
    --qtl-path input/susie_enloc_data \
    --gwas-path input/susie_enloc_data \
    --context-meta input/susie_enloc_data/protocol_example.enloc.context_meta.tsv \
    --cwd output/xqtl_gwas_enrichment

Step 2. Coloc. Run colocalization on the overlapping regions. --skip-enrich reuses the enrichment object produced in Step 1, so the *enrichment.rds file from that step must be present in --cwd. Otherwise enrichment is triggered automatically. get_overlapped_analysis_regions selects the regions with overlapping variants before susie_coloc runs.

sos run pipeline/SuSiE_enloc.ipynb susie_coloc \
    --gwas-meta-data input/susie_enloc_data/protocol_example.enloc.gwas_meta.tsv \
    --xqtl-meta-data input/susie_enloc_data/protocol_example.enloc.xqtl_meta.tsv \
    --xqtl-finemapping-obj preset_variants_result susie_result_trimmed \
    --xqtl-varname-obj preset_variants_result variant_names \
    --gwas-finemapping-obj AD_Bellenguez_2022 RSS_QC_RAISS_imputed susie_result_trimmed \
    --gwas-varname-obj  AD_Bellenguez_2022 RSS_QC_RAISS_imputed variant_names \
    --xqtl-region-obj   region_info grange \
    --qtl-path input/susie_enloc_data \
    --gwas-path input/susie_enloc_data \
    --context-meta input/susie_enloc_data/protocol_example.enloc.context_meta.tsv \
    --ld-meta-file-path input/ld_reference/protocol_example.ld_meta_file.tsv \
    --skip-enrich \
    --cwd output/susie_coloc

Command interface#

sos run pipeline/SuSiE_enloc.ipynb -h

Workflow implementation#

[global]
# Workdir
parameter: cwd = path("output")
# A list of file paths for fine-mapped GWAS results. 
parameter: gwas_meta_data = path
# A list of file paths for fine-mapped xQTL results. 
parameter: xqtl_meta_data = path
# Optional: if a region list is provide the enrichment analysis will be focused on provided region. 
# The LAST column of this list will contain the ID of regions to focus on
parameter: region_list = path()
# Optional: if a region name is provided 
# the analysis would be focused on the union of provides region list and region names
parameter: region_name = []
# It is required to input the name of the analysis
parameter: name = f"{xqtl_meta_data:bnn}.{gwas_meta_data:bnn}"
parameter: container = ""
import re
# For cluster jobs, number commands to run per job
parameter: job_size = 200
# Wall clock time expected
parameter: walltime = "5m"
# Memory expected: quite large for enrichment analysis but small for xQTL colocalization
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 1
#Optional table name in xQTL RDS files to get fine mapping results (eg 'susie_result_trimmed ').
parameter: xqtl_finemapping_obj = []
#Optional table name in xQTL RDS files to get variable names (eg ' variant_names ').
parameter: xqtl_varname_obj = []
#Optional table name in GWAS RDS files to get fine mapping results (eg 'AD_Bellenguez_2022 single_effect_regression susie_result_trimmed ').
parameter: gwas_finemapping_obj = []
#Optional table name in GWAS RDS files to get variable names(eg 'AD_Bellenguez_2022 single_effect_regression variant_names ').
parameter: gwas_varname_obj = []
#Optional table name in xQTL RDS files to get region info (eg 'susie_result_trimmed region_info grange ').
parameter: xqtl_region_obj = []
#Optional table name in GWAS RDS files to get region info (eg 'AD_Bellenguez_2022 single_effect_regression region_info grange ').
parameter: gwas_region_obj = []
#Directory path for GWAS orignal finemapping results 
parameter: gwas_path = ''
#Directory path for xQTL orignal finemapping results 
parameter: qtl_path = ''
# a meta file showing the context and corresponding analysis_name
parameter: context_meta = path()
# Optional: if a region list is provide the analysis will be focused on provided region. 
# The LAST column of this list will contain the ID of regions to focus on
# Otherwise, all regions with both genotype and phenotype files will be analyzed
parameter: region_list = path()
# Optional: if a region name is provided 
# the analysis would be focused on the union of provides region list and region names
parameter: region_name = []
# use conditions_top_loci_minp column or not, which can reduce a lot computing resource by pick top 1 isoform
parameter: minp = False
import os
import pandas as pd
import re
def group_by_region(lst, partition):
    # from itertools import accumulate
    # partition = [len(x) for x in partition]
    # Compute the cumulative sums once
    # cumsum_vector = list(accumulate(partition))
    # Use slicing based on the cumulative sums
    # return [lst[(cumsum_vector[i-1] if i > 0 else 0):cumsum_vector[i]] for i in range(len(partition))]
    return partition

def make_unique_data(data):
    return ','.join(set(data.split(',')))

def generate_meta_dataframe(meta_data_path):
    """Generate a new long metadata by converting to context:analysis_name (1:1)."""
    meta = pd.read_csv(meta_data_path, sep='\t')
    new_meta = pd.DataFrame()
    contexts = pd.unique(meta['context'].str.split(',', expand=True).stack().str.strip())

    for context in contexts:
        if not isinstance(context, str):
            continue
        mask = meta['context'].str.contains(context)
        tmp = pd.DataFrame({
            'context': [context],
            'analysis_name': [','.join(meta.loc[mask, 'analysis_name'])]
        })
        new_meta = pd.concat([new_meta, tmp], ignore_index=True)

    return new_meta

def generate_condition_based_dataframe(xqtl_df, minp):
    """Generate a new long table by converting to condition and region:original_data (1:1)."""
    # only consider the QTL contexts that have top loci data
    condition_column = 'conditions_top_loci_minp' if minp else 'conditions_top_loci'
    conditions = pd.unique(xqtl_df[condition_column].str.split(',', expand=True).stack().str.strip())
    new_df = pd.DataFrame()

    for condition in conditions:
        if not isinstance(condition, str):
            continue
        escaped_condition = re.escape(condition)
        mask = xqtl_df[condition_column].str.contains(escaped_condition, case=False, regex=True)
        
        # Iterate through each unique region in the filtered data
        unique_regions = pd.unique(xqtl_df.loc[mask, 'region_id'])
        for region_id in unique_regions:
            region_mask = (xqtl_df['region_id'] == region_id) & mask
            tmp = pd.DataFrame({
                'condition': [condition],
                'region_id': [region_id],
                'QTL_original_data': [','.join(xqtl_df.loc[region_mask, 'original_data'])],
                'GWAS_original_data': [','.join(xqtl_df.loc[region_mask, 'block_data'])]
            })
            new_df = pd.concat([new_df, tmp], ignore_index=True)

    return new_df


def merge_and_filter_dfs(new_df, new_meta):
    """Merge and filter the dataframes based on condition/context where context is a substring of condition, and analysis_name to pick the corresponding original files."""
    # Explode the QTL data into separate rows
    new_df = new_df.set_index(['condition', 'region_id', 'GWAS_original_data'])['QTL_original_data'].str.split(',', expand=True).stack().reset_index(name='QTL_original_data').drop('level_3', axis=1)
    new_df['analysis_name_prefix'] = new_df['QTL_original_data'].apply(lambda x: x.split('.')[0])

    # Create a custom merge logic to match context as a substring of condition
    def custom_merge(row, df_meta):
        # Filter meta dataframe to find any context that is part of the condition string
        sub_df = df_meta[df_meta['context'].apply(lambda x: x in row['condition'])]
        if not sub_df.empty:
            return sub_df
        else:
            return pd.DataFrame(columns=df_meta.columns)  # Return an empty DataFrame with the same columns if no match found

    # Apply custom merge function row-wise and concatenate results
    results = [custom_merge(row, new_meta) for index, row in new_df.iterrows()]
    merged_df = pd.concat(results, keys=new_df.index).reset_index(level=1, drop=True).join(new_df, how='outer')

    # Filter rows where analysis_name_prefix matches analysis_name exactly
    filtered_df = merged_df[merged_df['analysis_name_prefix'].str.strip() == merged_df['analysis_name'].str.strip()]

    # Drop unnecessary columns and duplicates
    filtered_df = filtered_df.drop(columns=['analysis_name_prefix', 'context']).drop_duplicates()
    filtered_df['GWAS_original_data'] = filtered_df['GWAS_original_data'].apply(make_unique_data)

    return filtered_df

def prepare_final_paths(filtered_df, qtl_path, gwas_path):
    """Prepare final paths for QTL and GWAS original data."""
    filtered_df['QTL_original_data'] = filtered_df['QTL_original_data'].apply(
        lambda x: ','.join(f"{qtl_path}/{file_name.strip()}" for file_name in x.split(','))
    )
    filtered_df['GWAS_original_data'] = filtered_df['GWAS_original_data'].apply(
        lambda x: ','.join(f"{gwas_path}/{file_name.strip()}" for file_name in x.split(','))
    )
    return filtered_df

# get overlapped regions by gene and block region for enrichment analysis
[get_analysis_regions: shared = "regional_data"]
import pandas as pd

def process_dataframes_with_toploci(xqtl_meta_data, gwas_meta_data):
    """Process the XQTL and GWAS dataframes."""
    xqtl_df = pd.read_csv(xqtl_meta_data, sep="\t")
    # filter xQTL data with the ones have overlapped top loci variants with GWAS data
    xqtl_df = xqtl_df[xqtl_df['conditions_top_loci_minp' if minp else 'conditions_top_loci'].notna()]

    gwas_df = pd.read_csv(gwas_meta_data, sep="\t")
    gwas_df = gwas_df[gwas_df['conditions_top_loci'].notna()]
    # e.g. gwas_finemapping_obj = ['AD_Bellenguez_2022', 'single_effect_regression', 'susie_result_trimmed']
    # filter gwas data with find variants in specified cohort 'AD_Bellenguez_2022' and method 'single_effect_regression'
    gwas_finemapping_obj_filtered = [obj for obj in gwas_finemapping_obj if obj != 'susie_result_trimmed']
    if len(gwas_finemapping_obj_filtered) > 0:
        pattern = '|'.join(gwas_finemapping_obj_filtered)
        gwas_df = gwas_df[gwas_df['conditions_top_loci'].str.contains(pattern)]

    xqtl_df = overlapped_analysis_region(xqtl_df, gwas_df)
    return xqtl_df, gwas_df



def overlapped_analysis_region(xqtl_df, gwas_df):
    # Create an empty dataframe to store overlapping xQTL records
    overlapped_xqtl_df = pd.DataFrame()
    
    # Iterate over each row in the GWAS dataframe
    for index, gwas_row in gwas_df.iterrows():
        gwas_chr = gwas_row['#chr']
        gwas_start = gwas_row['start']
        gwas_end = gwas_row['end']
        gwas_block_id = gwas_row['original_data']  # Assuming each GWAS row has a unique block ID
        
        # Filter xQTL dataframe for the same chromosome
        same_chr_xqtl_df = xqtl_df[xqtl_df['#chr'] == gwas_chr]
        
        # Find overlapping xQTL regions with the current GWAS region
        overlapping_xqtl = same_chr_xqtl_df[((same_chr_xqtl_df['start'] <= gwas_end) & 
                                             (same_chr_xqtl_df['end'] >= gwas_start))].copy() # Add .copy() to avoid SettingWithCopyWarning
        
        # Check if there are any overlapping xQTLs
        if not overlapping_xqtl.empty:
            # Use .loc to safely assign 'block_data' to avoid the SettingWithCopyWarning
            overlapping_xqtl.loc[:, 'block_data'] = gwas_block_id
            
            # Append the overlapping xQTLs to the accumulated dataframe
            overlapped_xqtl_df = pd.concat([overlapped_xqtl_df, overlapping_xqtl], ignore_index=True)
        
    # Group by xQTL region and concatenate 'block_data'
    overlapped_xqtl_df['block_data'] = overlapped_xqtl_df.groupby(['#chr', 'start', 'end'])['block_data'].transform(lambda x: ','.join(x))
    overlapped_xqtl_df = overlapped_xqtl_df.drop_duplicates().reset_index(drop=True)

    return overlapped_xqtl_df


xqtl_df, gwas_df = process_dataframes_with_toploci(xqtl_meta_data, gwas_meta_data)

region_ids=[]
# If region_list is provided, read the file and extract IDs
if not region_list.is_dir():
    if region_list.is_file():
        region_list_df = pd.read_csv(region_list, sep='\t', header=None, comment = "#")
        region_ids = region_list_df.iloc[:, -1].unique()  # Extracting the last column for IDs
    else:
        raise ValueError("The region_list path provided is not a file.")
        
if len(region_name) > 0:
    region_ids = list(set(region_ids).union(set(region_name)))
    
if len(region_ids) > 0:
    xqtl_df = xqtl_df[xqtl_df['region_id'].isin(region_ids)]
    
new_df = generate_condition_based_dataframe(xqtl_df, minp)

# if there is only one original data file, we don't have to map to context meta
if (new_df['QTL_original_data'].str.split(',').apply(lambda x: len(x) in [0, 1]).all() and not context_meta.is_file()):
    filtered_df = new_df.copy()
    filtered_df['GWAS_original_data'] = filtered_df['GWAS_original_data'].apply(make_unique_data)
else:
    filtered_df = merge_and_filter_dfs(new_df, generate_meta_dataframe(context_meta))

filtered_df = prepare_final_paths(filtered_df, qtl_path, gwas_path)
filtered_df = filtered_df.groupby('condition').agg({
    'GWAS_original_data': lambda x: ','.join(x),
    'QTL_original_data': lambda x: ','.join(x)
}).reset_index()
regional_data = {
    'data': [row['QTL_original_data'].split(',') for _, row in filtered_df.iterrows()],
    'conditions': [(f"{row['condition']}", *row['GWAS_original_data'].split(',')) for _, row in filtered_df.iterrows()]
}

# get overlapped regions from flatten table, to get the regions with overlapped variants and select those regions for coloc analysis
[get_overlapped_analysis_regions: shared = "overlapped_regional_data"]
import pandas as pd
def map_blocks_to_data(blocks, mapping_dict):
    """Map blocks to data using a provided mapping dictionary."""
    mapped_data = ','.join(mapping_dict.get(block.strip(), 'NA') for block in blocks.split(','))
    return mapped_data

def process_dataframes(xqtl_meta_data, gwas_meta_data):
    """Process the XQTL and GWAS dataframes."""
    xqtl_df = pd.read_csv(xqtl_meta_data, sep="\t")
    # filter xQTL data with the ones have overlapped top loci variants with GWAS data
    xqtl_df = xqtl_df[xqtl_df['block_top_loci'].notna()]

    gwas_df = pd.read_csv(gwas_meta_data, sep="\t")    
    gwas_df = gwas_df[gwas_df['conditions_top_loci'].notna()]
    # e.g. gwas_finemapping_obj = ['AD_Bellenguez_2022', 'single_effect_regression', 'RSS_QC_RAISS_imputed']
    # filter gwas data with find variants in specified cohort 'AD_Bellenguez_2022' and method 'RSS_QC_RAISS_imputed'
    gwas_finemapping_obj_filtered = [obj for obj in gwas_finemapping_obj if obj != 'susie_result_trimmed' and obj != 'RSS_QC_RAISS_imputed']
    if len(gwas_finemapping_obj_filtered) > 0:
        pattern = '|'.join(gwas_finemapping_obj_filtered)
        gwas_df = gwas_df[gwas_df['conditions_top_loci'].str.contains(pattern)]

    gwas_df = gwas_df[gwas_df['region_id'].isin(xqtl_df['block_top_loci'].str.split(',').explode().unique())]

    region_to_combined_data = dict(zip(gwas_df['region_id'], gwas_df['original_data']))
    # and only consider the overlapped GWAS blocks
    xqtl_df['block_data'] = xqtl_df['block_top_loci'].apply(map_blocks_to_data, args=(region_to_combined_data,))
    xqtl_df['block_data'] = xqtl_df['block_data'].str.replace(',NA', '').str.replace('NA,', '')

    return xqtl_df, gwas_df

xqtl_df, gwas_df = process_dataframes(xqtl_meta_data, gwas_meta_data)

region_ids=[]
# If region_list is provided, read the file and extract IDs
if not region_list.is_dir():
    if region_list.is_file():
        region_list_df = pd.read_csv(region_list, sep='\t', header=None, comment = "#")
        region_ids = region_list_df.iloc[:, -1].unique()  # Extracting the last column for IDs
    else:
        raise ValueError("The region_list path provided is not a file.")
        
if len(region_name) > 0:
    region_ids = list(set(region_ids).union(set(region_name)))
    
if len(region_ids) > 0:
    xqtl_df = xqtl_df[xqtl_df['region_id'].isin(region_ids)]
    
    
new_df = generate_condition_based_dataframe(xqtl_df, minp)

# if there is only one original data file, we don't have to map to context meta
if (new_df['QTL_original_data'].str.split(',').apply(lambda x: len(x) in [0, 1]).all() and not context_meta.is_file()):
    filtered_df = new_df.copy()
    filtered_df['GWAS_original_data'] = filtered_df['GWAS_original_data'].apply(make_unique_data)
else:
    filtered_df = merge_and_filter_dfs(new_df, generate_meta_dataframe(context_meta))

filtered_df = prepare_final_paths(filtered_df, qtl_path, gwas_path)

overlapped_regional_data = {
    'data': [row['QTL_original_data'].split(',') for _, row in filtered_df.iterrows()],
    'conditions': [(f"{row['condition']}@{row['region_id']}", *row['GWAS_original_data'].split(',')) for _, row in filtered_df.iterrows()]
}

[xqtl_gwas_enrichment]
depends: sos_variable("regional_data")
stop_if(len(regional_data['data']) == 0, f'No files left for analysis')

meta = regional_data['conditions']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta"
output: f'{cwd:a}/{name}.{_meta[0]}.enrichment.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container
    set -e
    # Per-region xQTL files for this group, and the GWAS rds entries in _meta[1:].
    # Convert each side of this region to a pecotmr S4 collection, then run
    # qtlEnrichmentPipeline on the pair.
    Rscript code/script/pecotmr_integration/legacy_enloc_finemap_convert.R \
        --mode qtl \
        --rds-files ${",".join([str(x) for x in _input])} \
        --finemapping-obj '${" ".join(xqtl_finemapping_obj)}' \
        --varname-obj '${" ".join(xqtl_varname_obj)}' \
        --study ${name} \
        --output ${_output:n}.qtl.rds

    Rscript code/script/pecotmr_integration/legacy_enloc_finemap_convert.R \
        --mode gwas \
        --rds-files ${",".join([str(x) for x in _meta[1:] if str(x).endswith(".rds")])} \
        --finemapping-obj '${" ".join(gwas_finemapping_obj)}' \
        --varname-obj '${" ".join(gwas_varname_obj)}' \
        --output ${_output:n}.gwas.rds

    Rscript code/script/pecotmr_integration/qtl_enrichment.R \
        --qtl-fine-mapping ${_output:n}.qtl.rds \
        --gwas-fine-mapping ${_output:n}.gwas.rds \
        --ncore ${numThreads} \
        --output ${_output}

[susie_coloc]
depends: sos_variable("overlapped_regional_data")
# depends: sos_step("xqtl_gwas_enrichment") #changed
stop_if(len(overlapped_regional_data['data']) == 0, f'No files left for analysis')
# skip enrichment or not
parameter: skip_enrich=False
# filter lbf by cs as default in susie coloc. default is False means filtering by V > prior_tol
parameter: filter_lbf_cs = False
# ld reference path for postprocessing 
parameter: ld_meta_file_path = path()
meta = overlapped_regional_data['conditions']
input: overlapped_regional_data["data"], group_by = lambda x: group_by_region(x, overlapped_regional_data["data"]), group_with = "meta"
# output: f'{cwd:a}/{step_name[:-2]}/{name}.{_meta[0]}.coloc.rds'
output: f'{cwd:a}/{step_name}/{name}.{_meta[0]}.coloc.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container
    set -e
    # context (before '@' in _meta[0]) selects the per-region enrichment file the
    # enrichment step wrote: {cwd}/{name}.{context}.enrichment.rds (legacy naming).
    # Convert this region's xQTL + GWAS sides, then run colocPipeline (enloc
    # variant when --enrichment is supplied).
    Rscript code/script/pecotmr_integration/legacy_enloc_finemap_convert.R \
        --mode qtl \
        --rds-files ${",".join([str(x) for x in _input])} \
        --finemapping-obj '${" ".join(xqtl_finemapping_obj)}' \
        --varname-obj '${" ".join(xqtl_varname_obj)}' \
        --study ${name} \
        --output ${_output:n}.qtl.rds

    Rscript code/script/pecotmr_integration/legacy_enloc_finemap_convert.R \
        --mode gwas \
        --rds-files ${",".join([str(x) for x in _meta[1:] if str(x).endswith(".rds")])} \
        --finemapping-obj '${" ".join(gwas_finemapping_obj)}' \
        --varname-obj '${" ".join(gwas_varname_obj)}' \
        --output ${_output:n}.gwas.rds

    Rscript code/script/pecotmr_integration/coloc.R \
        --qtl-fine-mapping ${_output:n}.qtl.rds \
        --gwas-input ${_output:n}.gwas.rds \
        ${('--enrichment '+cwd.absolute().joinpath(name + "." + _meta[0].split("@")[0] + ".enrichment.rds").as_posix()) if not skip_enrich else ''} \
        --output ${_output}

Anticipated Results#

Enrichment analysis results — this is a global enrichment estimate that combines all input data for each context.
Colocalization results for regions of interest — if ld_meta_file_path is provided, would also report coloc cs by coloc postprocessing.

xQTL-GWAS pairwise enrichment and colocalization

Contents

xQTL-GWAS pairwise enrichment and colocalization#

Description#

Input#

Analytical Logic#

Analytical Logic#

Steps#

Command interface#

Workflow implementation#

Anticipated Results#