Quantile regression for QTL association testing#
Description#
This module performs quantile QTL (QR) association testing and computes quantile TWAS weights. For each genomic region it fits quantile regression across quantile levels (0.05, 0.1, …, 0.95) and integrates the quantile-level p-values into a single QR p-value using the Cauchy combination method, then derives TWAS weights for use in downstream gene-trait association.
The analysis is region-based: regions can be selected by name (--region-name), by a region list (--region-list), or, if neither is given, the union of all regions present in the phenotype region lists is analyzed.
Timing: ~2-5 min on typical compute infrastructure.
Step 1: Quantile QTL association and TWAS weight#
For each region the workflow fits quantile regression of the molecular phenotype on genotype across the quantile grid, combines the per-quantile p-values into a single QR p-value via the Cauchy combination method, and computes quantile TWAS weights. Covariates (genotype PCs, hidden factors, fixed covariates) are regressed out, and cis/trans windows are taken from a customized association-window file when provided, otherwise a fixed cis-window around each region is used.
Inputs
Option |
Description |
|---|---|
|
Either a list of per-chromosome genotype files, or one PLINK |
|
One or more lists of phenotype files per region ( |
|
One or more covariate files matched to the phenotype lists (genotype PCs, hidden factors, fixed covariates). |
|
Optional 4-column file (chr, start, end, region ID) defining cis/trans windows. If omitted, a fixed cis-window around each region is used. |
|
Optional ways to restrict the analysis to selected regions. Region IDs must match the 4th column of the phenotype list. |
|
Names for the phenotypic conditions ( |
Example genotype list (--genoFile):
#chr path
chr21 protocol_example.genotype.chr21.bed
chr22 protocol_example.genotype.chr22.bed
Alternatively use protocol_example.genotype.chr21_22.bed if all chromosomes are in one file.
Example association-window file (strictly 4 columns, header commented out):
#chr start end gene_id
chr10 0 6480000 ENSG00000008128
chr1 0 6480000 ENSG00000008130
chr1 0 6480000 ENSG00000067606
Output
The pipeline writes results to the --cwd directory, named after the workflow step. Each region yields a quantile-QTL association result table (per-quantile and integrated Cauchy QR p-values) and the corresponding quantile TWAS weight object used downstream.
sos run pipeline/qr_and_twas.ipynb quantile_qtl_twas_weight \
--name protocol_example_protein \
--genoFile output/plink/protocol_example.genotype.chr22.bed \
--phenoFile output/phenotype_protein/protocol_example_protein.phenotype_by_chrom_files.region_list.txt \
--covFile output/covariate_protein/protocol_example_protein.chr22.protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \
--customized-association-windows input/reference_data/TAD/protocol_example_protein.enhanced_cis_chr22.bed \
--region-name ENSG00000241973_P42356 \
--cwd output/quantile_twas \
--phenotype-names protein
sos run pipeline/qr_and_twas.ipynb quantile_qtl_twas_weight \
--name protocol_example_protein \
--genoFile output/plink/protocol_example.genotype.chr22.bed \
--phenoFile output/phenotype_protein/protocol_example_protein.phenotype_by_chrom_files.region_list.txt \
--covFile output/covariate_protein/protocol_example_protein.chr22.protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \
--customized-association-windows input/reference_data/TAD/protocol_example_protein.enhanced_cis_chr22.bed \
--region-list input/reference_data/TAD/protocol_example_protein.enhanced_cis_chr22.bed \
--cwd output/quantile_twas \
--phenotype-names protein
sos run pipeline/qr_and_twas.ipynb -h
Workflow implementation#
The cells below define the SoS workflow steps used by the commands above. They are not meant to be edited for routine analysis.
[global]
# It is required to input the name of the analysis
parameter: name = str
parameter: cwd = path("output")
# A list of file paths for genotype data, or the genotype data itself.
parameter: genoFile = path
# One or multiple lists of file paths for phenotype data.
parameter: phenoFile = paths
# One or multiple lists of file paths for phenotype ID mapping file. The first column should be the original ID, the 2nd column should be the ID to be mapped to.
parameter: phenoIDFile = paths()
# Covariate file path
parameter: covFile = paths
# Optional: if a region list is provide the analysis will be focused on provided region.
# The LAST column of this list will contain the ID of regions to focus on
# Otherwise, all regions with both genotype and phenotype files will be analyzed
parameter: region_list = path()
# Optional: if a region name is provided
# the analysis would be focused on the union of provides region list and region names
parameter: region_name = []
# Only focus on a subset of samples
parameter: keep_samples = path()
# An optional list documenting the custom association window for each region to analyze, with four column, chr, start, end, region ID (eg gene ID).
# If this list is not provided, the default `window` parameter (see below) will be used.
parameter: customized_association_windows = path()
# Specify the cis window for the up and downstream radius to analyze around the region of interest in units of bp
# When this is set to negative, we will rely on using customized_association_windows
parameter: cis_window = -1
# save data object or not
parameter: save_data = False
# Name of phenotypes
parameter: phenotype_names = [f'{x:bn}' for x in phenoFile]
parameter: seed = 999
# remove a variant if it has more than imiss missing individual level data
parameter: imiss = 1.0
# MAF cutoff
parameter: maf = 0.0025
# MAC cutoff, on top of MAF cutoff
parameter: mac = 5
# Remove indels if indel = False
parameter: indel = True
parameter: min_twas_maf = 0.01
parameter: screen_threshold = 0.01
parameter: screen_method = "qvalue"
parameter: screen_significant = True
parameter: pre_filter_by_pqr = False
parameter: initial_corr_filter_cutoff = 0.8
parameter: full_rank_corr_filter_cutoff = "seq(0.75, 0.5, by = -0.05)"
parameter: ld_reference_meta_file = path()
# Only focus on a subset of variants
parameter: keep_variants = path()
parameter: marginal_beta_calculate = True
parameter: twas_weight_calculate = True
parameter: qrank_screen_calculate = True
parameter: vqtl_calculate = True
# Analysis environment settings
parameter: container = ""
# For cluster jobs, number commands to run per job
parameter: job_size = 200
# Wall clock time expected
parameter: walltime = "1h"
# Memory expected
parameter: mem = "20G"
# Number of threads
parameter: numThreads = 1
if len(phenoFile) != len(covFile):
raise ValueError("Number of input phenotypes files must match that of covariates files")
if len(phenoFile) != len(phenotype_names):
raise ValueError("Number of input phenotypes files must match the number of phenotype names")
if len(phenoIDFile) > 0 and len(phenoFile) != len(phenoIDFile):
raise ValueError("Number of input phenotypes files must match the number of phenotype ID mapping files")
def group_by_region(lst, partition):
# from itertools import accumulate
# partition = [len(x) for x in partition]
# Compute the cumulative sums once
# cumsum_vector = list(accumulate(partition))
# Use slicing based on the cumulative sums
# return [lst[(cumsum_vector[i-1] if i > 0 else 0):cumsum_vector[i]] for i in range(len(partition))]
return partition
import os
import pandas as pd
def adapt_file_path(file_path, reference_file):
"""
Adapt a single file path based on its existence and a reference file's path.
Args:
- file_path (str): The file path to adapt.
- reference_file (str): File path to use as a reference for adaptation.
Returns:
- str: Adapted file path.
Raises:
- FileNotFoundError: If no valid file path is found.
"""
reference_path = os.path.dirname(reference_file)
# Check if the file exists
if os.path.isfile(file_path):
return file_path
# Check file name without path
file_name = os.path.basename(file_path)
if os.path.isfile(file_name):
return file_name
# Check file name in reference file's directory
file_in_ref_dir = os.path.join(reference_path, file_name)
if os.path.isfile(file_in_ref_dir):
return file_in_ref_dir
# Check original file path prefixed with reference file's directory
file_prefixed = os.path.join(reference_path, file_path)
if os.path.isfile(file_prefixed):
return file_prefixed
# If all checks fail, raise an error
raise FileNotFoundError(f"No valid path found for file: {file_path}")
def adapt_file_path_all(df, column_name, reference_file):
return df[column_name].apply(lambda x: adapt_file_path(x, reference_file))
[get_analysis_regions: shared = "regional_data"]
# input is genoFile, phenoFile, covFile and optionally region_list. If region_list presents then we only analyze what's contained in the list.
# regional_data should be a dictionary like:
#{'data': [("genotype_1.bed", "phenotype_1.bed.gz", "covariate_1.gz"), ("genotype_2.bed", "phenotype_1.bed.gz", "phenotype_2.bed.gz", "covariate_1.gz", "covariate_2.gz") ... ],
# 'meta_info': [("chr12:752578-752579","chr12:752577-752580", "gene_1", "trait_1"), ("chr13:852580-852581","chr13:852579-852580", "gene_2", "trait_1", "trait_2") ... ]}
import numpy as np
def preload_id_map(id_map_files):
id_maps = {}
for id_map_file in id_map_files:
if id_map_file is not None and os.path.isfile(id_map_file):
df = pd.read_csv(id_map_file, sep=r'\s+', header=None, comment='#', names=['old_ID', 'new_ID'])
id_maps[id_map_file] = df.set_index('old_ID')['new_ID'].to_dict()
return id_maps
def load_and_apply_id_map(pheno_path, id_map_path, preloaded_id_maps):
pheno_df = pd.read_csv(pheno_path, sep=r"\s+", header=0)
pheno_df['Original_ID'] = pheno_df['ID']
if id_map_path in preloaded_id_maps:
id_map = preloaded_id_maps[id_map_path]
pheno_df['ID'] = pheno_df['ID'].map(id_map).fillna(pheno_df['ID'])
return pheno_df
def filter_by_region_ids(data, region_ids):
if region_ids is not None and len(region_ids) > 0:
# Check both ID (mapped) and Original_ID (unmapped) to handle phenoIDFile cases
# where region_name uses the original ID but ID column has been mapped
if 'Original_ID' in data.columns:
data = data[data['ID'].isin(region_ids) | data['Original_ID'].isin(region_ids)]
else:
data = data[data['ID'].isin(region_ids)]
return data
def custom_join(series):
# Initialize an empty list to hold the processed items
result = []
for item in series:
if ',' in item:
# If the item contains commas, split by comma and convert to tuple
result.append(tuple(item.split(',')))
else:
# If the item does not contain commas, add it directly
result.append(item)
# Convert the list of items to a tuple and return
return tuple(result)
def aggregate_phenotype_data(accumulated_pheno_df):
if not accumulated_pheno_df.empty:
accumulated_pheno_df = accumulated_pheno_df.groupby(['#chr','ID','cond','path','cov_path'], as_index=False).agg({
'#chr': lambda x: np.unique(x).astype(str)[0],
'ID': lambda x: np.unique(x).astype(str)[0],
'Original_ID': ','.join,
'start': 'min',
'end': 'max'
}).groupby(['#chr','ID'], as_index=False).agg({
'cond': ','.join,
'path': ','.join,
'Original_ID': custom_join,
'cov_path': ','.join,
'start': 'min',
'end': 'max'
})
return accumulated_pheno_df
def process_cis_files(pheno_files, cov_files, phenotype_names, pheno_id_files, region_ids, preloaded_id_maps):
'''
Example output:
#chr start end ID Original_ID path cov_path cond
chr12 752578 752579 ENSG00000060237 Q9H4A3,P62873 protocol_example.protein_1.bed.gz,protocol_example.protein_2.bed.gz covar_1.gz,covar_2.gz trait_A,trait_B
'''
accumulated_pheno_df = pd.DataFrame()
pheno_id_files = [None] * len(pheno_files) if len(pheno_id_files) == 0 else pheno_id_files
for pheno_path, cov_path, phenotype_name, id_map_path in zip(pheno_files, cov_files, phenotype_names, pheno_id_files):
if not os.path.isfile(cov_path):
raise FileNotFoundError(f"No valid path found for file: {cov_path}")
pheno_df = load_and_apply_id_map(pheno_path, id_map_path, preloaded_id_maps)
pheno_df = filter_by_region_ids(pheno_df, region_ids)
if not pheno_df.empty:
# Use 'path' column by name to handle both 5-col (#chr,start,end,ID,path)
# and 6-col (#chr,start,end,ID,strand,path) region_list formats
pheno_df['path'] = adapt_file_path_all(pheno_df, 'path', f"{pheno_path:a}")
pheno_df = pheno_df.assign(cov_path=str(cov_path), cond=phenotype_name)
accumulated_pheno_df = pd.concat([accumulated_pheno_df, pheno_df], ignore_index=True)
accumulated_pheno_df = aggregate_phenotype_data(accumulated_pheno_df)
return accumulated_pheno_df
def process_trans_files(pheno_files, cov_files, phenotype_names, pheno_id_files, region_ids, customized_association_windows):
'''
Example output:
#chr start end ID Original_ID path cov_path cond
chr21 0 0 chr21_18133254_19330300 carnitine,benzoate,hippurate metabolon_1.bed.gz,metabolon_2.bed.gz covar_1.gz,covar_2.gz trait_A,trait_B
'''
if not os.path.isfile(customized_association_windows):
raise ValueError("Customized association analysis window must be specified for trans analysis.")
accumulated_pheno_df = pd.DataFrame()
pheno_id_files = [None] * len(pheno_files) if len(pheno_id_files) == 0 else pheno_id_files
genotype_windows = pd.read_csv(customized_association_windows, comment="#", header=None, names=["#chr","start","end","ID"], sep="\t")
genotype_windows = filter_by_region_ids(genotype_windows, region_ids)
if genotype_windows.empty:
return accumulated_pheno_df
for pheno_path, cov_path, phenotype_name, id_map_path in zip(pheno_files, cov_files, phenotype_names, pheno_id_files):
if not os.path.isfile(cov_path):
raise FileNotFoundError(f"No valid path found for file: {cov_path}")
pheno_df = pd.read_csv(pheno_path, sep=r"\s+", header=0, names=['Original_ID', 'path'])
if not pheno_df.empty:
pheno_df.iloc[:, -1] = adapt_file_path_all(pheno_df, pheno_df.columns[-1], f"{pheno_path:a}")
pheno_df = pheno_df.assign(cov_path=str(cov_path), cond=phenotype_name)
# Here we combine genotype_windows which contains "#chr" and "ID" to pheno_df by creating a cartesian product
pheno_df = pd.merge(genotype_windows.assign(key=1), pheno_df.assign(key=1), on='key').drop('key', axis=1)
# then set start and end columns to zero
pheno_df['start'] = 0
pheno_df['end'] = 0
if id_map_path is not None:
# Filter pheno_df by specific association-window and phenotype pairs
association_analysis_pair = pd.read_csv(id_map_path, sep=r'\s+', header=None, comment='#', names=['ID', 'Original_ID'])
pheno_df = pd.merge(pheno_df, association_analysis_pair, on=['ID', 'Original_ID'])
accumulated_pheno_df = pd.concat([accumulated_pheno_df, pheno_df], ignore_index=True)
accumulated_pheno_df = aggregate_phenotype_data(accumulated_pheno_df)
return accumulated_pheno_df
# Load genotype meta data
if f"{genoFile:x}" == ".bed":
geno_meta_data = pd.DataFrame([("chr"+str(x), f"{genoFile:a}") for x in range(1,23)] + [("chrX", f"{genoFile:a}")], columns=['#chr', 'geno_path'])
else:
geno_meta_data = pd.read_csv(f"{genoFile:a}", sep = r"\s+", header=0)
geno_meta_data.iloc[:, 1] = adapt_file_path_all(geno_meta_data, geno_meta_data.columns[1], f"{genoFile:a}")
geno_meta_data.columns = ['#chr', 'geno_path']
geno_meta_data['#chr'] = geno_meta_data['#chr'].apply(lambda x: str(x) if str(x).startswith('chr') else f'chr{x}')
# Checking the DataFrame
valid_chr_values = [f'chr{x}' for x in range(1, 23)] + ['chrX']
if not all(value in valid_chr_values for value in geno_meta_data['#chr']):
raise ValueError("Invalid chromosome values found. Allowed values are chr1 to chr22 and chrX.")
region_ids = []
# If region_list is provided, read the file and extract IDs
if region_list.is_file():
region_list_df = pd.read_csv(region_list, sep=r'\s+', header=None, comment = "#")
region_ids = region_list_df.iloc[:, -1].unique() # Extracting the last column for IDs
# If region_name is provided, include those IDs as well
# --region-name A B C will result in a list of ["A", "B", "C"] here
if len(region_name) > 0:
region_ids = list(set(region_ids).union(set(region_name)))
trans_analysis = False
if trans_analysis:
meta_data = process_trans_files(phenoFile, covFile, phenotype_names, phenoIDFile, region_ids, customized_association_windows)
else:
meta_data = process_cis_files(phenoFile, covFile, phenotype_names, phenoIDFile, region_ids, preload_id_map(phenoIDFile))
if not meta_data.empty:
meta_data = meta_data.merge(geno_meta_data, on='#chr', how='inner')
# Adjust association-window
if os.path.isfile(customized_association_windows):
print(f"Loading customized association analysis window from {customized_association_windows}")
association_windows_list = pd.read_csv(customized_association_windows, comment="#", header=None, names=["#chr","start","end","ID"], sep="\t")
meta_data = pd.merge(meta_data, association_windows_list, on=['#chr', 'ID'], how='left', suffixes=('', '_association'))
mismatches = meta_data[meta_data['start_association'].isna()]
if not mismatches.empty:
raise ValueError(f"{len(mismatches)} regions to analyze cannot be found in ``{customized_association_windows}``. Please check your ``{customized_association_windows}`` database to make sure it contains all association-window definitions. ")
else:
if cis_window < 0 :
raise ValueError("Please either input valid path to association-window file via ``--customized-association-windows``, or set ``--cis-window`` to a non-negative integer.")
if cis_window == 0:
print("Warning: only variants within the range of start and end of molecular phenotype will be considered since cis_window is set to zero and no customized association window file was found. Please make sure this is by design.")
meta_data['start_association'] = meta_data['start'].apply(lambda x: max(x - cis_window, 0))
meta_data['end_association'] = meta_data['end'] + cis_window
# Example meta_data:
# #chr start end start_association end_association ID Original_ID path cov_path cond coordinate geno_path
# 0 chr12 752578 752579 652578 852579 ENSG00000060237 Q9H4A3,P62873 protocol_example.protein_1.bed.gz,protocol_example.protein_2.bed.gz covar_1.gz,covar_2.gz trait_A,trait_B chr12:752578-752579 protocol_example.genotype.chr21_22.bed
# Create the final dictionary
regional_data = {
'data': [(row['geno_path'], *row['path'].split(','), *row['cov_path'].split(',')) for _, row in meta_data.iterrows()],
'meta_info': [(f"{row['#chr']}:{row['start']}-{row['end']}", # this is the phenotypic region to extract data from
f"{row['#chr']}:{row['start_association']}-{row['end_association']}", # this is the association window region
row['ID'], row['Original_ID'], *row['cond'].split(',')) for _, row in meta_data.iterrows()]
}
else:
regional_data = {'data':[], 'meta_info':[]}
[quantile_qtl_twas_weight]
depends: sos_variable("regional_data")
# Check if both 'data' and 'meta_info' are empty lists
stop_if(len(regional_data['data']) == 0, f'Either genotype or phenotype data are not available for region {", ".join(region_name)}.')
meta_info = regional_data['meta_info']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta_info"
output: f'{cwd:a}/{step_name}/{name}.{_meta_info[0].split(":")[0]}_{_meta_info[2]}.univariate_qr_twas_weights.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container
options(warn=1)
library(pecotmr)
library(qQTLR)
library(readr)
start_time_total <- proc.time()
phenotype_files = c(${",".join(['"%s"' % x.absolute() for x in _input[1:len(_input)//2+1]])})
covariate_files = c(${",".join(['"%s"' % x.absolute() for x in _input[len(_input)//2+1:]])})
conditions = c(${",".join(['"%s"' % x for x in _meta_info[4:]])})
region = ${("'%s'" % _meta_info[0]) if int(_meta_info[0].split('-')[-1])>0 else 'NULL'} # if the end position is zero return NULL
association_window = "${_meta_info[1]}"
extract_region_name = list(${",".join([("c('"+x+"')") if isinstance(x, str) else ("c"+ str(x)) for x in _meta_info[3]])})
phenotype_header = ${"4" if int(_meta_info[0].split('-')[-1])>0 else "1"}
region_name_col = ${"4" if int(_meta_info[0].split('-')[-1])>0 else "1"}
# extract subset of samples
keep_samples = NULL
if (${"TRUE" if keep_samples.is_file() else "FALSE"}) {
keep_samples = unlist(strsplit(readLines(${keep_samples:ar}), "\\s+"))
message(paste(length(keep_samples), "samples are selected to be loaded for analysis"))
}
# Load variant filter list if provided
keep_variants_list = NULL
keep_variants_per_context = FALSE
keep_variants_region_data = NULL
if (${"TRUE" if keep_variants.is_file() else "FALSE"}) {
keep_variants_path = ${keep_variants:ar}
# Read input: RDS or text/tsv
if (grepl("\\.rds$", keep_variants_path, ignore.case = TRUE)) {
keep_variants_data = readRDS(keep_variants_path)
# Ensure it's a data.frame/data.table
if (!is.data.frame(keep_variants_data)) {
# If RDS contains a vector, treat as simple variant list
keep_variants_list = as.character(keep_variants_data)
keep_variants_list = keep_variants_list[keep_variants_list != ""]
message(paste(length(keep_variants_list), "variants are specified to be kept for analysis (from RDS vector)"))
keep_variants_data = NULL
}
} else {
keep_variants_data = data.table::fread(keep_variants_path, header = TRUE)
}
if (!is.null(keep_variants_data)) {
# Fix column name if first column is "#chr"
if ("#chr" %in% names(keep_variants_data)) {
names(keep_variants_data)[names(keep_variants_data) == "#chr"] <- "chr"
}
if (ncol(keep_variants_data) == 1) {
# Single column: treat as simple variant_id list (old behavior)
keep_variants_list = trimws(as.character(keep_variants_data[[1]]))
keep_variants_list = keep_variants_list[keep_variants_list != ""]
message(paste(length(keep_variants_list), "variants are specified to be kept for analysis"))
} else if (all(c("context", "molecular_trait_object_id", "variant_id") %in% names(keep_variants_data))) {
# Multi-column format: filter by context and molecular_trait_object_id
# For genotype loading: use union of all matching variants across all contexts
current_region = "${_meta_info[2]}"
original_region_ids = c(${",".join([("'%s'" % x) if isinstance(x, str) else ",".join(["'%s'" % i for i in x]) for x in _meta_info[3]])})
all_region_ids = unique(c(current_region, original_region_ids))
current_contexts = conditions
keep_variants_region_data = keep_variants_data[
keep_variants_data$molecular_trait_object_id %in% all_region_ids &
keep_variants_data$context %in% current_contexts, ]
keep_variants_list = unique(as.character(keep_variants_region_data$variant_id))
keep_variants_list = keep_variants_list[keep_variants_list != ""]
# Flag for per-context filtering in the loop
keep_variants_per_context = TRUE
message(paste(length(keep_variants_list), "variants matched for region", current_region,
"across", length(current_contexts), "contexts (per-context filtering enabled)"))
if (length(keep_variants_list) == 0) {
message("No matching variants found for this region, marginal coef will be skipped for contexts without variants")
}
} else {
# Fallback: treat first column as variant_id list
keep_variants_list = trimws(as.character(keep_variants_data[[1]]))
keep_variants_list = keep_variants_list[keep_variants_list != ""]
message(paste(length(keep_variants_list), "variants are specified to be kept for analysis (from first column)"))
}
}
}
# Load regional association data
tryCatch({
fdat = load_regional_association_data(genotype = ${_input[0]:anr},
phenotype = phenotype_files,
covariate = covariate_files,
region = ${("'%s'" % _meta_info[0]) if int(_meta_info[0].split('-')[-1])>0 else 'NULL'}, # if the end position is zero return NULL
association_window = "${_meta_info[1]}",
conditions = conditions,
maf_cutoff = ${maf},
mac_cutoff = ${mac},
imiss_cutoff = ${imiss},
keep_indel = ${"TRUE" if indel else "FALSE"},
keep_samples = keep_samples,
extract_region_name = list(${",".join([("c('"+x+"')") if isinstance(x, str) else ("c"+ str(x)) for x in _meta_info[3]])}),
phenotype_header = ${"4" if int(_meta_info[0].split('-')[-1])>0 else "1"},
region_name_col = ${"4" if int(_meta_info[0].split('-')[-1])>0 else "1"},
scale_residuals = FALSE)
}, NoSNPsError = function(e) {
message("Error: ", paste(e$message, "${_meta_info[2] + '@' + _meta_info[1]}"))
saveRDS(list(${_meta_info[2]} = e$message), ${_output:ar}, compress='xz')
quit(save="no")
})
if (${"TRUE" if save_data else "FALSE"}) {
# save data object for debug purpose
saveRDS(list(${_meta_info[2]} = fdat), "${_output:ann}.univariate.rds", compress='xz')
}
# setup univariate analysis pipeline options
if ("${_meta_info[2]}" != "${_meta_info[3]}") {
region_name = c("${_meta_info[2]}", c(${",".join([("c('"+x+"')") if isinstance(x, str) else ("c"+ str(x)) for x in _meta_info[3]])}))
} else {
region_name = "${_meta_info[2]}"
}
region_info = list(region_coord=parse_region("${_meta_info[0]}"), grange=parse_region("${_meta_info[1]}"), region_name=region_name)
fitted = list()
condition_names = vector()
r = 1
while (r<=length(fdat$Y)) {
X <- fdat$X_data[[r]]
Y <- fdat$Y[[r]]
if (is.null(dim(Y))) {
Y <- matrix(Y, nrow = length(Y), ncol = 1)
}
colnames(Y) <- extract_region_name[[r]]
Z <- fdat$covar[[r]]
# Update condition names first
new_names = names(fdat$residual_Y)[r]
new_col_names = list(${",".join([("c('"+x+"')") if isinstance(x, str) else ("c"+ str(x)) for x in _meta_info[3]])})[[r]]
if (is.null(new_col_names)) {
new_col_names = 1:ncol(fdat$residual_Y[[r]])
}
if(!(identical(new_names, new_col_names))) {
new_names = paste(new_names, new_col_names, sep="_")
}
column_results <- lapply(1:ncol(Y), function(i) {
Y_col <- matrix(Y[,i], ncol=1)
colnames(Y_col) <- colnames(Y)[i]
# Determine per-context variant list
current_keep_variants = keep_variants_list
if (keep_variants_per_context && !is.null(keep_variants_region_data)) {
ctx_name = names(fdat$residual_Y)[r]
ctx_filtered = keep_variants_region_data[keep_variants_region_data$context == ctx_name, ]
if (nrow(ctx_filtered) > 0) {
current_keep_variants = unique(as.character(ctx_filtered$variant_id))
current_keep_variants = current_keep_variants[current_keep_variants != ""]
message(paste(length(current_keep_variants), "variants for context", ctx_name))
} else {
current_keep_variants = character(0)
message(paste("No context-specific variants for", ctx_name, "- skipping marginal coef fitting"))
}
}
qr_results = quantile_twas_weight_pipeline(
X = X,
Y = Y_col,
Z = Z,
ld_reference_meta_file=${('"%s"' % ld_reference_meta_file) if not ld_reference_meta_file.is_dir() else "NULL"},
maf = fdat$maf[[r]],
twas_maf_cutoff = ${min_twas_maf},
region_id = paste0(colnames(Y_col), "_", names(fdat$residual_Y)[r]),
quantile_qtl_tau_list = seq(0.05, 0.95, by = 0.05),
quantile_twas_tau_list = seq(0.01, 0.99, by = 0.01),
screen_method = "${screen_method}",
screen_threshold = ${screen_threshold},
screen_significant = ${"TRUE" if screen_significant else "FALSE"},
pre_filter_by_pqr = ${"TRUE" if pre_filter_by_pqr else "FALSE"},
initial_corr_filter_cutoff = ${initial_corr_filter_cutoff},
full_rank_corr_filter_cutoff = ${full_rank_corr_filter_cutoff},
keep_variants = current_keep_variants,
marginal_beta_calculate = ${"TRUE" if marginal_beta_calculate else "FALSE"},
twas_weight_calculate = ${"TRUE" if twas_weight_calculate else "FALSE"},
qrank_screen_calculate = ${"TRUE" if qrank_screen_calculate else "FALSE"},
vqtl_calculate = ${"TRUE" if vqtl_calculate else "FALSE"}
)
if (!is.null(qr_results$message)) {
message(qr_results$message)
}
qr_results$region_info = region_info
qr_results$maf = fdat$maf[[r]]
return(qr_results)
})
fitted <- c(fitted, column_results)
condition_names <- c(condition_names, new_names)
if (length(new_names) > 0) {
message("Analysis completed for: ", paste(new_names, collapse=","))
}
# Release memory
fdat$residual_X[[r]] <- NA
fdat$residual_Y[[r]] <- NA
r = r + 1
}
# Set names for the final results
if (length(fitted) > 0) {
names(fitted) <- condition_names
}
saveRDS(list("${_meta_info[2]}" = fitted), ${_output:ar}, compress='xz')
end_time_total <- proc.time()
total_time <- end_time_total - start_time_total
print(total_time)