QTL Association Testing

QTL Association Testing#

Description#

We perform QTL association testing using TensorQTL [cf. Taylor-Weiner et al (2019)].

Input#

List of molecular phenotype files: a list of bed.gz files containing the table for the molecular phenotype. It should have a companion index file in tbi format. It is the output of gene_annotation or phenotype_by_chorm
List of genotypes in both PLINK binary format (bed/bim/fam) and PLINK 2 binary genotype table (pgen/pvar/psam) for each chromosome, previously processed through our genotype QC pipelines.
Covariate file, a file with #id + samples name as colnames and each row a covariate: fixed and known covariates as well as hidden covariates recovered from factor analysis.
Optionally, a list of traits (genes, regions of molecular features etc) to analyze.

Example phenotype list#

The header of the bed.gz is per the TensorQTL convention:

Phenotypes must be provided in BED format, sorted by chromosome and [start,end] position, with a single header line starting with # and the first four columns corresponding to: chr, start, end, phenotype_id, with the remaining columns corresponding to samples (the identifiers must match those in the genotype input). The BED file should specify the cis-window (usually the TSS), with start = the minimum start for each gene, end = the maximum end for each gene(extracted from phenotype referenced gtf file).

#chr	start	end	ID	path
chr1	631073	632616	ENSG00000237973	/home/rl3328/motor_data/pheno_data/Quad/phenotype_by_chrom/genes_logcpm.bed.chr1.bed.gz
chr1	633695	634376	ENSG00000248527	/home/rl3328/motor_data/pheno_data/Quad/phenotype_by_chrom/genes_logcpm.bed.chr1.bed.gz
chr1	825137	859446	ENSG00000228794	/home/rl3328/motor_data/pheno_data/Quad/phenotype_by_chrom/genes_logcpm.bed.chr1.bed.gz
chr1	944202	959309	ENSG00000188976	/home/rl3328/motor_data/pheno_data/Quad/phenotype_by_chrom/genes_logcpm.bed.chr1.bed.gz
chr1	975203	982093	ENSG00000187642	/home/rl3328/motor_data/pheno_data/Quad/phenotype_by_chrom/genes_logcpm.bed.chr1.bed.gz
chr1	1216908	1232067	ENSG00000078808	/home/rl3328/motor_data/pheno_data/Quad/phenotype_by_chrom/genes_logcpm.bed.chr1.bed.gz

For cis-analysis:

Optionally, a list of genomic regions associate with each molecular features to analyze. The default cis-analysis will use a window around TSS. This can be customized to take given start and end genomic coordinates. we currently suggest using 1Mb window around a gene because longer customized cis-windows (such as extending by TAD) does not yield significant improvements.

For trans-analysis:

Computational strategy designed for trans analysis:

Trans analysis faces significant memory challenges as we calculate all associations between all molecular traits × all genetic variants across the genome, creating a massive computational burden. To address this challenge, we implement a two-stage chromosome-based parallelization approach:

Stage 1 (trans_1): Chromosome-based parallelization

Phenotype data is processed per chromosome (e.g., 22 separate jobs for autosomes)
For each phenotype chromosome, we test associations against variants from all 22 chromosomes
This creates phenotype_chr × genotype_chr combinations (e.g., phenotype chr1 vs genotype chr1-22); Garbage was collected between each chromosome combination caculation to release memory
Results are combined across all chromosome combinations and saved as compressed files

Stage 2 (trans_2): Significance filtering

Supports p-value cutoffs (--pvalue-cutoff) or q-value cutoffs (--qvalue-cutoff)

Output#

For each chromosome, several summary statistics files are generated, including both nominal test statistics for each test and region (gene) level association evidence.

Nominal Association Results#

The columns of the nominal association result are as follows:

chrom: Variant chromosome.
pos: Variant chromosomal position (basepairs).
molecular_trait_id: Molecular trait identifier (gene).
variant_id: ID of the variant (rsid or chr:position:ref:alt).
tss_distance: Distance of the SNP to the gene transcription start site (TSS).
tes_distance: Distance of the SNP to the gene transcription end site (TES).
cis_window_start_distance: Distance of the SNP to the start of the cis window (if using a customized cis window).
cis_window_end_distance: Distance of the SNP to the end of the cis window (if using a customized cis window).
af: The allele frequency of this SNP.
ma_samples: Number of samples carrying the minor allele.
ma_count: Total number of minor alleles across individuals.
pvalue: Nominal P-value from linear regression.
bhat: Slope of the linear regression.
sebhat: Standard error of bhat.
n: Number of phenotypes after basic QC.

Multiple Testing Corrected Results:#

qvalue: Calculated q-value for each SNP (grouped by gene).

Interaction Association Results#

The columns of interaction association results are as follows (FIXME):

Model:
$$ \text{phenotype} = \beta_0 + \beta_1 \cdot \text{snp} + \beta_2 \cdot \text{msex} + \beta_3 \cdot (\text{snp} \times \text{msex}) + \epsilon $$

(Taking msex as the interaction factor)

chrom: Chromosome number.
pos: Variant chromosomal position (basepairs).
a2: Variant reference allele (A, C, T, or G).
a1: Variant alternate allele.
molecular_trait_id: Molecular trait identifier, varies from phenotypes to phenotypes.
variant_id: ID of the top variant (rsid or chr:position:ref:alt).
af: Alternative allele frequency in the MiGA cohort.
ma_samples: Number of samples carrying the minor allele.
ma_count: Total number of minor alleles across individuals.
pvalue: P-value of the main effect from the nonlinear regression.
bhat: Slope of the main effect from the nonlinear regression.
se: Standard error of beta.
pvalue_msex: P-value of the msex term from the nonlinear regression.
bhat_msex: Slope of the msex term from the nonlinear regression.
se_msex: Standard error of bhat_msex.
pvalue_msex_interaction: P-value of the interaction term from the nonlinear regression.
bhat_msex_interaction: Slope of the interaction term from the nonlinear regression.
se_msex_interaction: Standard error of beta_msex_interaction.
molecular_trait_object_id: An intermediate ID (can be ignored).
n: Number of samples.

Multiple Testing Corrected Results:#

qvalue_main: The q-value of the main effect.
qvalue_interaction: The q-value of the interaction effect.

Region (Gene) Level Association Evidence#

The column specifications for region-level association evidence are as follows:

chrom: Chromosome number.
pos: Variant chromosomal position (basepairs).
n_variant: Total number of variants tested in cis.
beta_shape1: First parameter value of the fitted beta distribution.
beta_shape2: Second parameter value of the fitted beta distribution.
true_df: Effective degrees of freedom of the beta distribution approximation.
p_true_df: Empirical P-value for the beta distribution approximation.
variant_id: ID of the top variant (rsid or chr:position:ref:alt).
tss_distance: Distance of the SNP to the gene transcription start site (TSS).
tes_distance: Distance of the SNP to the gene transcription end site (TES).
ma_samples: Number of samples carrying the minor allele.
ma_count: Total number of minor alleles across individuals.
af: Alternative allele frequency.
p_nominal: Nominal P-value from linear regression.
bhat: Slope of the linear regression.
sehat: Standard error of the bhat.
p_perm: First permutation P-value directly obtained from the permutations with the direct method.
p_beta: Second permutation P-value obtained via beta approximation (this is the one to use for downstream analysis).
molecular_trait_object_id: Molecular trait identifier (gene).
n_traits: Group size in the permutation test.
genomic_inflation: Genomic inflation factor (lambda), quantifying the extent of bulk inflation and the excess false positive rate.

Multiple Testing Corrected Results:#

q_beta: Q-value for p_beta using Storey’s method (qvalue), more conservative than FDR.
q_perm: Q-value for p_perm using Storey’s method (qvalue), more conservative than FDR.
fdr_beta: Adjusted P-value for p_beta using the Benjamini-Hochberg method (FDR).
fdr_perm: Adjusted P-value for p_perm using the Benjamini-Hochberg method (FDR).
p_nominal_threshold: Nominal p-value threshold for variants in the corresponding molecular trait, derived from empirical beta distribution as a result of permutation testing.

Minimal Working Example Steps#

The data can be found on Synapse.

i. Cis TensorQTL Command#

sos run pipeline/TensorQTL.ipynb cis \
    --genotype-file output/genotype_by_chrom/wgs.merged.plink_qc.genotype_by_chrom_files.txt \
    --phenotype-file output/phenotype/phenotype_by_chrom/bulk_rnaseq.phenotype_by_chrom_files.txt \
    --covariate-file output/covariate/covariates.wgs.merged.plink_qc.plink_qc.prune.pca.gz \
    --customized-cis-windows reference_data/TAD/TADB_enhanced_cis.bed \
    --cwd output/tensorqtl_cis/ \
    --MAC 5 

INFO: Running [32mcis_1[0m: 
INFO: [32mcis_1[0m (index=6) is [32mcompleted[0m.
INFO: [32mcis_1[0m (index=7) is [32mcompleted[0m.
INFO: [32mcis_1[0m (index=3) is [32mcompleted[0m.
INFO: [32mcis_1[0m (index=5) is [32mcompleted[0m.
INFO: [32mcis_1[0m (index=2) is [32mcompleted[0m.
INFO: [32mcis_1[0m (index=1) is [32mcompleted[0m.
INFO: [32mcis_1[0m (index=0) is [32mcompleted[0m.
INFO: [32mcis_1[0m (index=4) is [32mcompleted[0m.
INFO: [32mcis_1[0m (index=11) is [32mcompleted[0m.
INFO: [32mcis_1[0m (index=13) is [32mcompleted[0m.
INFO: [32mcis_1[0m (index=9) is [32mcompleted[0m.
INFO: [32mcis_1[0m (index=8) is [32mcompleted[0m.
INFO: [32mcis_1[0m (index=12) is [32mcompleted[0m.
INFO: [32mcis_1[0m (index=15) is [32mcompleted[0m.
INFO: [32mcis_1[0m (index=10) is [32mcompleted[0m.
INFO: [32mcis_1[0m (index=14) is [32mcompleted[0m.
INFO: [32mcis_1[0m (index=19) is [32mcompleted[0m.
INFO: [32mcis_1[0m (index=17) is [32mcompleted[0m.
INFO: [32mcis_1[0m (index=16) is [32mcompleted[0m.
INFO: [32mcis_1[0m (index=20) is [32mcompleted[0m.
INFO: [32mcis_1[0m (index=21) is [32mcompleted[0m.
INFO: [32mcis_1[0m (index=18) is [32mcompleted[0m.
INFO: [32mcis_1[0m output:   [32m/restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/tensorqtl_cis/bulk_rnaseq.chr20.cis_qtl_pairs.20.parquet /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/tensorqtl_cis/bulk_rnaseq.chr20_chr20.cis_qtl.pairs.tsv.gz... (66 items in 22 groups)[0m
INFO: Running [32mcis_2[0m: 
INFO: [32mcis_2[0m is [32mcompleted[0m.
INFO: [32mcis_2[0m output:   [32moutput/tensorqtl_cis/bulk_rnaseq.cis_qtl_regional_significance.tsv.gz output/tensorqtl_cis/bulk_rnaseq.cis_qtl_regional_significance.summary.txt[0m
INFO: Workflow cis (ID=wf24d8ec17aef888e) is executed successfully with 2 completed steps and 23 completed substeps.

ii. Trans TensorQTL Command#

sos run pipeline/TensorQTL.ipynb trans \
    --genotype-file data/wgs.merged.plink_qc.genotype_trans_files.txt \
    --phenotype-file output/phenotype/phenotype_by_chrom_for_trans/bulk_rnaseq.phenotype_by_chrom_files.txt \
    --region-list data/combined_AD_genes.csv \
    --region-list-phenotype-column 4 \
    --covariate-file output/covariate/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.covariates.wgs.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \
    --cwd output/tensorqtl_trans/ \
    --MAC 5 

Troubleshooting#

Step	Substep	Problem	Possible Reason	Solution

Command Interface#

sos run TensorQTL.ipynb -h

usage: sos run TensorQTL.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  cis
  trans

Global Workflow Options:
  --cwd output (as path)
                        Path to the work directory of the analysis.
  --phenotype-file VAL (as path, required)
                        Phenotype file, or a list of phenotype per region.
  --genotype-file VAL (as path, required)
                        A genotype file in PLINK binary format (bed/bam/fam)
                        format, or a list of genotype per chrom
  --covariate-file VAL (as path, required)
                        Covariate file
  --name  f"{phenotype_file:bn}_{covariate_file:bn}"

                        Prefix for the analysis output
  --region-list . (as path)
                        An optional subset of regions of molecular features to
                        analyze. The last column is the gene names
  --region-list-phenotype-column 4 (as int)
  --keep-sample . (as path)
                        Set list of sample to be keep
  --interaction ''
                        FIXME: please document
  --customized-cis-windows . (as path)
                        An optional list documenting the custom cis window for
                        each region to analyze, with four column, chr, start,
                        end, region ID (eg gene ID). If this list is not
                        provided, the default `window` parameter (see below)
                        will be used.
  --phenotype-group . (as path)
                        The phenotype group file to group molecule_trait into
                        molecule_trait_object This applies to multiple molecular
                        events in the same region, such as sQTL analysis.
  --chromosome  (as list)
                        The name of phenotype corresponding to gene_id or
                        gene_name in the region
  --MAC 0 (as int)
                        Minor allele count cutoff
  --window 1000000 (as int)
                        Specify the cis window for the up and downstream radius
                        to analyze around the region of interest in units of bp
                        This parameter will be set to zero if
                        `customized_cis_windows` is provided.
  --numThreads 8 (as int)
                        Number of threads
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 12h
  --mem 16G
  --container ''
                        Container option for software to run the analysis:
                        docker or singularity
  --entrypoint  ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""

  --maf-threshold  MAC/(2.0*N) 

                        Minor allele frequency cutoff. It will overwrite minor
                        allele cutoff. You may consider setting it to higher for
                        interaction analysis if you have statistical power
                        concerns

Sections
  cis_1:
    Workflow Options:
      --[no-]skip-nominal-if-exist (default to False)
                        parse input file lists skip nominal association results
                        if the files exists already This is false by default
                        which means to recompute everything This is only
                        relevant when the `parquet` files for nominal results
                        exist but not the other files and you want to avoid
                        computing the nominal results again
      --[no-]permutation (default to True)
  cis_2:
  trans_1:
    Workflow Options:
      --batch-size 50000 (as int)
      --pval-threshold 1.0 (as float)
      --[no-]permutation (default to False)
                        Permutation testing is incorrect when the analysis is
                        done by chrom

Old Minimal working example#

An MWE is uploaded to google drive. The singularity image (sif) for running this MWE is uploaded to google drive

FIXME: need to update these links.

FIXME: Also need to update the example commands below using our new example dataset.

sos run pipeline/TensorQTL.ipynb cis \
    --genotype-file plink_files_list.txt \
    --phenotype-file MWE.bed.recipe \
    --covariate-file ALL.covariate.pca.BiCV.cov.gz \
    --cwd ./output/ \
    --MAC 5

sos run pipeline/TensorQTL.ipynb trans \
    --genotype-file plink_files_list.txt \
    --phenotype-file MWE.bed.recipe \
    --covariate-file ALL.covariate.pca.BiCV.cov.gz \
    --cwd ./output/ \
    --MAC 5 --region-name  gene_name

Setup and global parameters#

[global]
# Path to the work directory of the analysis.
parameter: cwd = path('output')
# Phenotype file, or a list of phenotype per region.
parameter: phenotype_file = path
# A genotype file in PLINK binary format (bed/bam/fam) format, or a list of genotype per chrom
parameter: genotype_file = path
# Covariate file
parameter: covariate_file = path
# Optional pattern to filter covariates (list of covariate prefixes or exact names)
parameter: covariate_pattern = []
# Prefix for the analysis output
parameter: name = ""
# An optional subset of regions of molecular features to analyze. The last column is the gene names
parameter: region_list = path()
parameter: region_list_phenotype_column = 4
# Set list of sample to be keep
parameter: keep_sample = path()
# FIXME: please document
parameter: interaction = ""

# An optional list documenting the custom cis window for each region to analyze, with four column, chr, start, end, region ID (eg gene ID).
# If this list is not provided, the default `window` parameter (see below) will be used.
parameter: customized_cis_windows = path()

# The phenotype group file to group molecule_trait into molecule_trait_object
# This applies to multiple molecular events in the same region, such as sQTL analysis.
parameter: phenotype_group = path() 

# The name of phenotype corresponding to gene_id or gene_name in the region
parameter: chromosome = []
# Minor allele count cutoff
parameter: MAC = 0

# Specify the cis window for the up and downstream radius to analyze around the region of interest in units of bp
# This parameter will be set to zero if `customized_cis_windows` is provided.
parameter: window = 1000000

# Number of threads
parameter: numThreads = 8
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
# Container option for software to run the analysis: docker or singularity
parameter: container = ''
import re

# Use the header of the covariate file to decide the sample size
import pandas as pd
N = len(pd.read_csv(covariate_file, sep = "\t",nrows = 1).columns) - 1

# Minor allele frequency cutoff. It will overwrite minor allele cutoff.
# You may consider setting it to higher for interaction analysis if you have statistical power concerns
parameter: maf_threshold = MAC/(2.0*N)

# Filtering significant trans associations (for trans_2 workflow)
parameter: pvalue_cutoff = "5e-8"
parameter: qvalue_cutoff = ""


import os
import pandas as pd

def adapt_file_path(file_path, reference_file):
    """
    Adapt a single file path based on its existence and a reference file's path.

    Args:
    - file_path (str): The file path to adapt.
    - reference_file (str): File path to use as a reference for adaptation.

    Returns:
    - str: Adapted file path.

    Raises:
    - FileNotFoundError: If no valid file path is found.
    """
    reference_path = os.path.dirname(reference_file)

    # Check if the file exists
    if os.path.isfile(file_path):
        return file_path

    # Check file name without path
    file_name = os.path.basename(file_path)
    if os.path.isfile(file_name):
        return file_name

    # Check file name in reference file's directory
    file_in_ref_dir = os.path.join(reference_path, file_name)
    if os.path.isfile(file_in_ref_dir):
        return file_in_ref_dir

    # Check original file path prefixed with reference file's directory
    file_prefixed = os.path.join(reference_path, file_path)
    if os.path.isfile(file_prefixed):
        return file_prefixed

    # If all checks fail, raise an error
    raise FileNotFoundError(f"No valid path found for file: {file_path}")

def adapt_file_path_all(df, column_name, reference_file):
    return df[column_name].apply(lambda x: adapt_file_path(x, reference_file))


if (str(genotype_file).endswith("bed") or str(genotype_file).endswith("pgen")) and str(phenotype_file).endswith("bed.gz"):
    input_files = [[phenotype_file, genotype_file]]
    if len(chromosome) > 0:
        input_chroms = [int(x) for x in chromosome]
    else:
        input_chroms = [0]
else:
    import pandas as pd
    import os
    molecular_pheno_files = pd.read_csv(phenotype_file, sep = "\t")
    
    if "#dir" in molecular_pheno_files.columns and "#chr" not in molecular_pheno_files.columns:
        molecular_pheno_files = molecular_pheno_files.rename(columns={"#dir": "path"})
        if "#id" in molecular_pheno_files.columns:
            molecular_pheno_files = molecular_pheno_files.rename(columns={"#id": "#chr"})
    
    if "#chr" in molecular_pheno_files.columns:
        molecular_pheno_files = molecular_pheno_files.groupby(['#chr','path']).size().reset_index(name='count').drop("count",axis = 1).rename(columns = {"#chr":"#id"})
    genotype_files = pd.read_csv(genotype_file,sep = "\t")
    genotype_files["#id"] = [x.replace("chr","") for x in genotype_files["#id"].astype(str)] # e.g. remove chr1 to 1
    genotype_files["#path"] = genotype_files["#path"].apply(lambda x: adapt_file_path(x, genotype_file))
    molecular_pheno_files["#id"] = [x.replace("chr","") for x in molecular_pheno_files["#id"].astype(str)]
    input_files = molecular_pheno_files.merge(genotype_files, on = "#id")
    
    # Only keep chromosome specified in --chromosome
    if len(chromosome) > 0:
        input_files = input_files[input_files['#id'].isin(chromosome)]
    input_files = input_files.values.tolist()
    input_chroms = [x[0] for x in input_files]
    input_files = [x[1:] for x in input_files]
    if len(name) == 0:
        name = f'{path(input_files[0][0]):bnn}' if len(input_files) == 1 else f'{path(input_files[0][0]):bnnn}'

cis-xQTL association testing#

[cis_1]
# parse input file lists
# skip nominal association results if the files exists already
# This is false by default which means to recompute everything
# This is only relevant when the `parquet` files for nominal results exist but not the other files
# and you want to avoid computing the nominal results again
parameter: skip_nominal_if_exist = False
parameter: permutation = True

# Extract interaction name
var_interaction = interaction
if os.path.isfile(interaction):
    interaction_s = pd.read_csv(interaction, sep='\t', index_col=0)
    var_interaction = interaction_s.columns[0] # interaction name
test_regional_association = permutation and len(var_interaction) == 0

input: input_files, group_by = len(input_files[0]), group_with = "input_chroms"
output_files = dict([("parquet", f'{cwd:a}/{_input[0]:bnn}{"_%s" % var_interaction if interaction else ""}.cis_qtl_pairs.{"" if input_chroms[_index] == 0 else input_chroms[_index]}.parquet'), # This convention is necessary to match the pattern of map_norminal output
                     ("nominal", f'{cwd:a}/{_input[0]:bnn}{"" if input_chroms[_index] == 0 else "_chr%s" % input_chroms[_index]}{"_%s" % var_interaction if interaction else ""}.cis_qtl.pairs.tsv.gz')])
if test_regional_association:
    output_files["regional"] = f'{cwd:a}/{_input[0]:bnn}{"" if input_chroms[_index] == 0 else "_chr%s" % input_chroms[_index]}{"_%s" % var_interaction if interaction else ""}.cis_qtl.regional.tsv.gz'
output: output_files
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output["nominal"]:bnnn}'
python: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout' , container = container
    import pandas as pd
    import numpy as np
    import os
    import tensorqtl
    from tensorqtl import genotypeio, cis
    from scipy.stats import chi2
    import multiprocessing as mp
    from tqdm import tqdm

    ## Define paths
    plink_prefix_path = $[_input[1]:nar]
    expression_bed = $[_input[0]:ar]

    covariates_file = "$[covariate_file:a]"
    window = $[window]
    interaction = "$[interaction]"
    ## Load Data
    phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
    phenotype_id = phenotype_pos_df.index.name

    ## Analyze only the regions listed
    if $[region_list.is_file()]:
        region = pd.read_csv("$[region_list:a]", comment="#", header=None, sep="\t" )
        phenotype_column = 1 if len(region.columns) == 1 else  $[region_list_phenotype_column]
        keep_region = region.iloc[:,phenotype_column-1].to_list()
        phenotype_df = phenotype_df[phenotype_df.index.isin(keep_region)]
        phenotype_pos_df = phenotype_pos_df[phenotype_pos_df.index.isin(keep_region)]
    covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T
    genotype_df, variant_df = genotypeio.load_genotypes(plink_prefix_path, dosages = True)

    
    ## use custom sample list to subset the covariates data
    if $[keep_sample.is_file()]:
        sample_list = pd.read_csv("$[keep_sample:a]", comment="#", header=None, names=["sample_id"], sep="\t")
        covariates_df.loc[sample_list.sample_id]

    # Read interaction files or extract from covariates file.
    var_interaction = interaction
    interaction_s = []
    if os.path.isfile(interaction):
        # update var_interaction and interaction_s
        interaction_s = pd.read_csv(interaction, sep='\t', index_col=0)
        interaction_s = interaction_s[interaction_s.index.isin(covariates_df.index)] 
        var_interaction = interaction_s.columns[0] # interaction name
    # check if the interaction term in interaction table is in covariates file, if yes and interaction_s not yet loaded then, extract it out from covariates file
    if var_interaction in covariates_df.columns:
        # only load from covariate if it has not been loaded yet
        if len(interaction_s) == 0:
            interaction_s = covariates_df[var_interaction].to_frame()
        covariates_df = covariates_df.drop(columns=[var_interaction])
    if len(interaction) and len(interaction_s) == 0:
        raise ValueError(f"Cannot find interaction variable or file {interaction}")

    # drop samples that with missing value in iteraction
    if len(interaction_s):
        interaction_s = interaction_s.dropna() 

    ## Retaining only common samples
    phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, covariates_df.index)]
    phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, genotype_df.columns)] 

    if len(interaction_s):
        phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, interaction_s.index)]
        interaction_s = interaction_s[interaction_s.index.isin(phenotype_df.columns)]    
        interaction_s = interaction_s.loc[phenotype_df.columns]

    covariates_df = covariates_df.transpose()[np.intersect1d(phenotype_df.columns, covariates_df.index)].transpose()
    
    ## To simplify things, there should really not be "chr" prefix
    phenotype_pos_df.chr = phenotype_pos_df.chr.astype(str).str.replace("chr", "")
    variant_df.chrom =  variant_df.chrom.astype("str").str.replace("chr", "") 

    ## use custom cis windows list
    if $[customized_cis_windows.is_file()]:
        cis_list = pd.read_csv("$[customized_cis_windows:a]", comment="#", header=None, names=["chr","start","end",phenotype_id], sep="\t")
        cis_list.chr = cis_list.chr.astype(str).str.replace("chr", "")  ## Again to simplify things for chr format concordance.
        if cis_list[['chr', 'ID']].duplicated().sum() != 0: # if cis_list is not unique using identifier ['#chr', 'ID']
                cis_list = cis_list.groupby(['ID', 'chr']).agg({ # use union start-end position and make cis_list unique
                    'start': 'min',
                    'end': 'max'
                }).reset_index()

                cis_list = cis_list[['chr', 'start', 'end', 'ID']]
                #cis_list = cis_list.set_index('chr')
        
        phenotype_pos_df = phenotype_pos_df.reset_index() #move the phenotype id index to a new column of the dataframe
        phenotype_df = phenotype_df.reset_index()
        # Ensure phenotype_pos_df is a subset of cis_list based on ['#chr', 'ID']
        original_count = len(phenotype_pos_df)
        phenotype_pos_df = phenotype_pos_df[phenotype_pos_df.set_index(['chr', 'ID']).index.isin(cis_list.set_index([cis_list.columns[0],cis_list.columns[3]]).index)] 
        phenotype_df = phenotype_df[phenotype_df.set_index('ID').index.isin(cis_list.set_index(cis_list.columns[3]).index)] 
        phenotype_df = phenotype_df.set_index('ID')
        removed_count = original_count - len(phenotype_pos_df)
        print(f"{removed_count} rows were removed from phenotype_pos_df")

        # Merge the dataframes on 'chr' and 'ID', including 'start' and 'end'
        phenotype_pos_df = phenotype_pos_df.merge(cis_list[['chr', 'ID', 'start', 'end']], 
                                        left_on = ["chr",phenotype_id],
                                        right_on = [cis_list.columns[0],cis_list.columns[3]],
                                        suffixes=('_pheno', ''))

        # Function to decide whether to keep or rename columns: 
        # If 'start' and 'end' values are the same in both dataframes, we keep the original columns without suffixes; 
        # If they're different, we name columns from cis_list with '_cis' suffixes and keep the name still for columns from phentype_pos_df.
        # #in some cases (gene expression for eQTLs) the phenotype_id may be in the cis_list file
        def rename_if_different(row, col):
            if row[f'{col}'] == row[f'{col}_pheno']:
                return row[col]
            else:
                return pd.Series({f'{col}_pheno': row[f'{col}_pheno'], f'{col}_cis': row[col]})
        
        # Apply the renaming logic
        for col in ['start', 'end']:
            if f'{col}_pheno' in phenotype_pos_df.columns:# this condition is to not execute when the original phenotype_pos_df don't have start, end but pos column.
                temp = phenotype_pos_df.apply(lambda row: rename_if_different(row, col), axis=1)
                phenotype_pos_df = phenotype_pos_df.drop(columns=[f'{col}_pheno']).join(temp)
                print(f"Dropped columns due to value mismatch: {col}_pheno")
        
        phenotype_pos_df = phenotype_pos_df.set_index(phenotype_id)[["chr","start","end"]] # The final phenotype_pos_df will have three columns(chr, start, end) and index is the phenotype ID
        
        if len(phenotype_df.index) != len(phenotype_pos_df.index):
            raise ValueError("cannot uniquely match all the phentoype data in the input to the customized cis windows provided")
        window = 0 # In the updated tensorQTL, by default if there is a customized cis window, the actual cis window will be start - window & end + window, so it is necessary to change the window parameter to 0

    ## Read phenotype group if availble
    if $[phenotype_group.is_file()]:
        group_s = pd.read_csv($[phenotype_group:r], sep='\t', header=None, index_col=0).squeeze()
    else:
        group_s = None

    ## cis-QTL mapping: nominal associations for all variant-phenotype pairs
    if not ($[skip_nominal_if_exist] and $[_output["parquet"].is_file()]):
        if len(interaction_s):
            cis.map_nominal(genotype_df, variant_df, 
                    phenotype_df, 
                    phenotype_pos_df, 
                    $[_output["parquet"]:nnnr],
                    covariates_df=covariates_df,
                    interaction_df=interaction_s, 
                    maf_threshold_interaction=$[maf_threshold],
                    window=window,
                    group_s=group_s,
                    run_eigenmt=True)
        else:
            cis.map_nominal(genotype_df, variant_df,
                phenotype_df,
                phenotype_pos_df,
                $[_output["parquet"]:nnnr],
                covariates_df=covariates_df, 
                window=window, 
                maf_threshold=$[maf_threshold],
                run_eigenmt=$['False' if permutation else 'True'],
                group_s=group_s)

    ## Load the parquet and save it as txt
    pairs_df = pd.read_parquet($[_output["parquet"]:r])
    ## Remove rows whose 'pval_gi' is null for following t_pval_conversion
    if len(interaction_s):
        pairs_df = pairs_df.dropna(subset=['pval_gi'])
    # print general information of parquet
    print('Output Information:')
    print("This is the file containing the immediate output of TensorQTL's map_nominal function ")
    print(os.path.getsize($[_output["parquet"]:r]))

    ## Adds the group columns to pairs_df, if there is group_s use group_s, else use phenotype_id
    if group_s is not None:
        pairs_df = pairs_df.merge(pd.DataFrame( {"molecular_trait_object_id": group_s}),left_on = "phenotype_id", right_index = True)
    else:
        pairs_df["molecular_trait_object_id"] = pairs_df.phenotype_id
    ## if pos in phenotype_pos_df(start distance and end distance are the same), 
    ## add the column 'end_distance' with the same value as 'start_distance' to avoid mismatch of column and names
    if 'end_distance' not in pairs_df.columns:
        # Get the position of 'start_distance' column
        start_pos = pairs_df.columns.get_loc('start_distance')
        # Create a new DataFrame with the same values as start_distance
        new_df = pairs_df.copy()
        # Insert end_distance column after start_distance
        new_df.insert(start_pos + 1, 'end_distance', pairs_df['start_distance'])
        pairs_df = new_df

    # rename columns
    column_map = {'phenotype_id': 'molecular_trait_id'}
   
    if len(interaction_s):
        # calculate genomic inflation factor lambda on interaction 
        lambda_col_interaction = pairs_df.groupby("molecular_trait_object_id").apply(lambda x: chi2.ppf(1. - np.median(x.pval_gi), 1)/chi2.ppf(0.5,1))
        column_map.update({
            'pval_g': 'pvalue', 'b_g': 'bhat', 'b_g_se': 'se',
            'pval_i': f'pvalue_{var_interaction}',
            'b_i': f'bhat_{var_interaction}',
            'b_i_se': f'sebhat_{var_interaction}',
            'pval_gi': f'pvalue_{var_interaction}_interaction',
            'b_gi': f'bhat_{var_interaction}_interaction',
            'b_gi_se': f'sebhat_{var_interaction}_interaction'
        })
    else:
        column_map.update({'pval_nominal': 'pvalue', 'slope': 'bhat', 'slope_se': 'sebhat'})
    if $[customized_cis_windows.is_file()]:
        column_map.update({'start_distance': 'cis_window_start_distance', 'end_distance': 'cis_window_end_distance'})
    else:
        column_map.update({'start_distance': 'tss_distance', 'end_distance': 'tes_distance'})
    pairs_df.rename(columns=column_map, inplace=True)
    pairs_df["n"] = len(phenotype_df.columns.values)
    pairs_df = variant_df.merge(pairs_df, right_on='variant_id', left_index=True)
    pairs_df.rename(columns={'a1': 'a2', 'a0': 'a1'}, inplace=True)
    # sort the table if chrom and pos is not in ascending order
    if not all(pairs_df['pos'].iloc[i] <= pairs_df['pos'].iloc[i+1] for i in range(len(pairs_df)-1)):
        pairs_df = pairs_df.sort_values(by=['chrom', 'pos'])
    # save file
    pairs_df.to_csv($[_output["nominal"]:nr], sep='\t', index = None)
    # print general information of pairs_df
    print('Output Information:')
    print("Output Rows:", len(pairs_df))
    print("Output Columns:", pairs_df.columns.tolist())
    print("Output Preview:", pairs_df.iloc[1:5, 1:10])

    if $[test_regional_association]:
        # calculate genomic inflation factor lambda for main variant effect 
        lambda_col = pairs_df.groupby("molecular_trait_object_id").apply(lambda x: chi2.ppf(1. - np.median(x.pvalue), 1)/chi2.ppf(0.5,1))
        cis_df = cis.map_cis(genotype_df, 
                            variant_df, 
                            phenotype_df,
                            phenotype_pos_df,
                            covariates_df=covariates_df, 
                            seed=999, 
                            window=window, 
                            maf_threshold = $[maf_threshold],
                            group_s=group_s)
        cis_df.index.name = "molecular_trait_id"
        ## Add groups columns for eQTL analysis
        if "group_id" not in cis_df.columns:
            cis_df["group_id"] = cis_df.index
            cis_df["group_size"] = 1
        cis_df.rename(columns={"group_id": "molecular_trait_object_id", "group_size": "n_traits", 
                        'start_distance': 'tss_distance', 'end_distance': 'tes_distance',
                        "num_var": "n_variants", "pval_nominal": "p_nominal", 
                        'slope': 'bhat', 'slope_se': 'sebhat',
                        "pval_true_df": "p_true_df", "pval_perm": "p_perm", "pval_beta": "p_beta"}, inplace = True)
        cis_df = cis_df.assign(genomic_inflation = lambda dataframe : dataframe["molecular_trait_object_id"].map(lambda molecular_trait_object_id:lambda_col[molecular_trait_object_id]))
        # merge cis_df with variant_df
        cis_df = variant_df.merge(cis_df, right_on='variant_id', left_index=True)
        cis_df.rename(columns={'a1': 'a2', 'a0': 'a1'}, inplace=True)
        # sort the table if chrom and pos is not in ascending order
        if not all(cis_df['pos'].iloc[i] <= cis_df['pos'].iloc[i+1] for i in range(len(cis_df)-1)):
            cis_df = cis_df.sort_values(by=['chrom', 'pos'])
        # save file
        cis_df.to_csv(str($[_output["nominal"]:nnnr])+str('.regional.tsv'), sep='\t', index = None)
        # print general information of cis_df
        print('Output Information:')
        print("Output Rows:", len(cis_df))
        print("Output Columns:", cis_df.columns.tolist())
        print("Output Preview:", cis_df.iloc[0:5, 0:10])

R: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container
    library(purrr)
    library(tidyr)
    library(readr)
    library(dplyr)
    library(qvalue)
  
    pairs_df = read_delim($[_output["nominal"]:nr,], delim = '\t')
    compute_qvalues <- function(pvalues) {
        tryCatch({
            if(length(pvalues) < 2) {
                return(pvalues)
            } else {
                return(qvalue(pvalues)$qvalues)
            }
        }, error = function(e) {
            message("Too few p-values to calculate qvalue, fall back to BH")
            qvalue(pvalues, pi0 = 1)$qvalues
        })
    }
    
    var_interaction <- "$[interaction]"
    # Check if 'interaction' is a file
    if (file.exists(var_interaction)) {
        # Read the file into 'interaction_s' dataframe
        interaction_s <- read.delim(var_interaction, row.names = 1)
        # Update 'var_interaction' to the first column name of 'interaction_s'
        var_interaction <- names(interaction_s)[1]
    }

    if (is.null(var_interaction) || var_interaction == "") {
        pairs_df = pairs_df %>% group_by(molecular_trait_id) %>% mutate(qvalue = compute_qvalues(pvalue))
    } else {
        pairs_df = pairs_df %>% group_by(molecular_trait_id) %>% mutate(qvalue_main = compute_qvalues(pvalue), qvalue_interaction = compute_qvalues($["pvalue_%s_interaction" % var_interaction]))         
    }

    pairs_df %>% write_delim($[_output["nominal"]:nr],"\t")
  
bash: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container
        bgzip --compress-level 9 $[_output["nominal"]:n] 
        tabix -S 1 -s 1 -b 2 -e 2 $[_output["nominal"]]

done_if(not test_regional_association)

bash: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container
        bgzip --compress-level 9 $[_output["nominal"]:nnn].regional.tsv
        tabix -S 1 -s 1 -b 2 -e 2 $[_output["nominal"]:nnn].regional.tsv.gz

[cis_2]
done_if("regional" not in _input.labels)
input: group_by = "all"
output_file_prefix = name if len(_input["nominal"]) > 1 else f'{_input["nominal"][0]:bnnnn}'
output: f'{cwd}/{output_file_prefix}.cis_qtl_regional_significance.tsv.gz',
        f'{cwd}/{output_file_prefix}.cis_qtl_regional_significance.summary.txt'
input_files = [str(x) for x in _input["regional"]]
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    library("purrr")
    library("tidyr")
    library("dplyr")
    library("readr")
    library("qvalue")
    emprical_pd = tibble(map(c($[_input["regional"]:r,]), ~read_delim(.x,"\t")))%>%unnest()
    emprical_pd["q_beta"] = tryCatch(qvalue(emprical_pd$p_beta)$qvalue, error = function(e){print("Too few pvalue to calculate qvalue, fall back to BH") 
                                                                                              qvalue(emprical_pd$p_beta,pi0 = 1 )$qvalue})  

    emprical_pd["q_perm"] = tryCatch(qvalue(emprical_pd$p_perm)$qvalue, error = function(e){print("Too few pvalue to calculate qvalue, fall back to BH") 
                                                                                              qvalue(emprical_pd$p_perm,pi0 = 1 )$qvalue})
    emprical_pd["fdr_beta"] = p.adjust(emprical_pd$p_beta,"fdr")    
    emprical_pd["fdr_perm"] = p.adjust(emprical_pd$p_perm,"fdr")   


    # Calculate the global nominal p-value threshold based on q_beta at FDR 0.05
    if (!all(is.na(emprical_pd$p_beta))) {
      lb <- emprical_pd %>% 
        filter(q_beta <= 0.05) %>% 
        pull(p_beta) %>% 
        sort()
      
      ub <- emprical_pd %>% 
        filter(q_beta > 0.05) %>% 
        pull(p_beta) %>% 
        sort()
      
      if (length(lb) > 0) {
        lb_val <- tail(lb, 1)
        threshold <- if (length(ub) > 0) (lb_val + head(ub, 1)) / 2 else lb_val
        message(sprintf("min p-value threshold @ FDR 0.05: %g", threshold))
        
        emprical_pd <- emprical_pd %>% 
          mutate(p_nominal_threshold = qbeta(threshold, beta_shape1, beta_shape2))
      }
    }

    summary = tibble("fdr_perm_0.05" =  sum(emprical_pd["fdr_perm"] < 0.05), 
                      "fdr_beta_0.05" = sum(emprical_pd["fdr_beta"] < 0.05),
                      "q_perm_0.05" = sum(emprical_pd["q_perm"] < 0.05),
                      "q_beta_0.05" = sum(emprical_pd["q_beta"] < 0.05),
                      "fdr_perm_0.01" =  sum(emprical_pd["fdr_perm"] < 0.01), 
                      "fdr_beta_0.01" = sum(emprical_pd["fdr_beta"] < 0.01),
                      "q_perm_0.01" = sum(emprical_pd["q_perm"] < 0.01),
                      "q_beta_0.01" = sum(emprical_pd["q_beta"] < 0.01)  )
    emprical_pd%>%write_delim("$[_output[0]]","\t")
    summary%>%write_delim("$[_output[1]]","\t")

Trans-xQTL association testing#

For transQTL analysis, if you output all the p-values for many genes (default setting) it is suggested to provide the largest memory and CPU threads available on a compute node. eg 250G and >32 threads.

[trans_1]
input: input_files, group_by = len(input_files[0]), group_with = "input_chroms"
output: nominal = f'{cwd:a}/{_input[0]:bnn}{"" if input_chroms[_index] == 0 else "_%s" % input_chroms[_index]}.trans_qtl.pairs.tsv.gz'

parameter: batch_size = 10000
parameter: pval_threshold = 1.0
# Permutation testing is incorrect when the analysis is done by chrom
parameter: permutation = False

task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
python: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container =container
    import pandas as pd
    import numpy as np
    import tensorqtl
    from tensorqtl import genotypeio, trans
    from scipy.stats import chi2
    import gc
    import os
    from statsmodels.stats.multitest import multipletests

    ## Define paths
    plink_prefix_path = $[_input[1]:nar]
    expression_bed = $[_input[0]:ar]
    covariates_file = "$[covariate_file:a]"
    window = $[window]
    current_chrom = "$[input_chroms[_index]]" if "$[input_chroms[_index]]" != "0" else None
    
    print(f"Processing with output name: {current_chrom}")
    phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
    phenotype_id = phenotype_pos_df.index.name


    ## Analyze only the regions listed
    if $[region_list.is_file()]:
        region = pd.read_csv("$[region_list:a]")
        phenotype_column = 1 if len(region.columns) == 1 else $[region_list_phenotype_column]
        keep_region = region.iloc[:, phenotype_column-1].astype(str).str.strip().to_list()
        phenotype_df = phenotype_df[phenotype_df.index.isin(keep_region)]
        phenotype_pos_df = phenotype_pos_df[phenotype_pos_df.index.isin(keep_region)]


    ## use custom cis windows
    if $[customized_cis_windows.is_file()]:
        cis_list = pd.read_csv("$[customized_cis_windows:a]", comment="#", header=None, names=["chr","start","end",phenotype_id], sep="\t")
        phenotype_pos_df_reset = phenotype_pos_df.reset_index()
        phenotype_pos_df = phenotype_pos_df_reset.merge(cis_list, left_on=["chr",phenotype_id], right_on=[cis_list.columns[0],cis_list.columns[3]])
        if len(phenotype_df.index) - phenotype_pos_df_reset[~phenotype_pos_df_reset[phenotype_id].isin(cis_list[phenotype_id])].shape[0]!= len(phenotype_pos_df.index):
            raise ValueError("cannot uniquely match all the phentoype data in the input to the customized cis windows provided")
        phenotype_pos_df = phenotype_pos_df.set_index(phenotype_id)[["chr","start","end"]]
        window = 0
        if phenotype_pos_df_reset[~phenotype_pos_df_reset[phenotype_id].isin(cis_list[phenotype_id])].shape[0] != 0:
            phenotype_df = phenotype_df.loc[phenotype_df.index.isin(cis_list[phenotype_id])]

    covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T


    ## Filter covariates based on covariate_pattern if provided
    covariate_pattern_list = $[covariate_pattern]
    if covariate_pattern_list:
        print(f"Filtering covariates using pattern: {covariate_pattern_list}")
        pattern_mapping = {
            "pheno_PC": ["Hidden_Factor_PC"],  # Map pheno_PC to columns starting with Hidden_Factor_PC
            "geno_PC": ["PC"]                 # Map geno_PC to columns starting with PC
        }
        
        keep_cols = []
        for col in covariates_df.columns:
            if col in covariate_pattern_list:
                keep_cols.append(col)
                continue
        
            for pattern in covariate_pattern_list:
                if pattern in pattern_mapping:
                    # For special patterns like pheno_PC and geno_PC, check their mappings
                    for mapped_pattern in pattern_mapping[pattern]:
                        if col.startswith(mapped_pattern):
                            keep_cols.append(col)
                            break
        
        if not keep_cols:
            print("Warning: No covariate columns match the provided pattern!")
        else:
            print(f"Keeping {len(keep_cols)} covariates: {keep_cols}")
            covariates_df = covariates_df[keep_cols]
       

    genotype_df, variant_df = genotypeio.load_genotypes(plink_prefix_path, dosages = True) 
    ## use custom sample list to subset the covariates data
    if $[keep_sample.is_file()]:
        sample_list = pd.read_csv("$[keep_sample:a]", comment="#", header=None, names=["sample_id"], sep="\t")
        covariates_df = covariates_df.loc[sample_list.sample_id]

    ## Retaining only common samples
    phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, covariates_df.index)]
    phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, genotype_df.columns)] 
    covariates_df = covariates_df.transpose()[np.intersect1d(phenotype_df.columns, covariates_df.index)].transpose()
    pr = genotypeio.PlinkReader(plink_prefix_path)
    bim_df = pr.bim
    bim_df['chrom'] = bim_df['chrom'].str.replace("chr", "")

    
    ## Get all chromosomes from genotype and phenotype data
    ## To simplify things, there should really not be "chr" prefix 
    phenotype_pos_df.chr = phenotype_pos_df.chr.astype(str).str.replace("chr", "")
    variant_df.chrom = variant_df.chrom.astype("str").str.replace("chr", "") 
    
    pheno_chroms = sorted(phenotype_pos_df.chr.unique().tolist())
    print(f"Phenotype data contains chromosomes: {pheno_chroms}")

    geno_chroms = sorted(variant_df.chrom.unique().tolist())
    print(f"Genotype data contains chromosomes: {geno_chroms}")

    # If current chromosome specified, only process that phenotype chromosome
    if current_chrom and current_chrom in pheno_chroms:
        pheno_chroms_to_process = [current_chrom]
        print(f"Processing only phenotype chromosome: {current_chrom}")
    else:
        pheno_chroms_to_process = pheno_chroms
        print(f"Processing all phenotype chromosomes: {pheno_chroms_to_process}")
    
    # Determine genotype chromosomes to process
    geno_chroms_to_process = geno_chroms
    print(f"Processing all genotype chromosomes: {geno_chroms_to_process}")
    
    # Calculate total combinations
    total_combinations = len(pheno_chroms_to_process) * len(geno_chroms_to_process)
    print(f"Total combinations to process: {total_combinations}")
    all_results = []
    

    # Process each combination
    combination_count = 0
    for pheno_chrom in pheno_chroms_to_process:
        # Filter phenotype data to keep only current chromosome
        phenotype_pos_df_filtered = phenotype_pos_df[phenotype_pos_df.chr == pheno_chrom]
        phenotype_df_filtered = phenotype_df[phenotype_df.index.isin(phenotype_pos_df_filtered.index)]
        
        print(f"Filtered phenotypes for chromosome {pheno_chrom}: {len(phenotype_df_filtered)} remaining")
        
        if len(phenotype_df_filtered) == 0:
            print(f"No phenotypes found for chromosome {pheno_chrom}, skipping")
            continue
        
        for geno_chrom in geno_chroms_to_process:
            combination_count += 1
            print(f"Processing combination {combination_count}/{total_combinations}: phenotype chr{pheno_chrom} x genotype chr{geno_chrom}")
            
            # Filter genotype data to keep only current chromosome
            chrom_variants = variant_df[variant_df.chrom == geno_chrom].index.tolist()
            
            if len(chrom_variants) == 0:
                print(f"No variants found for chromosome {geno_chrom}, skipping")
                continue
            
            # Load genotype data
            print(f"Loading genotypes for chromosome {geno_chrom}...")
            genotype_df = pr.load_genotypes()
            variant_df = bim_df.set_index('snp')[['chrom', 'pos', 'a0', 'a1']]
            
            # Keep only current chromosome variants
            genotype_df_chr = genotype_df.loc[chrom_variants]
            variant_df_chr = variant_df.loc[chrom_variants]

            del genotype_df
            gc.collect()
            # Find common samples
            common_samples = np.intersect1d(phenotype_df_filtered.columns, genotype_df_chr.columns)
            common_samples = np.intersect1d(common_samples, covariates_df.index)
            
            if len(common_samples) == 0:
                print(f"No common samples between phenotypes, genotypes, and covariates, skipping")
                del genotype_df_chr
                gc.collect()
                continue
            
            phenotype_df_final = phenotype_df_filtered[common_samples]
            genotype_df_final = genotype_df_chr[common_samples]
            covariates_df_final = covariates_df.loc[common_samples]
            
            print(f"Final analysis dimensions:")
            print(f"  Samples: {len(common_samples)}")
            print(f"  Phenotypes: {len(phenotype_df_final)}")
            print(f"  Variants: {len(genotype_df_final)}")
            
            # Trans analysis
            print(f"Running trans analysis for pheno chr{pheno_chrom} x geno chr{geno_chrom}...")
            try:
                trans_df = trans.map_trans(genotype_df_final, 
                                        phenotype_df_final,
                                        covariates_df_final, 
                                        batch_size=$[batch_size],
                                        return_sparse=True, 
                                        return_r2=True,
                                        pval_threshold=$[pval_threshold], 
                                        maf_threshold=$[maf_threshold])
                
                del genotype_df_chr, genotype_df_final
                gc.collect()

                # Filter out cis signal
                if trans_df is not None and not trans_df.empty:
                    print(f"Filtering cis signals...")
                    trans_df = trans.filter_cis(trans_df, phenotype_pos_df_filtered, variant_df_chr, window=window)

                    if trans_df is not None and not trans_df.empty:
                        print(f"Found {len(trans_df)} trans-QTLs")
                        trans_df.rename(columns={"phenotype_id": "molecular_trait_id", 
                                            "pval": "pvalue", 
                                            "b": "bhat", "b_se": "sebhat"}, inplace=True)
                        trans_df["n"] = len(common_samples)
                        
                        # Merge variant information
                        trans_df = variant_df_chr.merge(trans_df, right_on='variant_id', left_index=True)
                        trans_df.rename(columns={'a1': 'a2', 'a0': 'a1'}, inplace=True)
                        trans_df['pheno_chrom'] = pheno_chrom
                        trans_df['geno_chrom'] = geno_chrom
                        all_results.append(trans_df)
                    else:
                        print(f"No trans-QTLs found after cis filtering")
                else:
                    print(f"No trans-QTLs found")
            except Exception as e:
                print(f"Error during trans analysis: {e}")
                continue
    
    if all_results:
        print(f"Merging {len(all_results)} result sets...")
        combined_results = pd.concat(all_results, ignore_index=True)
        print(f"Total trans-QTLs found: {len(combined_results)}")
        
        # Calculate genomic inflation factor
        lambda_col = combined_results.groupby("molecular_trait_id").apply(
            lambda x: chi2.ppf(1. - np.median(x.pvalue), 1)/chi2.ppf(0.5,1)
        )
        lambda_col = lambda_col.reset_index()
        lambda_col.columns = ['molecular_trait_id', 'genomic_inflation_lambda']
        lambda_col.to_csv("$[_output:nnn].genomic_inflation.tsv.gz", sep='\t', index=None, 
                        compression={'method': 'gzip', 'compresslevel': 9})
        combined_results = combined_results.sort_values(by=['chrom', 'pos', 'molecular_trait_id'])

        # Output information
        print('Output Information:')
        print(f"Output Rows: {len(combined_results)}")
        print(f"Output Columns: {combined_results.columns.tolist()}")
        if len(combined_results) > 0:
            print(f"Output Preview (first 5 rows):")
            print(combined_results.iloc[0:min(5, len(combined_results)), 0:10])
        
        output_file = "$[_output['nominal']:n]"
        combined_results.to_csv(output_file, sep='\t', index=None)
        print(f"Results saved to {output_file}")
    else:
        print("No trans-QTLs found across all chromosome combinations")
        with open("$[_output['nominal']:n]", 'w') as f:
            f.write("No trans-QTLs found across any chromosome combinations\n")

R: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container
    library(purrr)
    library(tidyr)
    library(readr)
    library(dplyr)
    library(qvalue)

    compute_qvalues <- function(pvalues) {
        tryCatch({
            if(length(pvalues) < 2) {
                return(pvalues)
            } else {
                return(qvalue(pvalues)$qvalues)
            }
        }, error = function(e) {
            message("Too few p-values to calculate qvalue, fall back to BH")
            qvalue(pvalues, pi0 = 1)$qvalues
        })
    }
    
    pairs_df <- read.table("$[_output['nominal']:n]", header = TRUE, sep = '\t', stringsAsFactors = FALSE)

    if (nrow(pairs_df) <= 1) {
        message("File is empty or has only header. No qvalues to calculate.")
    } else {
        unique_traits <- unique(pairs_df$molecular_trait_id)
        for(trait in unique_traits) {
            trait_rows <- pairs_df$molecular_trait_id == trait
            pairs_df$qvalue[trait_rows] <- compute_qvalues(pairs_df$pvalue[trait_rows])
        }
        
        write.table(pairs_df, "$[_output['nominal']:n]", sep = '\t', row.names = FALSE, quote = FALSE)
    }

bash: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container
        bgzip --compress-level 9 $[_output["nominal"]:n]
        tabix -S 1 -s 1 -b 2 -e 2 $[_output["nominal"]]

[trans_2]
# This workflow filters significant trans-QTL associations based on p-value or q-value cutoffs
input: [str(x) for x in output_from('trans_1') if str(x).endswith('.trans_qtl.pairs.tsv.gz')], group_by = 1

if qvalue_cutoff:
    cutoff_name = f"qvalue_{qvalue_cutoff}"
elif pvalue_cutoff:
    cutoff_name = f"pvalue_{pvalue_cutoff}"
else:
    cutoff_name = "pvalue_5e-8"
output: f'{cwd}/{_input:bn}.{cutoff_name}_filtered_trans_pairs.tsv'

task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand="$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container
    #!/bin/bash
    set -e
    input_file="$[_input]"
    pvalue_cutoff="$[pvalue_cutoff]"
    qvalue_cutoff="$[qvalue_cutoff]"
    
    # Determine cutoff value and column
    if [ -n "$qvalue_cutoff" ]; then
        cutoff_value="$qvalue_cutoff"
        cutoff_col=$(zcat "$input_file" | awk -F '\t' -v col='qvalue' 'NR==1{for (i=1; i<=NF; i++) if ($i==col) {print i;exit}}')
        cutoff_name="qvalue_${qvalue_cutoff}"
        echo "Using q-value cutoff: $cutoff_value"
    elif [ -n "$pvalue_cutoff" ]; then
        cutoff_value="$pvalue_cutoff"
        cutoff_col=$(zcat "$input_file" | awk -F '\t' -v col='pvalue' 'NR==1{for (i=1; i<=NF; i++) if ($i==col) {print i;exit}}')
        cutoff_name="pvalue_${pvalue_cutoff}"
        echo "Using p-value cutoff: $cutoff_value"
    else
        echo "No cutoff provided, using default p-value cutoff of 5e-8"
        cutoff_value="5e-8"
        cutoff_col=$(zcat "$input_file" | awk -F '\t' -v col='pvalue' 'NR==1{for (i=1; i<=NF; i++) if ($i==col) {print i;exit}}')
        cutoff_name="pvalue_5e-8"
    fi
    
    echo "Using column $cutoff_col with cutoff value $cutoff_value"
    
    # Create the final output directory and file
    input_basename=$(basename "$input_file" .gz)
    input_basename=${input_basename%.tsv}
    final_output_file="$[_output[0]]"
    mkdir -p $(dirname "$final_output_file")

    zcat "$input_file" | head -n 1 > "$final_output_file"
    # Filter and append significant associations
    echo "Processing $input_file..."
    zcat "$input_file" | awk -v cutoff="$cutoff_value" -v col="$cutoff_col" 'NR > 1 && $col < cutoff' >> "$final_output_file"
    echo "Total significant associations found: $(wc -l < "$final_output_file") (including header)"

QTL Association Testing

Contents

QTL Association Testing#

Description#

Input#

Example phenotype list#

Output#

Nominal Association Results#

Multiple Testing Corrected Results:#

Interaction Association Results#

Multiple Testing Corrected Results:#

Region (Gene) Level Association Evidence#

Multiple Testing Corrected Results:#

Minimal Working Example Steps#

i. Cis TensorQTL Command#

ii. Trans TensorQTL Command#

Troubleshooting#

Command Interface#

Old Minimal working example#

Setup and global parameters#

cis-xQTL association testing#

Trans-xQTL association testing#