QTL Association Testing (TensorQTL)#

This notebook performs cis- and trans-QTL association testing using TensorQTL. For each molecular phenotype it tests association against nearby (cis) or genome-wide (trans) genetic variants and reports nominal and region-level statistics.

Description#

Two workflows are available:

  • cis : association between each molecular phenotype and variants within a window around it (default 1 Mb around the TSS). Produces nominal pair statistics plus region (gene) level significance.

  • trans: association between phenotypes and variants on a chosen genotype chromosome, optionally restricted to a region list.

Inputs are the by-chromosome genotype/phenotype file lists and the covariate matrix produced by the genotype, phenotype, and covariate preprocessing steps.

Timing: cis ~1 min; trans ~1.5 min on the toy dataset.

Input Files#

Parameter

Toy file

Description

--genotype-file

output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.genotype_by_chrom_files.txt

Two-column list (chrom id, PLINK bed prefix) of by-chromosome genotypes

--phenotype-file

output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.txt

List of bgzipped+tabixed molecular phenotype bed.gz files by chromosome

--covariate-file

output/covariate/...prune.pca.Marchenko_PC.gz

Covariate matrix (known covariates + hidden factors)

--region-list

data/combined_AD_genes.csv

(trans, optional) subset of regions/genes to analyze; gene name in column 4

# Load required libraries
library(data.table)
library(genio)

# ------------------------------------------------------------------
# Example phenotype file (MWE: protocol_example, chr22)
# ------------------------------------------------------------------
# A text file listing one phenotype .bed.gz per chromosome.
pheno_path <- fread("output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.txt")
head(pheno_path)

# One chromosome of phenotype data in BED-like format (.bed.gz).
pheno <- fread("output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.chr22.bed.gz")
pheno[1:5, 1:8]
A data.table: 1 × 2
#id#dir
<int><chr>
22output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.chr22.bed.gz
A data.table: 5 × 8
#chrstartendIDSAMPLE_001SAMPLE_002SAMPLE_003SAMPLE_004
<chr><int><int><chr><dbl><dbl><dbl><dbl>
chr221093938710961337ENSG0000028304737.0768400.000005.583274 0.0000000
chr221552819115529138ENSG00000130538 0.0000001.083588.913977 4.0885045
chr221561175815613095ENSG00000231565 5.9780840.000002.63711310.4666167
chr221561540115615577ENSG00000226474 0.0000000.000002.870204 0.0000000
chr221574915515750824ENSG00000235992 5.8202530.000000.000000 0.5868365

Example genotype file#

# ------------------------------------------------------------------
# Example genotype file (MWE: protocol_example)
# ------------------------------------------------------------------
# A text file with two columns: chromosome ID and PLINK prefix path.
geno_path <- fread(file.path("output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.genotype_by_chrom_files.txt"))
head(geno_path)

# Read the merged PLINK set (.bed/.bim/.fam) for the toy sample.
plink_data <- read_plink("output/plink/protocol_example.genotype.merged.plink_qc")
genotypes <- plink_data$X
genotypes[1:5, 1:5]
A data.table: 6 × 2
#id#path
<int><chr>
2/restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.2.bed
6/restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.6.bed
13/restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.13.bed
8/restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.8.bed
7/restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.7.bed
11/restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.11.bed
Reading: output/plink/protocol_example.genotype.merged.plink_qc.bim

Reading: output/plink/protocol_example.genotype.merged.plink_qc.fam

Reading: output/plink/protocol_example.genotype.merged.plink_qc.bed
A matrix: 5 × 5 of type int
SAMPLE_001SAMPLE_002SAMPLE_003SAMPLE_004SAMPLE_005
chr1:113886_CTCTT_C00000
chr1:191223_C_*00000
chr1:191223_C_G00000
chr1:191223_C_T00000
chr1:275436_A_G00000

Example covariates file#

# ------------------------------------------------------------------
# Example covariates file (MWE: protocol_example)
# ------------------------------------------------------------------
cov <- fread("output/covariate/protocol_example.rnaseq.bed.protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz")
cov[, 1:5]
dim(cov)
A data.table: 28 × 5
#idSAMPLE_001SAMPLE_002SAMPLE_003SAMPLE_004
<chr><dbl><dbl><dbl><dbl>
msex 1.000000000 1.00000000 1.00000000 1.000000000
age_death 90.97000000080.2400000083.9000000074.100000000
pmi 10.570000000 7.87000000 9.93000000 2.910000000
study 1.000000000 0.00000000 0.00000000 0.000000000
PC1 -0.233468563 0.01961037 0.11352412-0.040027540
PC2 0.327658887 0.04553162-0.09470268-0.146030127
PC3 0.098807354 0.16302273 0.04634796 0.026222852
PC4 -0.330900271 0.14254433 0.20275994-0.001605085
PC5 -0.310150069 0.10916400-0.18238419-0.120223005
PC6 -0.218843913-0.16840993-0.05711533-0.221893629
PC7 -0.040665022 0.07576872 0.14606728 0.136076636
PC8 -0.286365603 0.14011839-0.08323989 0.040693345
PC9 0.078883981 0.14174392 0.13033626 0.332998336
PC10 -0.151439052-0.08918759 0.25286117-0.152520170
PC11 0.118499177-0.05618921-0.18365505-0.038165749
PC12 0.053282287-0.07951395 0.05349163-0.081934005
PC13 0.115028477-0.01330522-0.01534750-0.070994143
PC14 0.087845445-0.26853232-0.15027729 0.218599476
PC15 0.001030374 0.07595356-0.08898327-0.181299348
Hidden_Factor_PC1 0.851879348 0.98406483 5.18812380-5.667832700
Hidden_Factor_PC2-2.825894932 1.69447908 0.06854215-4.619987817
Hidden_Factor_PC3 3.763782166-0.80512535-5.77247349-3.825334825
Hidden_Factor_PC4-1.301340188 3.15633800 2.42980902-3.266057955
Hidden_Factor_PC5 0.556485272-1.26228531-3.94541314 2.110667490
Hidden_Factor_PC6-4.493764002-0.11490919-2.51105454 0.278112795
Hidden_Factor_PC7 2.846210676-0.42143392 0.34488558-3.151865035
Hidden_Factor_PC8-1.876624232-9.65305786 4.25303685-1.043104423
Hidden_Factor_PC9 3.126759781 0.48805759 5.52493678 1.428784448
  1. 28
  2. 61

Example customized cis window file (Optional)#

# ------------------------------------------------------------------
# Example customized cis-window file (Optional)
# ------------------------------------------------------------------
# Gene-level cis-window (start/end) defining the SNP search region per gene.
# Omit to use --window for a symmetric +/-1Mb window around each gene TSS.
window <- fread("input/reference_data/TAD/TADB_enhanced_cis.bed")
head(window)
A data.table: 6 × 4
#chrstartendgene_id
<chr><int><int><chr>
chr106480000ENSG00000008128
chr106480000ENSG00000008130
chr106480000ENSG00000067606
chr107101193ENSG00000069424
chr107960000ENSG00000069812
chr106480000ENSG00000078369

Example interaction file#

# ------------------------------------------------------------------
# Example interaction file (Optional, for iQTL analysis)
# ------------------------------------------------------------------
# Provide an interaction term as a separate file via --interaction
# when it is not already part of the covariate file.
int <- fread("input/ROSMAP_interaction_example.tsv")
head(int)
Error in fread("input/ROSMAP_interaction_example.tsv"): File 'input/ROSMAP_interaction_example.tsv' does not exist or is non-readable. getwd()=='/mnt/lustre/lab/gwang/home/rl3328/protocol_testing/xqtl-protocol_archived'
Traceback:

1. stopf("File '%s' does not exist or is non-readable. getwd()=='%s'", 
 .     file, getwd())
2. raise_condition(stop, gettextf(fmt, ..., domain = domain), c(class, 
 .     "simpleError", "error", "condition"))
3. signal(obj)

Output Files#

cis (output/tensorqtl_cis/):

File

Description

*.cis_qtl.pairs.tsv.gz (+.tbi)

Nominal variant-phenotype association statistics

*.cis_qtl_regional_significance.tsv.gz

Region (gene) level significance

*.cis_qtl_regional_significance.summary.txt

Summary of regional results

*.cis_qtl_pairs.<chr>.parquet

Per-chromosome nominal results (parquet)

trans (output/tensorqtl_trans/):

File

Description

*_geno_<chr>.trans_qtl.pairs.tsv.gz (+.tbi)

Trans variant-phenotype association statistics

Steps#

The commands below run on the included toy data and write results under --cwd. The genotype, phenotype, and covariate inputs are produced by the preprocessing notebooks.

cis-QTL association#

Test each molecular phenotype against variants within the cis window. Matching chromosomes between the genotype and phenotype lists are analyzed (chr22 in the toy data). --MAC 5 sets the minor-allele-count cutoff for the small toy sample.

sos run pipeline/TensorQTL.ipynb cis \
    --genotype-file output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.genotype_by_chrom_files.txt \
    --phenotype-file output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.txt \
    --covariate-file output/covariate/protocol_example.rnaseq.bed.protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \
    --cwd output/tensorqtl_cis --name protocol_example --MAC 5 --numThreads 2

trans-QTL association#

For trans-analysis:

Computational strategy designed for trans analysis:

Trans analysis faces significant memory challenges as we calculate all associations between all molecular traits × all genetic variants across the genome, creating a massive computational burden. To address this challenge, we implement a two-stage chromosome-based parallelization approach:

Stage 1 (trans_1): Chromosome-based parallelization

  • Phenotype data is processed per chromosome (e.g., 22 separate jobs for autosomes)

  • For each phenotype chromosome, we test associations against variants from all 22 chromosomes

  • This creates phenotype_chr × genotype_chr combinations (e.g., phenotype chr1 vs genotype chr1-22); Garbage was collected between each chromosome combination caculation to release memory

  • Results are combined across all chromosome combinations and saved as compressed files

Stage 2 (trans_2): Significance filtering

  • Supports p-value cutoffs (--pvalue-cutoff) or q-value cutoffs (--qvalue-cutoff)

Test phenotypes against variants on a chosen genotype chromosome (--trans-geno-chromosome 22), restricted to the genes in --region-list. Trans analysis is memory-heavy genome-wide; here it is scoped to the toy chromosome.

sos run pipeline/TensorQTL.ipynb trans \
    --genotype-file output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.genotype_by_chrom_files.txt \
    --phenotype-file output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.txt \
    --covariate-file output/covariate/protocol_example.rnaseq.bed.protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \
    --cwd output/tensorqtl_trans --name protocol_example --MAC 5 --numThreads 2 \
    --trans-geno-chromosome 22 --region-list data/combined_AD_genes.csv --region-list-phenotype-column 4

Output#

For each chromosome, several summary statistics files are generated, including both nominal test statistics for each test and region (gene) level association evidence.

Nominal Association Results#

The columns of the nominal association result are as follows:

  • chrom: Variant chromosome.

  • pos: Variant chromosomal position (basepairs).

  • molecular_trait_id: Molecular trait identifier (gene).

  • variant_id: ID of the variant (rsid or chr:position:ref:alt).

  • tss_distance: Distance of the SNP to the gene transcription start site (TSS).

  • tes_distance: Distance of the SNP to the gene transcription end site (TES).

  • cis_window_start_distance: Distance of the SNP to the start of the cis window (if using a customized cis window).

  • cis_window_end_distance: Distance of the SNP to the end of the cis window (if using a customized cis window).

  • af: The allele frequency of this SNP.

  • ma_samples: Number of samples carrying the minor allele.

  • ma_count: Total number of minor alleles across individuals.

  • pvalue: Nominal P-value from linear regression.

  • bhat: Slope of the linear regression.

  • sebhat: Standard error of bhat.

  • n: Number of phenotypes after basic QC.

Multiple Testing Corrected Results:#

  • qvalue: Calculated q-value for each SNP (grouped by gene).

Interaction Association Results#

The columns of interaction association results are as follows (FIXME):

Model:
$\( \text{phenotype} = \beta_0 + \beta_1 \cdot \text{snp} + \beta_2 \cdot \text{msex} + \beta_3 \cdot (\text{snp} \times \text{msex}) + \epsilon \)$

(Taking msex as the interaction factor)

  • chrom: Chromosome number.

  • pos: Variant chromosomal position (basepairs).

  • a2: Variant reference allele (A, C, T, or G).

  • a1: Variant alternate allele.

  • molecular_trait_id: Molecular trait identifier, varies from phenotypes to phenotypes.

  • variant_id: ID of the top variant (rsid or chr:position:ref:alt).

  • af: Alternative allele frequency in the MiGA cohort.

  • ma_samples: Number of samples carrying the minor allele.

  • ma_count: Total number of minor alleles across individuals.

  • pvalue: P-value of the main effect from the nonlinear regression.

  • bhat: Slope of the main effect from the nonlinear regression.

  • se: Standard error of beta.

  • pvalue_msex: P-value of the msex term from the nonlinear regression.

  • bhat_msex: Slope of the msex term from the nonlinear regression.

  • se_msex: Standard error of bhat_msex.

  • pvalue_msex_interaction: P-value of the interaction term from the nonlinear regression.

  • bhat_msex_interaction: Slope of the interaction term from the nonlinear regression.

  • se_msex_interaction: Standard error of beta_msex_interaction.

  • molecular_trait_object_id: An intermediate ID (can be ignored).

  • n: Number of samples.

Multiple Testing Corrected Results:#

  • qvalue_main: The q-value of the main effect.

  • qvalue_interaction: The q-value of the interaction effect.

Region (Gene) Level Association Evidence#

The column specifications for region-level association evidence are as follows:

  • chrom: Chromosome number.

  • pos: Variant chromosomal position (basepairs).

  • n_variant: Total number of variants tested in cis.

  • beta_shape1: First parameter value of the fitted beta distribution.

  • beta_shape2: Second parameter value of the fitted beta distribution.

  • true_df: Effective degrees of freedom of the beta distribution approximation.

  • p_true_df: Empirical P-value for the beta distribution approximation.

  • variant_id: ID of the top variant (rsid or chr:position:ref:alt).

  • tss_distance: Distance of the SNP to the gene transcription start site (TSS).

  • tes_distance: Distance of the SNP to the gene transcription end site (TES).

  • ma_samples: Number of samples carrying the minor allele.

  • ma_count: Total number of minor alleles across individuals.

  • af: Alternative allele frequency.

  • p_nominal: Nominal P-value from linear regression.

  • bhat: Slope of the linear regression.

  • sehat: Standard error of the bhat.

  • p_perm: First permutation P-value directly obtained from the permutations with the direct method.

  • p_beta: Second permutation P-value obtained via beta approximation (this is the one to use for downstream analysis).

  • molecular_trait_object_id: Molecular trait identifier (gene).

  • n_traits: Group size in the permutation test.

  • genomic_inflation: Genomic inflation factor (lambda), quantifying the extent of bulk inflation and the excess false positive rate.

Multiple Testing Corrected Results:#

  • q_beta: Q-value for p_beta using Storey’s method (qvalue), more conservative than FDR.

  • q_perm: Q-value for p_perm using Storey’s method (qvalue), more conservative than FDR.

  • fdr_beta: Adjusted P-value for p_beta using the Benjamini-Hochberg method (FDR).

  • fdr_perm: Adjusted P-value for p_perm using the Benjamini-Hochberg method (FDR).

  • p_nominal_threshold: Nominal p-value threshold for variants in the corresponding molecular trait, derived from empirical beta distribution as a result of permutation testing.

Command interface#

List all workflows and options:

sos run pipeline/TensorQTL.ipynb -h

Pipeline Code#

The SoS workflow definitions below are unchanged from the original protocol.

[global]
parameter: modular_script_dir = path('code/script')  # override with --modular-script-dir
# Path to the work directory of the analysis.
parameter: cwd = path('output')
# Phenotype file, or a list of phenotype per region.
parameter: phenotype_file = path
# A genotype file in PLINK binary format (bed/bam/fam) format, or a list of genotype per chrom
parameter: genotype_file = path
# Covariate file
parameter: covariate_file = path
# Optional pattern to filter covariates (list of covariate prefixes or exact names)
parameter: covariate_pattern = []
# Prefix for the analysis output
parameter: name = ""
# An optional subset of regions of molecular features to analyze. The last column is the gene names
parameter: region_list = path()
parameter: region_list_phenotype_column = 4
# Set list of sample to be keep
parameter: keep_sample = path()
# FIXME: please document
parameter: interaction = ""

# An optional list documenting the custom cis window for each region to analyze, with four column, chr, start, end, region ID (eg gene ID).
# If this list is not provided, the default `window` parameter (see below) will be used.
parameter: customized_cis_windows = path()

# The phenotype group file to group molecule_trait into molecule_trait_object
# This applies to multiple molecular events in the same region, such as sQTL analysis.
parameter: phenotype_group = path() 

# The name of phenotype corresponding to gene_id or gene_name in the region
parameter: chromosome = []
# Minor allele count cutoff
parameter: MAC = 0
# Specify genotype chromosome for trans analysis (e.g. --trans-geno-chromosome 5)
# When set, trans step loads this genotype chrom instead of the one from input_files
parameter: trans_geno_chromosome = ""

# Specify the cis window for the up and downstream radius to analyze around the region of interest in units of bp
# This parameter will be set to zero if `customized_cis_windows` is provided.
parameter: window = 1000000

# Number of threads
parameter: numThreads = 8
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
# Container option for software to run the analysis: docker or singularity
parameter: container = ''
parameter: entrypoint = ''
import re

# Use the header of the covariate file to decide the sample size
import pandas as pd
N = len(pd.read_csv(covariate_file, sep = "\t",nrows = 1).columns) - 1

# Minor allele frequency cutoff. It will overwrite minor allele cutoff.
# You may consider setting it to higher for interaction analysis if you have statistical power concerns
parameter: maf_threshold = MAC/(2.0*N)

# Filtering significant trans associations (for trans_2 workflow)
parameter: pvalue_cutoff = "5e-8"
parameter: qvalue_cutoff = ""


import os
import pandas as pd

def adapt_file_path(file_path, reference_file):
    """
    Adapt a single file path based on its existence and a reference file's path.

    Args:
    - file_path (str): The file path to adapt.
    - reference_file (str): File path to use as a reference for adaptation.

    Returns:
    - str: Adapted file path.

    Raises:
    - FileNotFoundError: If no valid file path is found.
    """
    reference_path = os.path.dirname(reference_file)

    # Check if the file exists
    if os.path.isfile(file_path):
        return file_path

    # Check file name without path
    file_name = os.path.basename(file_path)
    if os.path.isfile(file_name):
        return file_name

    # Check file name in reference file's directory
    file_in_ref_dir = os.path.join(reference_path, file_name)
    if os.path.isfile(file_in_ref_dir):
        return file_in_ref_dir

    # Check original file path prefixed with reference file's directory
    file_prefixed = os.path.join(reference_path, file_path)
    if os.path.isfile(file_prefixed):
        return file_prefixed

    # If all checks fail, raise an error
    raise FileNotFoundError(f"No valid path found for file: {file_path}")

def adapt_file_path_all(df, column_name, reference_file):
    return df[column_name].apply(lambda x: adapt_file_path(x, reference_file))


if (str(genotype_file).endswith("bed") or str(genotype_file).endswith("pgen")) and str(phenotype_file).endswith("bed.gz"):
    input_files = [[phenotype_file, genotype_file]]
    if len(chromosome) > 0:
        input_chroms = [int(x) for x in chromosome]
    else:
        input_chroms = [0]
else:
    import pandas as pd
    import os
    molecular_pheno_files = pd.read_csv(phenotype_file, sep = "\t")
    
    if "#dir" in molecular_pheno_files.columns and "#chr" not in molecular_pheno_files.columns:
        molecular_pheno_files = molecular_pheno_files.rename(columns={"#dir": "path"})
        if "#id" in molecular_pheno_files.columns:
            molecular_pheno_files = molecular_pheno_files.rename(columns={"#id": "#chr"})
    
    if "#chr" in molecular_pheno_files.columns:
        molecular_pheno_files = molecular_pheno_files.groupby(['#chr','path']).size().reset_index(name='count').drop("count",axis = 1).rename(columns = {"#chr":"#id"})
    genotype_files = pd.read_csv(genotype_file,sep = "\t")
    genotype_files["#id"] = [x.replace("chr","") for x in genotype_files["#id"].astype(str)] # e.g. remove chr1 to 1
    genotype_files["#path"] = genotype_files["#path"].apply(lambda x: adapt_file_path(x, genotype_file))
    molecular_pheno_files["#id"] = [x.replace("chr","") for x in molecular_pheno_files["#id"].astype(str)]
    input_files = molecular_pheno_files.merge(genotype_files, on = "#id")
    
    # Only keep chromosome specified in --chromosome
    if len(chromosome) > 0:
        input_files = input_files[input_files['#id'].isin(chromosome)]
    input_files = input_files.values.tolist()
    input_chroms = [x[0] for x in input_files]
    input_files = [x[1:] for x in input_files]
    if len(name) == 0:
        name = f'{path(input_files[0][0]):bnn}' if len(input_files) == 1 else f'{path(input_files[0][0]):bnnn}'
[cis_1]
# parse input file lists
# skip nominal association results if the files exists already
# This is false by default which means to recompute everything
# This is only relevant when the `parquet` files for nominal results exist but not the other files
# and you want to avoid computing the nominal results again
parameter: skip_nominal_if_exist = False
parameter: permutation = True

# Extract interaction name
var_interaction = interaction
if os.path.isfile(interaction):
    interaction_s = pd.read_csv(interaction, sep='\t', index_col=0)
    var_interaction = interaction_s.columns[0] # interaction name
test_regional_association = permutation and len(var_interaction) == 0

input: input_files, group_by = len(input_files[0]), group_with = "input_chroms"
output_files = dict([("parquet", f'{cwd:a}/{_input[0]:bnn}{"_%s" % var_interaction if interaction else ""}.cis_qtl_pairs.{"" if input_chroms[_index] == 0 else input_chroms[_index]}.parquet'), # This convention is necessary to match the pattern of map_norminal output
                     ("nominal", f'{cwd:a}/{_input[0]:bnn}{"" if input_chroms[_index] == 0 else "_chr%s" % input_chroms[_index]}{"_%s" % var_interaction if interaction else ""}.cis_qtl.pairs.tsv.gz')])
if test_regional_association:
    output_files["regional"] = f'{cwd:a}/{_input[0]:bnn}{"" if input_chroms[_index] == 0 else "_chr%s" % input_chroms[_index]}{"_%s" % var_interaction if interaction else ""}.cis_qtl.regional.tsv.gz'
output: output_files
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output["nominal"]:bnnn}'
bash: expand= "${ }", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container, entrypoint = entrypoint
    python3 ${modular_script_dir}/association_scan/TensorQTL/TensorQTL.py \
        --step cis \
        --cwd "${cwd}" \
        --genotype-file "${_input[1]}" \
        --phenotype-file "${_input[0]}" \
        --covariate-file "${covariate_file}" \
        --interaction "${interaction}" \
        --chromosome ${input_chroms[_index]} \
        --window ${window} \
        --MAC ${MAC} \
        --maf-threshold ${maf_threshold} \
        --pvalue-cutoff "${pvalue_cutoff}" \
        --qvalue-cutoff "${qvalue_cutoff}" \
        --permutation ${permutation} \
        ${"--skip-nominal-if-exist" if skip_nominal_if_exist else ""} \
        ${"--covariate-pattern " + " ".join([str(x) for x in covariate_pattern]) if len(covariate_pattern) else ""} \
        ${"--region-list " + str(region_list) + " --region-list-phenotype-column " + str(region_list_phenotype_column) if region_list.is_file() else ""} \
        ${"--keep-sample " + str(keep_sample) if keep_sample.is_file() else ""} \
        ${"--customized-cis-windows " + str(customized_cis_windows) if customized_cis_windows.is_file() else ""} \
        ${"--phenotype-group " + str(phenotype_group) if phenotype_group.is_file() else ""} \
        --skip-postprocess \
        --numThreads ${numThreads}
[cis_2]
done_if("regional" not in _input.labels)
input: group_by = "all"
output_file_prefix = name if len(_input["nominal"]) > 1 else f'{_input["nominal"][0]:bnnnn}'
output: f'{cwd}/{output_file_prefix}.cis_qtl_regional_significance.tsv.gz',
        f'{cwd}/{output_file_prefix}.cis_qtl_regional_significance.summary.txt'
input_files = [str(x) for x in _input["regional"]]
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, entrypoint = entrypoint
    python3 ${modular_script_dir}/association_scan/TensorQTL/TensorQTL.py \
        --step cis_postprocess \
        --cwd "${cwd}" \
        --output-tsv "${_output[0]}" \
        --output-summary "${_output[1]}" \
        --regional-files ${" ".join(input_files)} \
        --numThreads ${numThreads}
[trans]

parameter: batch_size = 10000
parameter: pval_threshold = 1.0
# Permutation testing is incorrect when the analysis is done by chrom
parameter: permutation = False
parameter: pval = 0.0

input: input_files, group_by = len(input_files[0]), group_with = "input_chroms"
output: nominal = f'{cwd:a}/{_input[0]:bnn}{"_chr%s" % input_chroms[_index] if input_chroms[_index] != 0 else ""}{"_geno_chr%s" % trans_geno_chromosome if trans_geno_chromosome else ""}.trans_qtl{"_p_%.0e" % pval if pval > 0.0 else ""}.pairs.tsv.gz'

trans_genotype_file = genotype_file if trans_geno_chromosome else _input[1]
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, entrypoint = entrypoint
    python3 ${modular_script_dir}/association_scan/TensorQTL/TensorQTL.py \
        --step trans \
        --cwd "${cwd}" \
        --genotype-file "${trans_genotype_file}" \
        --phenotype-file "${_input[0]}" \
        --covariate-file "${covariate_file}" \
        --output "${_output['nominal']}" \
        --trans-geno-chromosome "${trans_geno_chromosome}" \
        --window ${window} \
        --MAC ${MAC} \
        --maf-threshold ${maf_threshold} \
        --batch-size ${batch_size} \
        --pval-threshold ${pval_threshold} \
        --pval ${pval} \
        --pvalue-cutoff "${pvalue_cutoff}" \
        --qvalue-cutoff "${qvalue_cutoff}" \
        ${"--covariate-pattern " + " ".join([str(x) for x in covariate_pattern]) if len(covariate_pattern) else ""} \
        ${"--region-list " + str(region_list) + " --region-list-phenotype-column " + str(region_list_phenotype_column) if region_list.is_file() else ""} \
        ${"--keep-sample " + str(keep_sample) if keep_sample.is_file() else ""} \
        ${"--customized-cis-windows " + str(customized_cis_windows) if customized_cis_windows.is_file() else ""} \
        --numThreads ${numThreads}

Anticipated Results#

The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.