QTL Association Testing (TensorQTL)

QTL Association Testing (TensorQTL)#

This notebook performs cis- and trans-QTL association testing using TensorQTL. For each molecular phenotype it tests association against nearby (cis) or genome-wide (trans) genetic variants and reports nominal and region-level statistics.

Description#

Two workflows are available:

cis : association between each molecular phenotype and variants within a window around it (default 1 Mb around the TSS). Produces nominal pair statistics plus region (gene) level significance.
trans: association between phenotypes and variants on a chosen genotype chromosome, optionally restricted to a region list.

Inputs are the by-chromosome genotype/phenotype file lists and the covariate matrix produced by the genotype, phenotype, and covariate preprocessing steps.

Timing: cis ~1 min; trans ~1.5 min on the toy dataset.

Input Files#

Parameter	Toy file	Description
`--genotype-file`	`output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.genotype_by_chrom_files.txt`	Two-column list (chrom id, PLINK bed prefix) of by-chromosome genotypes
`--phenotype-file`	`output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.txt`	List of bgzipped+tabixed molecular phenotype `bed.gz` files by chromosome
`--covariate-file`	`output/covariate/...prune.pca.Marchenko_PC.gz`	Covariate matrix (known covariates + hidden factors)
`--region-list`	`data/combined_AD_genes.csv`	(trans, optional) subset of regions/genes to analyze; gene name in column 4

# Load required libraries
library(data.table)
library(genio)

# ------------------------------------------------------------------
# Example phenotype file (MWE: protocol_example, chr22)
# ------------------------------------------------------------------
# A text file listing one phenotype .bed.gz per chromosome.
pheno_path <- fread("output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.txt")
head(pheno_path)

# One chromosome of phenotype data in BED-like format (.bed.gz).
pheno <- fread("output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.chr22.bed.gz")
pheno[1:5, 1:8]

A data.table: 1 × 2
#id	#dir
<int>	<chr>
22	output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.chr22.bed.gz

A data.table: 5 × 8
#chr	start	end	ID	SAMPLE_001	SAMPLE_002	SAMPLE_003	SAMPLE_004
<chr>	<int>	<int>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>
chr22	10939387	10961337	ENSG00000283047	37.076840	0.00000	5.583274	0.0000000
chr22	15528191	15529138	ENSG00000130538	0.000000	1.08358	8.913977	4.0885045
chr22	15611758	15613095	ENSG00000231565	5.978084	0.00000	2.637113	10.4666167
chr22	15615401	15615577	ENSG00000226474	0.000000	0.00000	2.870204	0.0000000
chr22	15749155	15750824	ENSG00000235992	5.820253	0.00000	0.000000	0.5868365

Example genotype file#

# ------------------------------------------------------------------
# Example genotype file (MWE: protocol_example)
# ------------------------------------------------------------------
# A text file with two columns: chromosome ID and PLINK prefix path.
geno_path <- fread(file.path("output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.genotype_by_chrom_files.txt"))
head(geno_path)

# Read the merged PLINK set (.bed/.bim/.fam) for the toy sample.
plink_data <- read_plink("output/plink/protocol_example.genotype.merged.plink_qc")
genotypes <- plink_data$X
genotypes[1:5, 1:5]

A data.table: 6 × 2
#id	#path
<int>	<chr>
2	/restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.2.bed
6	/restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.6.bed
13	/restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.13.bed
8	/restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.8.bed
7	/restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.7.bed
11	/restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.11.bed

Reading: output/plink/protocol_example.genotype.merged.plink_qc.bim

Reading: output/plink/protocol_example.genotype.merged.plink_qc.fam

Reading: output/plink/protocol_example.genotype.merged.plink_qc.bed

A matrix: 5 × 5 of type int
	SAMPLE_001	SAMPLE_002	SAMPLE_003	SAMPLE_004	SAMPLE_005
chr1:113886_CTCTT_C	0	0	0	0	0
chr1:191223_C_*	0	0	0	0	0
chr1:191223_C_G	0	0	0	0	0
chr1:191223_C_T	0	0	0	0	0
chr1:275436_A_G	0	0	0	0	0

Example covariates file#

# ------------------------------------------------------------------
# Example covariates file (MWE: protocol_example)
# ------------------------------------------------------------------
cov <- fread("output/covariate/protocol_example.rnaseq.bed.protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz")
cov[, 1:5]
dim(cov)

A data.table: 28 × 5
#id	SAMPLE_001	SAMPLE_002	SAMPLE_003	SAMPLE_004
<chr>	<dbl>	<dbl>	<dbl>	<dbl>
msex	1.000000000	1.00000000	1.00000000	1.000000000
age_death	90.970000000	80.24000000	83.90000000	74.100000000
pmi	10.570000000	7.87000000	9.93000000	2.910000000
study	1.000000000	0.00000000	0.00000000	0.000000000
PC1	-0.233468563	0.01961037	0.11352412	-0.040027540
PC2	0.327658887	0.04553162	-0.09470268	-0.146030127
PC3	0.098807354	0.16302273	0.04634796	0.026222852
PC4	-0.330900271	0.14254433	0.20275994	-0.001605085
PC5	-0.310150069	0.10916400	-0.18238419	-0.120223005
PC6	-0.218843913	-0.16840993	-0.05711533	-0.221893629
PC7	-0.040665022	0.07576872	0.14606728	0.136076636
PC8	-0.286365603	0.14011839	-0.08323989	0.040693345
PC9	0.078883981	0.14174392	0.13033626	0.332998336
PC10	-0.151439052	-0.08918759	0.25286117	-0.152520170
PC11	0.118499177	-0.05618921	-0.18365505	-0.038165749
PC12	0.053282287	-0.07951395	0.05349163	-0.081934005
PC13	0.115028477	-0.01330522	-0.01534750	-0.070994143
PC14	0.087845445	-0.26853232	-0.15027729	0.218599476
PC15	0.001030374	0.07595356	-0.08898327	-0.181299348
Hidden_Factor_PC1	0.851879348	0.98406483	5.18812380	-5.667832700
Hidden_Factor_PC2	-2.825894932	1.69447908	0.06854215	-4.619987817
Hidden_Factor_PC3	3.763782166	-0.80512535	-5.77247349	-3.825334825
Hidden_Factor_PC4	-1.301340188	3.15633800	2.42980902	-3.266057955
Hidden_Factor_PC5	0.556485272	-1.26228531	-3.94541314	2.110667490
Hidden_Factor_PC6	-4.493764002	-0.11490919	-2.51105454	0.278112795
Hidden_Factor_PC7	2.846210676	-0.42143392	0.34488558	-3.151865035
Hidden_Factor_PC8	-1.876624232	-9.65305786	4.25303685	-1.043104423
Hidden_Factor_PC9	3.126759781	0.48805759	5.52493678	1.428784448

28
61

Example customized cis window file (Optional)#

# ------------------------------------------------------------------
# Example customized cis-window file (Optional)
# ------------------------------------------------------------------
# Gene-level cis-window (start/end) defining the SNP search region per gene.
# Omit to use --window for a symmetric +/-1Mb window around each gene TSS.
window <- fread("input/reference_data/TAD/TADB_enhanced_cis.bed")
head(window)

A data.table: 6 × 4
#chr	start	end	gene_id
<chr>	<int>	<int>	<chr>
chr1	0	6480000	ENSG00000008128
chr1	0	6480000	ENSG00000008130
chr1	0	6480000	ENSG00000067606
chr1	0	7101193	ENSG00000069424
chr1	0	7960000	ENSG00000069812
chr1	0	6480000	ENSG00000078369

Example interaction file#

# ------------------------------------------------------------------
# Example interaction file (Optional, for iQTL analysis)
# ------------------------------------------------------------------
# Provide an interaction term as a separate file via --interaction
# when it is not already part of the covariate file.
int <- fread("input/ROSMAP_interaction_example.tsv")
head(int)

Error in fread("input/ROSMAP_interaction_example.tsv"): File 'input/ROSMAP_interaction_example.tsv' does not exist or is non-readable. getwd()=='/mnt/lustre/lab/gwang/home/rl3328/protocol_testing/xqtl-protocol_archived'
Traceback:

1. stopf("File '%s' does not exist or is non-readable. getwd()=='%s'", 
 .     file, getwd())
2. raise_condition(stop, gettextf(fmt, ..., domain = domain), c(class, 
 .     "simpleError", "error", "condition"))
3. signal(obj)

Output Files#

cis (output/tensorqtl_cis/):

File	Description
`*.cis_qtl.pairs.tsv.gz` (+`.tbi`)	Nominal variant-phenotype association statistics
`*.cis_qtl_regional_significance.tsv.gz`	Region (gene) level significance
`*.cis_qtl_regional_significance.summary.txt`	Summary of regional results
`*.cis_qtl_pairs.<chr>.parquet`	Per-chromosome nominal results (parquet)

trans (output/tensorqtl_trans/):

File	Description
`*_geno_<chr>.trans_qtl.pairs.tsv.gz` (+`.tbi`)	Trans variant-phenotype association statistics

Steps#

The commands below run on the included toy data and write results under --cwd. The genotype, phenotype, and covariate inputs are produced by the preprocessing notebooks.

cis-QTL association#

Test each molecular phenotype against variants within the cis window. Matching chromosomes between the genotype and phenotype lists are analyzed (chr22 in the toy data). --MAC 5 sets the minor-allele-count cutoff for the small toy sample.

sos run pipeline/TensorQTL.ipynb cis \
    --genotype-file output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.genotype_by_chrom_files.txt \
    --phenotype-file output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.txt \
    --covariate-file output/covariate/protocol_example.rnaseq.bed.protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \
    --cwd output/tensorqtl_cis --name protocol_example --MAC 5 --numThreads 2

trans-QTL association#

For trans-analysis:

Computational strategy designed for trans analysis:

Trans analysis faces significant memory challenges as we calculate all associations between all molecular traits × all genetic variants across the genome, creating a massive computational burden. To address this challenge, we implement a two-stage chromosome-based parallelization approach:

Stage 1 (trans_1): Chromosome-based parallelization

Phenotype data is processed per chromosome (e.g., 22 separate jobs for autosomes)
For each phenotype chromosome, we test associations against variants from all 22 chromosomes
This creates phenotype_chr × genotype_chr combinations (e.g., phenotype chr1 vs genotype chr1-22); Garbage was collected between each chromosome combination caculation to release memory
Results are combined across all chromosome combinations and saved as compressed files

Stage 2 (trans_2): Significance filtering

Supports p-value cutoffs (--pvalue-cutoff) or q-value cutoffs (--qvalue-cutoff)

Test phenotypes against variants on a chosen genotype chromosome (--trans-geno-chromosome 22), restricted to the genes in --region-list. Trans analysis is memory-heavy genome-wide; here it is scoped to the toy chromosome.

sos run pipeline/TensorQTL.ipynb trans \
    --genotype-file output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.genotype_by_chrom_files.txt \
    --phenotype-file output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.txt \
    --covariate-file output/covariate/protocol_example.rnaseq.bed.protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \
    --cwd output/tensorqtl_trans --name protocol_example --MAC 5 --numThreads 2 \
    --trans-geno-chromosome 22 --region-list data/combined_AD_genes.csv --region-list-phenotype-column 4

Output#

For each chromosome, several summary statistics files are generated, including both nominal test statistics for each test and region (gene) level association evidence.

Nominal Association Results#

The columns of the nominal association result are as follows:

chrom: Variant chromosome.
pos: Variant chromosomal position (basepairs).
molecular_trait_id: Molecular trait identifier (gene).
variant_id: ID of the variant (rsid or chr:position:ref:alt).
tss_distance: Distance of the SNP to the gene transcription start site (TSS).
tes_distance: Distance of the SNP to the gene transcription end site (TES).
cis_window_start_distance: Distance of the SNP to the start of the cis window (if using a customized cis window).
cis_window_end_distance: Distance of the SNP to the end of the cis window (if using a customized cis window).
af: The allele frequency of this SNP.
ma_samples: Number of samples carrying the minor allele.
ma_count: Total number of minor alleles across individuals.
pvalue: Nominal P-value from linear regression.
bhat: Slope of the linear regression.
sebhat: Standard error of bhat.
n: Number of phenotypes after basic QC.

Multiple Testing Corrected Results:#

qvalue: Calculated q-value for each SNP (grouped by gene).

Interaction Association Results#

The columns of interaction association results are as follows (FIXME):

Model:
$$ \text{phenotype} = \beta_0 + \beta_1 \cdot \text{snp} + \beta_2 \cdot \text{msex} + \beta_3 \cdot (\text{snp} \times \text{msex}) + \epsilon $$

(Taking msex as the interaction factor)

chrom: Chromosome number.
pos: Variant chromosomal position (basepairs).
a2: Variant reference allele (A, C, T, or G).
a1: Variant alternate allele.
molecular_trait_id: Molecular trait identifier, varies from phenotypes to phenotypes.
variant_id: ID of the top variant (rsid or chr:position:ref:alt).
af: Alternative allele frequency in the MiGA cohort.
ma_samples: Number of samples carrying the minor allele.
ma_count: Total number of minor alleles across individuals.
pvalue: P-value of the main effect from the nonlinear regression.
bhat: Slope of the main effect from the nonlinear regression.
se: Standard error of beta.
pvalue_msex: P-value of the msex term from the nonlinear regression.
bhat_msex: Slope of the msex term from the nonlinear regression.
se_msex: Standard error of bhat_msex.
pvalue_msex_interaction: P-value of the interaction term from the nonlinear regression.
bhat_msex_interaction: Slope of the interaction term from the nonlinear regression.
se_msex_interaction: Standard error of beta_msex_interaction.
molecular_trait_object_id: An intermediate ID (can be ignored).
n: Number of samples.

Multiple Testing Corrected Results:#

qvalue_main: The q-value of the main effect.
qvalue_interaction: The q-value of the interaction effect.

Region (Gene) Level Association Evidence#

The column specifications for region-level association evidence are as follows:

chrom: Chromosome number.
pos: Variant chromosomal position (basepairs).
n_variant: Total number of variants tested in cis.
beta_shape1: First parameter value of the fitted beta distribution.
beta_shape2: Second parameter value of the fitted beta distribution.
true_df: Effective degrees of freedom of the beta distribution approximation.
p_true_df: Empirical P-value for the beta distribution approximation.
variant_id: ID of the top variant (rsid or chr:position:ref:alt).
tss_distance: Distance of the SNP to the gene transcription start site (TSS).
tes_distance: Distance of the SNP to the gene transcription end site (TES).
ma_samples: Number of samples carrying the minor allele.
ma_count: Total number of minor alleles across individuals.
af: Alternative allele frequency.
p_nominal: Nominal P-value from linear regression.
bhat: Slope of the linear regression.
sehat: Standard error of the bhat.
p_perm: First permutation P-value directly obtained from the permutations with the direct method.
p_beta: Second permutation P-value obtained via beta approximation (this is the one to use for downstream analysis).
molecular_trait_object_id: Molecular trait identifier (gene).
n_traits: Group size in the permutation test.
genomic_inflation: Genomic inflation factor (lambda), quantifying the extent of bulk inflation and the excess false positive rate.

Multiple Testing Corrected Results:#

q_beta: Q-value for p_beta using Storey’s method (qvalue), more conservative than FDR.
q_perm: Q-value for p_perm using Storey’s method (qvalue), more conservative than FDR.
fdr_beta: Adjusted P-value for p_beta using the Benjamini-Hochberg method (FDR).
fdr_perm: Adjusted P-value for p_perm using the Benjamini-Hochberg method (FDR).
p_nominal_threshold: Nominal p-value threshold for variants in the corresponding molecular trait, derived from empirical beta distribution as a result of permutation testing.

Command interface#

List all workflows and options:

sos run pipeline/TensorQTL.ipynb -h

Pipeline Code#

The SoS workflow definitions below are unchanged from the original protocol.

[global]
parameter: modular_script_dir = path('code/script')  # override with --modular-script-dir
# Path to the work directory of the analysis.
parameter: cwd = path('output')
# Phenotype file, or a list of phenotype per region.
parameter: phenotype_file = path
# A genotype file in PLINK binary format (bed/bam/fam) format, or a list of genotype per chrom
parameter: genotype_file = path
# Covariate file
parameter: covariate_file = path
# Optional pattern to filter covariates (list of covariate prefixes or exact names)
parameter: covariate_pattern = []
# Prefix for the analysis output
parameter: name = ""
# An optional subset of regions of molecular features to analyze. The last column is the gene names
parameter: region_list = path()
parameter: region_list_phenotype_column = 4
# Set list of sample to be keep
parameter: keep_sample = path()
# FIXME: please document
parameter: interaction = ""

# An optional list documenting the custom cis window for each region to analyze, with four column, chr, start, end, region ID (eg gene ID).
# If this list is not provided, the default `window` parameter (see below) will be used.
parameter: customized_cis_windows = path()

# The phenotype group file to group molecule_trait into molecule_trait_object
# This applies to multiple molecular events in the same region, such as sQTL analysis.
parameter: phenotype_group = path() 

# The name of phenotype corresponding to gene_id or gene_name in the region
parameter: chromosome = []
# Minor allele count cutoff
parameter: MAC = 0
# Specify genotype chromosome for trans analysis (e.g. --trans-geno-chromosome 5)
# When set, trans step loads this genotype chrom instead of the one from input_files
parameter: trans_geno_chromosome = ""

# Specify the cis window for the up and downstream radius to analyze around the region of interest in units of bp
# This parameter will be set to zero if `customized_cis_windows` is provided.
parameter: window = 1000000

# Number of threads
parameter: numThreads = 8
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
# Container option for software to run the analysis: docker or singularity
parameter: container = ''
parameter: entrypoint = ''
import re

# Use the header of the covariate file to decide the sample size
import pandas as pd
N = len(pd.read_csv(covariate_file, sep = "\t",nrows = 1).columns) - 1

# Minor allele frequency cutoff. It will overwrite minor allele cutoff.
# You may consider setting it to higher for interaction analysis if you have statistical power concerns
parameter: maf_threshold = MAC/(2.0*N)

# Filtering significant trans associations (for trans_2 workflow)
parameter: pvalue_cutoff = "5e-8"
parameter: qvalue_cutoff = ""


import os
import pandas as pd

def adapt_file_path(file_path, reference_file):
    """
    Adapt a single file path based on its existence and a reference file's path.

    Args:
    - file_path (str): The file path to adapt.
    - reference_file (str): File path to use as a reference for adaptation.

    Returns:
    - str: Adapted file path.

    Raises:
    - FileNotFoundError: If no valid file path is found.
    """
    reference_path = os.path.dirname(reference_file)

    # Check if the file exists
    if os.path.isfile(file_path):
        return file_path

    # Check file name without path
    file_name = os.path.basename(file_path)
    if os.path.isfile(file_name):
        return file_name

    # Check file name in reference file's directory
    file_in_ref_dir = os.path.join(reference_path, file_name)
    if os.path.isfile(file_in_ref_dir):
        return file_in_ref_dir

    # Check original file path prefixed with reference file's directory
    file_prefixed = os.path.join(reference_path, file_path)
    if os.path.isfile(file_prefixed):
        return file_prefixed

    # If all checks fail, raise an error
    raise FileNotFoundError(f"No valid path found for file: {file_path}")

def adapt_file_path_all(df, column_name, reference_file):
    return df[column_name].apply(lambda x: adapt_file_path(x, reference_file))


if (str(genotype_file).endswith("bed") or str(genotype_file).endswith("pgen")) and str(phenotype_file).endswith("bed.gz"):
    input_files = [[phenotype_file, genotype_file]]
    if len(chromosome) > 0:
        input_chroms = [int(x) for x in chromosome]
    else:
        input_chroms = [0]
else:
    import pandas as pd
    import os
    molecular_pheno_files = pd.read_csv(phenotype_file, sep = "\t")
    
    if "#dir" in molecular_pheno_files.columns and "#chr" not in molecular_pheno_files.columns:
        molecular_pheno_files = molecular_pheno_files.rename(columns={"#dir": "path"})
        if "#id" in molecular_pheno_files.columns:
            molecular_pheno_files = molecular_pheno_files.rename(columns={"#id": "#chr"})
    
    if "#chr" in molecular_pheno_files.columns:
        molecular_pheno_files = molecular_pheno_files.groupby(['#chr','path']).size().reset_index(name='count').drop("count",axis = 1).rename(columns = {"#chr":"#id"})
    genotype_files = pd.read_csv(genotype_file,sep = "\t")
    genotype_files["#id"] = [x.replace("chr","") for x in genotype_files["#id"].astype(str)] # e.g. remove chr1 to 1
    genotype_files["#path"] = genotype_files["#path"].apply(lambda x: adapt_file_path(x, genotype_file))
    molecular_pheno_files["#id"] = [x.replace("chr","") for x in molecular_pheno_files["#id"].astype(str)]
    input_files = molecular_pheno_files.merge(genotype_files, on = "#id")
    
    # Only keep chromosome specified in --chromosome
    if len(chromosome) > 0:
        input_files = input_files[input_files['#id'].isin(chromosome)]
    input_files = input_files.values.tolist()
    input_chroms = [x[0] for x in input_files]
    input_files = [x[1:] for x in input_files]
    if len(name) == 0:
        name = f'{path(input_files[0][0]):bnn}' if len(input_files) == 1 else f'{path(input_files[0][0]):bnnn}'

[cis_1]
# parse input file lists
# skip nominal association results if the files exists already
# This is false by default which means to recompute everything
# This is only relevant when the `parquet` files for nominal results exist but not the other files
# and you want to avoid computing the nominal results again
parameter: skip_nominal_if_exist = False
parameter: permutation = True

# Extract interaction name
var_interaction = interaction
if os.path.isfile(interaction):
    interaction_s = pd.read_csv(interaction, sep='\t', index_col=0)
    var_interaction = interaction_s.columns[0] # interaction name
test_regional_association = permutation and len(var_interaction) == 0

input: input_files, group_by = len(input_files[0]), group_with = "input_chroms"
output_files = dict([("parquet", f'{cwd:a}/{_input[0]:bnn}{"_%s" % var_interaction if interaction else ""}.cis_qtl_pairs.{"" if input_chroms[_index] == 0 else input_chroms[_index]}.parquet'), # This convention is necessary to match the pattern of map_norminal output
                     ("nominal", f'{cwd:a}/{_input[0]:bnn}{"" if input_chroms[_index] == 0 else "_chr%s" % input_chroms[_index]}{"_%s" % var_interaction if interaction else ""}.cis_qtl.pairs.tsv.gz')])
if test_regional_association:
    output_files["regional"] = f'{cwd:a}/{_input[0]:bnn}{"" if input_chroms[_index] == 0 else "_chr%s" % input_chroms[_index]}{"_%s" % var_interaction if interaction else ""}.cis_qtl.regional.tsv.gz'
output: output_files
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output["nominal"]:bnnn}'
bash: expand= "${ }", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container, entrypoint = entrypoint
    python3 ${modular_script_dir}/association_scan/TensorQTL/TensorQTL.py \
        --step cis \
        --cwd "${cwd}" \
        --genotype-file "${_input[1]}" \
        --phenotype-file "${_input[0]}" \
        --covariate-file "${covariate_file}" \
        --interaction "${interaction}" \
        --chromosome ${input_chroms[_index]} \
        --window ${window} \
        --MAC ${MAC} \
        --maf-threshold ${maf_threshold} \
        --pvalue-cutoff "${pvalue_cutoff}" \
        --qvalue-cutoff "${qvalue_cutoff}" \
        --permutation ${permutation} \
        ${"--skip-nominal-if-exist" if skip_nominal_if_exist else ""} \
        ${"--covariate-pattern " + " ".join([str(x) for x in covariate_pattern]) if len(covariate_pattern) else ""} \
        ${"--region-list " + str(region_list) + " --region-list-phenotype-column " + str(region_list_phenotype_column) if region_list.is_file() else ""} \
        ${"--keep-sample " + str(keep_sample) if keep_sample.is_file() else ""} \
        ${"--customized-cis-windows " + str(customized_cis_windows) if customized_cis_windows.is_file() else ""} \
        ${"--phenotype-group " + str(phenotype_group) if phenotype_group.is_file() else ""} \
        --skip-postprocess \
        --numThreads ${numThreads}

[cis_2]
done_if("regional" not in _input.labels)
input: group_by = "all"
output_file_prefix = name if len(_input["nominal"]) > 1 else f'{_input["nominal"][0]:bnnnn}'
output: f'{cwd}/{output_file_prefix}.cis_qtl_regional_significance.tsv.gz',
        f'{cwd}/{output_file_prefix}.cis_qtl_regional_significance.summary.txt'
input_files = [str(x) for x in _input["regional"]]
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, entrypoint = entrypoint
    python3 ${modular_script_dir}/association_scan/TensorQTL/TensorQTL.py \
        --step cis_postprocess \
        --cwd "${cwd}" \
        --output-tsv "${_output[0]}" \
        --output-summary "${_output[1]}" \
        --regional-files ${" ".join(input_files)} \
        --numThreads ${numThreads}

[trans]

parameter: batch_size = 10000
parameter: pval_threshold = 1.0
# Permutation testing is incorrect when the analysis is done by chrom
parameter: permutation = False
parameter: pval = 0.0

input: input_files, group_by = len(input_files[0]), group_with = "input_chroms"
output: nominal = f'{cwd:a}/{_input[0]:bnn}{"_chr%s" % input_chroms[_index] if input_chroms[_index] != 0 else ""}{"_geno_chr%s" % trans_geno_chromosome if trans_geno_chromosome else ""}.trans_qtl{"_p_%.0e" % pval if pval > 0.0 else ""}.pairs.tsv.gz'

trans_genotype_file = genotype_file if trans_geno_chromosome else _input[1]
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, entrypoint = entrypoint
    python3 ${modular_script_dir}/association_scan/TensorQTL/TensorQTL.py \
        --step trans \
        --cwd "${cwd}" \
        --genotype-file "${trans_genotype_file}" \
        --phenotype-file "${_input[0]}" \
        --covariate-file "${covariate_file}" \
        --output "${_output['nominal']}" \
        --trans-geno-chromosome "${trans_geno_chromosome}" \
        --window ${window} \
        --MAC ${MAC} \
        --maf-threshold ${maf_threshold} \
        --batch-size ${batch_size} \
        --pval-threshold ${pval_threshold} \
        --pval ${pval} \
        --pvalue-cutoff "${pvalue_cutoff}" \
        --qvalue-cutoff "${qvalue_cutoff}" \
        ${"--covariate-pattern " + " ".join([str(x) for x in covariate_pattern]) if len(covariate_pattern) else ""} \
        ${"--region-list " + str(region_list) + " --region-list-phenotype-column " + str(region_list_phenotype_column) if region_list.is_file() else ""} \
        ${"--keep-sample " + str(keep_sample) if keep_sample.is_file() else ""} \
        ${"--customized-cis-windows " + str(customized_cis_windows) if customized_cis_windows.is_file() else ""} \
        --numThreads ${numThreads}

Anticipated Results#

The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.

QTL Association Testing (TensorQTL)

Contents

QTL Association Testing (TensorQTL)#

Description#

Input Files#

Example genotype file#

Example covariates file#

Example customized cis window file (Optional)#

Example interaction file#

Output Files#

Steps#

cis-QTL association#

trans-QTL association#

Output#

Nominal Association Results#

Multiple Testing Corrected Results:#

Interaction Association Results#

Multiple Testing Corrected Results:#

Region (Gene) Level Association Evidence#

Multiple Testing Corrected Results:#

Command interface#

Pipeline Code#

Anticipated Results#