QTL Association Testing (TensorQTL)#
This notebook performs cis- and trans-QTL association testing using TensorQTL. For each molecular phenotype it tests association against nearby (cis) or genome-wide (trans) genetic variants and reports nominal and region-level statistics.
Description#
Two workflows are available:
cis: association between each molecular phenotype and variants within a window around it (default 1 Mb around the TSS). Produces nominal pair statistics plus region (gene) level significance.trans: association between phenotypes and variants on a chosen genotype chromosome, optionally restricted to a region list.
Inputs are the by-chromosome genotype/phenotype file lists and the covariate matrix produced by the genotype, phenotype, and covariate preprocessing steps.
Timing: cis ~1 min; trans ~1.5 min on the toy dataset.
Input Files#
Parameter |
Toy file |
Description |
|---|---|---|
|
|
Two-column list (chrom id, PLINK bed prefix) of by-chromosome genotypes |
|
|
List of bgzipped+tabixed molecular phenotype |
|
|
Covariate matrix (known covariates + hidden factors) |
|
|
(trans, optional) subset of regions/genes to analyze; gene name in column 4 |
# Load required libraries
library(data.table)
library(genio)
# ------------------------------------------------------------------
# Example phenotype file (MWE: protocol_example, chr22)
# ------------------------------------------------------------------
# A text file listing one phenotype .bed.gz per chromosome.
pheno_path <- fread("output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.txt")
head(pheno_path)
# One chromosome of phenotype data in BED-like format (.bed.gz).
pheno <- fread("output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.chr22.bed.gz")
pheno[1:5, 1:8]
| #id | #dir |
|---|---|
| <int> | <chr> |
| 22 | output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.chr22.bed.gz |
| #chr | start | end | ID | SAMPLE_001 | SAMPLE_002 | SAMPLE_003 | SAMPLE_004 |
|---|---|---|---|---|---|---|---|
| <chr> | <int> | <int> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| chr22 | 10939387 | 10961337 | ENSG00000283047 | 37.076840 | 0.00000 | 5.583274 | 0.0000000 |
| chr22 | 15528191 | 15529138 | ENSG00000130538 | 0.000000 | 1.08358 | 8.913977 | 4.0885045 |
| chr22 | 15611758 | 15613095 | ENSG00000231565 | 5.978084 | 0.00000 | 2.637113 | 10.4666167 |
| chr22 | 15615401 | 15615577 | ENSG00000226474 | 0.000000 | 0.00000 | 2.870204 | 0.0000000 |
| chr22 | 15749155 | 15750824 | ENSG00000235992 | 5.820253 | 0.00000 | 0.000000 | 0.5868365 |
Example genotype file#
# ------------------------------------------------------------------
# Example genotype file (MWE: protocol_example)
# ------------------------------------------------------------------
# A text file with two columns: chromosome ID and PLINK prefix path.
geno_path <- fread(file.path("output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.genotype_by_chrom_files.txt"))
head(geno_path)
# Read the merged PLINK set (.bed/.bim/.fam) for the toy sample.
plink_data <- read_plink("output/plink/protocol_example.genotype.merged.plink_qc")
genotypes <- plink_data$X
genotypes[1:5, 1:5]
| #id | #path |
|---|---|
| <int> | <chr> |
| 2 | /restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.2.bed |
| 6 | /restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.6.bed |
| 13 | /restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.13.bed |
| 8 | /restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.8.bed |
| 7 | /restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.7.bed |
| 11 | /restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example/output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.11.bed |
Reading: output/plink/protocol_example.genotype.merged.plink_qc.bim
Reading: output/plink/protocol_example.genotype.merged.plink_qc.fam
Reading: output/plink/protocol_example.genotype.merged.plink_qc.bed
| SAMPLE_001 | SAMPLE_002 | SAMPLE_003 | SAMPLE_004 | SAMPLE_005 | |
|---|---|---|---|---|---|
| chr1:113886_CTCTT_C | 0 | 0 | 0 | 0 | 0 |
| chr1:191223_C_* | 0 | 0 | 0 | 0 | 0 |
| chr1:191223_C_G | 0 | 0 | 0 | 0 | 0 |
| chr1:191223_C_T | 0 | 0 | 0 | 0 | 0 |
| chr1:275436_A_G | 0 | 0 | 0 | 0 | 0 |
Example covariates file#
# ------------------------------------------------------------------
# Example covariates file (MWE: protocol_example)
# ------------------------------------------------------------------
cov <- fread("output/covariate/protocol_example.rnaseq.bed.protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz")
cov[, 1:5]
dim(cov)
| #id | SAMPLE_001 | SAMPLE_002 | SAMPLE_003 | SAMPLE_004 |
|---|---|---|---|---|
| <chr> | <dbl> | <dbl> | <dbl> | <dbl> |
| msex | 1.000000000 | 1.00000000 | 1.00000000 | 1.000000000 |
| age_death | 90.970000000 | 80.24000000 | 83.90000000 | 74.100000000 |
| pmi | 10.570000000 | 7.87000000 | 9.93000000 | 2.910000000 |
| study | 1.000000000 | 0.00000000 | 0.00000000 | 0.000000000 |
| PC1 | -0.233468563 | 0.01961037 | 0.11352412 | -0.040027540 |
| PC2 | 0.327658887 | 0.04553162 | -0.09470268 | -0.146030127 |
| PC3 | 0.098807354 | 0.16302273 | 0.04634796 | 0.026222852 |
| PC4 | -0.330900271 | 0.14254433 | 0.20275994 | -0.001605085 |
| PC5 | -0.310150069 | 0.10916400 | -0.18238419 | -0.120223005 |
| PC6 | -0.218843913 | -0.16840993 | -0.05711533 | -0.221893629 |
| PC7 | -0.040665022 | 0.07576872 | 0.14606728 | 0.136076636 |
| PC8 | -0.286365603 | 0.14011839 | -0.08323989 | 0.040693345 |
| PC9 | 0.078883981 | 0.14174392 | 0.13033626 | 0.332998336 |
| PC10 | -0.151439052 | -0.08918759 | 0.25286117 | -0.152520170 |
| PC11 | 0.118499177 | -0.05618921 | -0.18365505 | -0.038165749 |
| PC12 | 0.053282287 | -0.07951395 | 0.05349163 | -0.081934005 |
| PC13 | 0.115028477 | -0.01330522 | -0.01534750 | -0.070994143 |
| PC14 | 0.087845445 | -0.26853232 | -0.15027729 | 0.218599476 |
| PC15 | 0.001030374 | 0.07595356 | -0.08898327 | -0.181299348 |
| Hidden_Factor_PC1 | 0.851879348 | 0.98406483 | 5.18812380 | -5.667832700 |
| Hidden_Factor_PC2 | -2.825894932 | 1.69447908 | 0.06854215 | -4.619987817 |
| Hidden_Factor_PC3 | 3.763782166 | -0.80512535 | -5.77247349 | -3.825334825 |
| Hidden_Factor_PC4 | -1.301340188 | 3.15633800 | 2.42980902 | -3.266057955 |
| Hidden_Factor_PC5 | 0.556485272 | -1.26228531 | -3.94541314 | 2.110667490 |
| Hidden_Factor_PC6 | -4.493764002 | -0.11490919 | -2.51105454 | 0.278112795 |
| Hidden_Factor_PC7 | 2.846210676 | -0.42143392 | 0.34488558 | -3.151865035 |
| Hidden_Factor_PC8 | -1.876624232 | -9.65305786 | 4.25303685 | -1.043104423 |
| Hidden_Factor_PC9 | 3.126759781 | 0.48805759 | 5.52493678 | 1.428784448 |
- 28
- 61
Example customized cis window file (Optional)#
# ------------------------------------------------------------------
# Example customized cis-window file (Optional)
# ------------------------------------------------------------------
# Gene-level cis-window (start/end) defining the SNP search region per gene.
# Omit to use --window for a symmetric +/-1Mb window around each gene TSS.
window <- fread("input/reference_data/TAD/TADB_enhanced_cis.bed")
head(window)
| #chr | start | end | gene_id |
|---|---|---|---|
| <chr> | <int> | <int> | <chr> |
| chr1 | 0 | 6480000 | ENSG00000008128 |
| chr1 | 0 | 6480000 | ENSG00000008130 |
| chr1 | 0 | 6480000 | ENSG00000067606 |
| chr1 | 0 | 7101193 | ENSG00000069424 |
| chr1 | 0 | 7960000 | ENSG00000069812 |
| chr1 | 0 | 6480000 | ENSG00000078369 |
Example interaction file#
# ------------------------------------------------------------------
# Example interaction file (Optional, for iQTL analysis)
# ------------------------------------------------------------------
# Provide an interaction term as a separate file via --interaction
# when it is not already part of the covariate file.
int <- fread("input/ROSMAP_interaction_example.tsv")
head(int)
Error in fread("input/ROSMAP_interaction_example.tsv"): File 'input/ROSMAP_interaction_example.tsv' does not exist or is non-readable. getwd()=='/mnt/lustre/lab/gwang/home/rl3328/protocol_testing/xqtl-protocol_archived'
Traceback:
1. stopf("File '%s' does not exist or is non-readable. getwd()=='%s'",
. file, getwd())
2. raise_condition(stop, gettextf(fmt, ..., domain = domain), c(class,
. "simpleError", "error", "condition"))
3. signal(obj)
Output Files#
cis (output/tensorqtl_cis/):
File |
Description |
|---|---|
|
Nominal variant-phenotype association statistics |
|
Region (gene) level significance |
|
Summary of regional results |
|
Per-chromosome nominal results (parquet) |
trans (output/tensorqtl_trans/):
File |
Description |
|---|---|
|
Trans variant-phenotype association statistics |
Steps#
The commands below run on the included toy data and write results under --cwd. The genotype, phenotype, and covariate inputs are produced by the preprocessing notebooks.
cis-QTL association#
Test each molecular phenotype against variants within the cis window. Matching chromosomes between the genotype and phenotype lists are analyzed (chr22 in the toy data). --MAC 5 sets the minor-allele-count cutoff for the small toy sample.
sos run pipeline/TensorQTL.ipynb cis \
--genotype-file output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.genotype_by_chrom_files.txt \
--phenotype-file output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.txt \
--covariate-file output/covariate/protocol_example.rnaseq.bed.protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \
--cwd output/tensorqtl_cis --name protocol_example --MAC 5 --numThreads 2
trans-QTL association#
For trans-analysis:
Computational strategy designed for trans analysis:
Trans analysis faces significant memory challenges as we calculate all associations between all molecular traits × all genetic variants across the genome, creating a massive computational burden. To address this challenge, we implement a two-stage chromosome-based parallelization approach:
Stage 1 (trans_1): Chromosome-based parallelization
Phenotype data is processed per chromosome (e.g., 22 separate jobs for autosomes)
For each phenotype chromosome, we test associations against variants from all 22 chromosomes
This creates phenotype_chr × genotype_chr combinations (e.g., phenotype chr1 vs genotype chr1-22); Garbage was collected between each chromosome combination caculation to release memory
Results are combined across all chromosome combinations and saved as compressed files
Stage 2 (trans_2): Significance filtering
Supports p-value cutoffs (
--pvalue-cutoff) or q-value cutoffs (--qvalue-cutoff)
Test phenotypes against variants on a chosen genotype chromosome (--trans-geno-chromosome 22), restricted to the genes in --region-list. Trans analysis is memory-heavy genome-wide; here it is scoped to the toy chromosome.
sos run pipeline/TensorQTL.ipynb trans \
--genotype-file output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.genotype_by_chrom_files.txt \
--phenotype-file output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.txt \
--covariate-file output/covariate/protocol_example.rnaseq.bed.protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \
--cwd output/tensorqtl_trans --name protocol_example --MAC 5 --numThreads 2 \
--trans-geno-chromosome 22 --region-list data/combined_AD_genes.csv --region-list-phenotype-column 4
Output#
For each chromosome, several summary statistics files are generated, including both nominal test statistics for each test and region (gene) level association evidence.
Nominal Association Results#
The columns of the nominal association result are as follows:
chrom: Variant chromosome.
pos: Variant chromosomal position (basepairs).
molecular_trait_id: Molecular trait identifier (gene).
variant_id: ID of the variant (rsid or chr:position:ref:alt).
tss_distance: Distance of the SNP to the gene transcription start site (TSS).
tes_distance: Distance of the SNP to the gene transcription end site (TES).
cis_window_start_distance: Distance of the SNP to the start of the cis window (if using a customized cis window).
cis_window_end_distance: Distance of the SNP to the end of the cis window (if using a customized cis window).
af: The allele frequency of this SNP.
ma_samples: Number of samples carrying the minor allele.
ma_count: Total number of minor alleles across individuals.
pvalue: Nominal P-value from linear regression.
bhat: Slope of the linear regression.
sebhat: Standard error of bhat.
n: Number of phenotypes after basic QC.
Multiple Testing Corrected Results:#
qvalue: Calculated q-value for each SNP (grouped by gene).
Interaction Association Results#
The columns of interaction association results are as follows (FIXME):
Model:
$\(
\text{phenotype} = \beta_0 + \beta_1 \cdot \text{snp} + \beta_2 \cdot \text{msex} + \beta_3 \cdot (\text{snp} \times \text{msex}) + \epsilon
\)$
(Taking msex as the interaction factor)
chrom: Chromosome number.
pos: Variant chromosomal position (basepairs).
a2: Variant reference allele (A, C, T, or G).
a1: Variant alternate allele.
molecular_trait_id: Molecular trait identifier, varies from phenotypes to phenotypes.
variant_id: ID of the top variant (rsid or chr:position:ref:alt).
af: Alternative allele frequency in the MiGA cohort.
ma_samples: Number of samples carrying the minor allele.
ma_count: Total number of minor alleles across individuals.
pvalue: P-value of the main effect from the nonlinear regression.
bhat: Slope of the main effect from the nonlinear regression.
se: Standard error of beta.
pvalue_msex: P-value of the msex term from the nonlinear regression.
bhat_msex: Slope of the msex term from the nonlinear regression.
se_msex: Standard error of bhat_msex.
pvalue_msex_interaction: P-value of the interaction term from the nonlinear regression.
bhat_msex_interaction: Slope of the interaction term from the nonlinear regression.
se_msex_interaction: Standard error of beta_msex_interaction.
molecular_trait_object_id: An intermediate ID (can be ignored).
n: Number of samples.
Multiple Testing Corrected Results:#
qvalue_main: The q-value of the main effect.
qvalue_interaction: The q-value of the interaction effect.
Region (Gene) Level Association Evidence#
The column specifications for region-level association evidence are as follows:
chrom: Chromosome number.
pos: Variant chromosomal position (basepairs).
n_variant: Total number of variants tested in cis.
beta_shape1: First parameter value of the fitted beta distribution.
beta_shape2: Second parameter value of the fitted beta distribution.
true_df: Effective degrees of freedom of the beta distribution approximation.
p_true_df: Empirical P-value for the beta distribution approximation.
variant_id: ID of the top variant (rsid or chr:position:ref:alt).
tss_distance: Distance of the SNP to the gene transcription start site (TSS).
tes_distance: Distance of the SNP to the gene transcription end site (TES).
ma_samples: Number of samples carrying the minor allele.
ma_count: Total number of minor alleles across individuals.
af: Alternative allele frequency.
p_nominal: Nominal P-value from linear regression.
bhat: Slope of the linear regression.
sehat: Standard error of the bhat.
p_perm: First permutation P-value directly obtained from the permutations with the direct method.
p_beta: Second permutation P-value obtained via beta approximation (this is the one to use for downstream analysis).
molecular_trait_object_id: Molecular trait identifier (gene).
n_traits: Group size in the permutation test.
genomic_inflation: Genomic inflation factor (lambda), quantifying the extent of bulk inflation and the excess false positive rate.
Multiple Testing Corrected Results:#
q_beta: Q-value for p_beta using Storey’s method (qvalue), more conservative than FDR.
q_perm: Q-value for p_perm using Storey’s method (qvalue), more conservative than FDR.
fdr_beta: Adjusted P-value for p_beta using the Benjamini-Hochberg method (FDR).
fdr_perm: Adjusted P-value for p_perm using the Benjamini-Hochberg method (FDR).
p_nominal_threshold: Nominal p-value threshold for variants in the corresponding molecular trait, derived from empirical beta distribution as a result of permutation testing.
Command interface#
List all workflows and options:
sos run pipeline/TensorQTL.ipynb -h
Pipeline Code#
The SoS workflow definitions below are unchanged from the original protocol.
[global]
parameter: modular_script_dir = path('code/script') # override with --modular-script-dir
# Path to the work directory of the analysis.
parameter: cwd = path('output')
# Phenotype file, or a list of phenotype per region.
parameter: phenotype_file = path
# A genotype file in PLINK binary format (bed/bam/fam) format, or a list of genotype per chrom
parameter: genotype_file = path
# Covariate file
parameter: covariate_file = path
# Optional pattern to filter covariates (list of covariate prefixes or exact names)
parameter: covariate_pattern = []
# Prefix for the analysis output
parameter: name = ""
# An optional subset of regions of molecular features to analyze. The last column is the gene names
parameter: region_list = path()
parameter: region_list_phenotype_column = 4
# Set list of sample to be keep
parameter: keep_sample = path()
# FIXME: please document
parameter: interaction = ""
# An optional list documenting the custom cis window for each region to analyze, with four column, chr, start, end, region ID (eg gene ID).
# If this list is not provided, the default `window` parameter (see below) will be used.
parameter: customized_cis_windows = path()
# The phenotype group file to group molecule_trait into molecule_trait_object
# This applies to multiple molecular events in the same region, such as sQTL analysis.
parameter: phenotype_group = path()
# The name of phenotype corresponding to gene_id or gene_name in the region
parameter: chromosome = []
# Minor allele count cutoff
parameter: MAC = 0
# Specify genotype chromosome for trans analysis (e.g. --trans-geno-chromosome 5)
# When set, trans step loads this genotype chrom instead of the one from input_files
parameter: trans_geno_chromosome = ""
# Specify the cis window for the up and downstream radius to analyze around the region of interest in units of bp
# This parameter will be set to zero if `customized_cis_windows` is provided.
parameter: window = 1000000
# Number of threads
parameter: numThreads = 8
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
# Container option for software to run the analysis: docker or singularity
parameter: container = ''
parameter: entrypoint = ''
import re
# Use the header of the covariate file to decide the sample size
import pandas as pd
N = len(pd.read_csv(covariate_file, sep = "\t",nrows = 1).columns) - 1
# Minor allele frequency cutoff. It will overwrite minor allele cutoff.
# You may consider setting it to higher for interaction analysis if you have statistical power concerns
parameter: maf_threshold = MAC/(2.0*N)
# Filtering significant trans associations (for trans_2 workflow)
parameter: pvalue_cutoff = "5e-8"
parameter: qvalue_cutoff = ""
import os
import pandas as pd
def adapt_file_path(file_path, reference_file):
"""
Adapt a single file path based on its existence and a reference file's path.
Args:
- file_path (str): The file path to adapt.
- reference_file (str): File path to use as a reference for adaptation.
Returns:
- str: Adapted file path.
Raises:
- FileNotFoundError: If no valid file path is found.
"""
reference_path = os.path.dirname(reference_file)
# Check if the file exists
if os.path.isfile(file_path):
return file_path
# Check file name without path
file_name = os.path.basename(file_path)
if os.path.isfile(file_name):
return file_name
# Check file name in reference file's directory
file_in_ref_dir = os.path.join(reference_path, file_name)
if os.path.isfile(file_in_ref_dir):
return file_in_ref_dir
# Check original file path prefixed with reference file's directory
file_prefixed = os.path.join(reference_path, file_path)
if os.path.isfile(file_prefixed):
return file_prefixed
# If all checks fail, raise an error
raise FileNotFoundError(f"No valid path found for file: {file_path}")
def adapt_file_path_all(df, column_name, reference_file):
return df[column_name].apply(lambda x: adapt_file_path(x, reference_file))
if (str(genotype_file).endswith("bed") or str(genotype_file).endswith("pgen")) and str(phenotype_file).endswith("bed.gz"):
input_files = [[phenotype_file, genotype_file]]
if len(chromosome) > 0:
input_chroms = [int(x) for x in chromosome]
else:
input_chroms = [0]
else:
import pandas as pd
import os
molecular_pheno_files = pd.read_csv(phenotype_file, sep = "\t")
if "#dir" in molecular_pheno_files.columns and "#chr" not in molecular_pheno_files.columns:
molecular_pheno_files = molecular_pheno_files.rename(columns={"#dir": "path"})
if "#id" in molecular_pheno_files.columns:
molecular_pheno_files = molecular_pheno_files.rename(columns={"#id": "#chr"})
if "#chr" in molecular_pheno_files.columns:
molecular_pheno_files = molecular_pheno_files.groupby(['#chr','path']).size().reset_index(name='count').drop("count",axis = 1).rename(columns = {"#chr":"#id"})
genotype_files = pd.read_csv(genotype_file,sep = "\t")
genotype_files["#id"] = [x.replace("chr","") for x in genotype_files["#id"].astype(str)] # e.g. remove chr1 to 1
genotype_files["#path"] = genotype_files["#path"].apply(lambda x: adapt_file_path(x, genotype_file))
molecular_pheno_files["#id"] = [x.replace("chr","") for x in molecular_pheno_files["#id"].astype(str)]
input_files = molecular_pheno_files.merge(genotype_files, on = "#id")
# Only keep chromosome specified in --chromosome
if len(chromosome) > 0:
input_files = input_files[input_files['#id'].isin(chromosome)]
input_files = input_files.values.tolist()
input_chroms = [x[0] for x in input_files]
input_files = [x[1:] for x in input_files]
if len(name) == 0:
name = f'{path(input_files[0][0]):bnn}' if len(input_files) == 1 else f'{path(input_files[0][0]):bnnn}'
[cis_1]
# parse input file lists
# skip nominal association results if the files exists already
# This is false by default which means to recompute everything
# This is only relevant when the `parquet` files for nominal results exist but not the other files
# and you want to avoid computing the nominal results again
parameter: skip_nominal_if_exist = False
parameter: permutation = True
# Extract interaction name
var_interaction = interaction
if os.path.isfile(interaction):
interaction_s = pd.read_csv(interaction, sep='\t', index_col=0)
var_interaction = interaction_s.columns[0] # interaction name
test_regional_association = permutation and len(var_interaction) == 0
input: input_files, group_by = len(input_files[0]), group_with = "input_chroms"
output_files = dict([("parquet", f'{cwd:a}/{_input[0]:bnn}{"_%s" % var_interaction if interaction else ""}.cis_qtl_pairs.{"" if input_chroms[_index] == 0 else input_chroms[_index]}.parquet'), # This convention is necessary to match the pattern of map_norminal output
("nominal", f'{cwd:a}/{_input[0]:bnn}{"" if input_chroms[_index] == 0 else "_chr%s" % input_chroms[_index]}{"_%s" % var_interaction if interaction else ""}.cis_qtl.pairs.tsv.gz')])
if test_regional_association:
output_files["regional"] = f'{cwd:a}/{_input[0]:bnn}{"" if input_chroms[_index] == 0 else "_chr%s" % input_chroms[_index]}{"_%s" % var_interaction if interaction else ""}.cis_qtl.regional.tsv.gz'
output: output_files
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output["nominal"]:bnnn}'
bash: expand= "${ }", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container, entrypoint = entrypoint
python3 ${modular_script_dir}/association_scan/TensorQTL/TensorQTL.py \
--step cis \
--cwd "${cwd}" \
--genotype-file "${_input[1]}" \
--phenotype-file "${_input[0]}" \
--covariate-file "${covariate_file}" \
--interaction "${interaction}" \
--chromosome ${input_chroms[_index]} \
--window ${window} \
--MAC ${MAC} \
--maf-threshold ${maf_threshold} \
--pvalue-cutoff "${pvalue_cutoff}" \
--qvalue-cutoff "${qvalue_cutoff}" \
--permutation ${permutation} \
${"--skip-nominal-if-exist" if skip_nominal_if_exist else ""} \
${"--covariate-pattern " + " ".join([str(x) for x in covariate_pattern]) if len(covariate_pattern) else ""} \
${"--region-list " + str(region_list) + " --region-list-phenotype-column " + str(region_list_phenotype_column) if region_list.is_file() else ""} \
${"--keep-sample " + str(keep_sample) if keep_sample.is_file() else ""} \
${"--customized-cis-windows " + str(customized_cis_windows) if customized_cis_windows.is_file() else ""} \
${"--phenotype-group " + str(phenotype_group) if phenotype_group.is_file() else ""} \
--skip-postprocess \
--numThreads ${numThreads}
[cis_2]
done_if("regional" not in _input.labels)
input: group_by = "all"
output_file_prefix = name if len(_input["nominal"]) > 1 else f'{_input["nominal"][0]:bnnnn}'
output: f'{cwd}/{output_file_prefix}.cis_qtl_regional_significance.tsv.gz',
f'{cwd}/{output_file_prefix}.cis_qtl_regional_significance.summary.txt'
input_files = [str(x) for x in _input["regional"]]
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, entrypoint = entrypoint
python3 ${modular_script_dir}/association_scan/TensorQTL/TensorQTL.py \
--step cis_postprocess \
--cwd "${cwd}" \
--output-tsv "${_output[0]}" \
--output-summary "${_output[1]}" \
--regional-files ${" ".join(input_files)} \
--numThreads ${numThreads}
[trans]
parameter: batch_size = 10000
parameter: pval_threshold = 1.0
# Permutation testing is incorrect when the analysis is done by chrom
parameter: permutation = False
parameter: pval = 0.0
input: input_files, group_by = len(input_files[0]), group_with = "input_chroms"
output: nominal = f'{cwd:a}/{_input[0]:bnn}{"_chr%s" % input_chroms[_index] if input_chroms[_index] != 0 else ""}{"_geno_chr%s" % trans_geno_chromosome if trans_geno_chromosome else ""}.trans_qtl{"_p_%.0e" % pval if pval > 0.0 else ""}.pairs.tsv.gz'
trans_genotype_file = genotype_file if trans_geno_chromosome else _input[1]
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, entrypoint = entrypoint
python3 ${modular_script_dir}/association_scan/TensorQTL/TensorQTL.py \
--step trans \
--cwd "${cwd}" \
--genotype-file "${trans_genotype_file}" \
--phenotype-file "${_input[0]}" \
--covariate-file "${covariate_file}" \
--output "${_output['nominal']}" \
--trans-geno-chromosome "${trans_geno_chromosome}" \
--window ${window} \
--MAC ${MAC} \
--maf-threshold ${maf_threshold} \
--batch-size ${batch_size} \
--pval-threshold ${pval_threshold} \
--pval ${pval} \
--pvalue-cutoff "${pvalue_cutoff}" \
--qvalue-cutoff "${qvalue_cutoff}" \
${"--covariate-pattern " + " ".join([str(x) for x in covariate_pattern]) if len(covariate_pattern) else ""} \
${"--region-list " + str(region_list) + " --region-list-phenotype-column " + str(region_list_phenotype_column) if region_list.is_file() else ""} \
${"--keep-sample " + str(keep_sample) if keep_sample.is_file() else ""} \
${"--customized-cis-windows " + str(customized_cis_windows) if customized_cis_windows.is_file() else ""} \
--numThreads ${numThreads}
Anticipated Results#
The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.