QTL Association Testing#
Description#
We perform QTL association testing using TensorQTL [cf. Taylor-Weiner et al (2019)].
Input#
List of molecular phenotype files: a list of
bed.gz
files containing the table for the molecular phenotype. It should have a companion index file intbi
format. It is the output of gene_annotation or phenotype_by_chormList of genotypes in both PLINK binary format (bed/bim/fam) and PLINK 2 binary genotype table (pgen/pvar/psam) for each chromosome, previously processed through our genotype QC pipelines.
Covariate file, a file with #id + samples name as colnames and each row a covariate: fixed and known covariates as well as hidden covariates recovered from factor analysis.
Optionally, a list of traits (genes, regions of molecular features etc) to analyze.
Example phenotype list#
The header of the bed.gz is per the TensorQTL convention:
Phenotypes must be provided in BED format, sorted by chromosome and [start,end] position, with a single header line starting with # and the first four columns corresponding to: chr, start, end, phenotype_id, with the remaining columns corresponding to samples (the identifiers must match those in the genotype input). The BED file should specify the cis-window (usually the TSS), with start = the minimum start for each gene, end = the maximum end for each gene(extracted from phenotype referenced gtf file).
#chr start end ID path
chr1 631073 632616 ENSG00000237973 /home/rl3328/motor_data/pheno_data/Quad/phenotype_by_chrom/genes_logcpm.bed.chr1.bed.gz
chr1 633695 634376 ENSG00000248527 /home/rl3328/motor_data/pheno_data/Quad/phenotype_by_chrom/genes_logcpm.bed.chr1.bed.gz
chr1 825137 859446 ENSG00000228794 /home/rl3328/motor_data/pheno_data/Quad/phenotype_by_chrom/genes_logcpm.bed.chr1.bed.gz
chr1 944202 959309 ENSG00000188976 /home/rl3328/motor_data/pheno_data/Quad/phenotype_by_chrom/genes_logcpm.bed.chr1.bed.gz
chr1 975203 982093 ENSG00000187642 /home/rl3328/motor_data/pheno_data/Quad/phenotype_by_chrom/genes_logcpm.bed.chr1.bed.gz
chr1 1216908 1232067 ENSG00000078808 /home/rl3328/motor_data/pheno_data/Quad/phenotype_by_chrom/genes_logcpm.bed.chr1.bed.gz
For cis-analysis:
Optionally, a list of genomic regions associate with each molecular features to analyze. The default cis-analysis will use a window around TSS. This can be customized to take given start and end genomic coordinates. we currently suggest using 1Mb window around a gene because longer customized cis-windows (such as extending by TAD) does not yield significant improvements.
Output#
For each chromosome, several summary statistics files are generated, including both nominal test statistics for each test and region (gene) level association evidence.
Nominal Association Results#
The columns of the nominal association result are as follows:
chrom: Variant chromosome.
pos: Variant chromosomal position (basepairs).
molecular_trait_id: Molecular trait identifier (gene).
variant_id: ID of the variant (rsid or chr:position:ref:alt).
tss_distance: Distance of the SNP to the gene transcription start site (TSS).
tes_distance: Distance of the SNP to the gene transcription end site (TES).
cis_window_start_distance: Distance of the SNP to the start of the cis window (if using a customized cis window).
cis_window_end_distance: Distance of the SNP to the end of the cis window (if using a customized cis window).
af: The allele frequency of this SNP.
ma_samples: Number of samples carrying the minor allele.
ma_count: Total number of minor alleles across individuals.
pvalue: Nominal P-value from linear regression.
bhat: Slope of the linear regression.
sebhat: Standard error of bhat.
n: Number of phenotypes after basic QC.
Multiple Testing Corrected Results:#
qvalue: Calculated q-value for each SNP (grouped by gene).
Interaction Association Results#
The columns of interaction association results are as follows (FIXME):
Model:
$\(
\text{phenotype} = \beta_0 + \beta_1 \cdot \text{snp} + \beta_2 \cdot \text{msex} + \beta_3 \cdot (\text{snp} \times \text{msex}) + \epsilon
\)$
(Taking msex as the interaction factor)
chrom: Chromosome number.
pos: Variant chromosomal position (basepairs).
a2: Variant reference allele (A, C, T, or G).
a1: Variant alternate allele.
molecular_trait_id: Molecular trait identifier, varies from phenotypes to phenotypes.
variant_id: ID of the top variant (rsid or chr:position:ref:alt).
af: Alternative allele frequency in the MiGA cohort.
ma_samples: Number of samples carrying the minor allele.
ma_count: Total number of minor alleles across individuals.
pvalue: P-value of the main effect from the nonlinear regression.
bhat: Slope of the main effect from the nonlinear regression.
se: Standard error of beta.
pvalue_msex: P-value of the msex term from the nonlinear regression.
bhat_msex: Slope of the msex term from the nonlinear regression.
se_msex: Standard error of bhat_msex.
pvalue_msex_interaction: P-value of the interaction term from the nonlinear regression.
bhat_msex_interaction: Slope of the interaction term from the nonlinear regression.
se_msex_interaction: Standard error of beta_msex_interaction.
molecular_trait_object_id: An intermediate ID (can be ignored).
n: Number of samples.
Multiple Testing Corrected Results:#
qvalue_main: The q-value of the main effect.
qvalue_interaction: The q-value of the interaction effect.
Region (Gene) Level Association Evidence#
The column specifications for region-level association evidence are as follows:
chrom: Chromosome number.
pos: Variant chromosomal position (basepairs).
n_variant: Total number of variants tested in cis.
beta_shape1: First parameter value of the fitted beta distribution.
beta_shape2: Second parameter value of the fitted beta distribution.
true_df: Effective degrees of freedom of the beta distribution approximation.
p_true_df: Empirical P-value for the beta distribution approximation.
variant_id: ID of the top variant (rsid or chr:position:ref:alt).
tss_distance: Distance of the SNP to the gene transcription start site (TSS).
tes_distance: Distance of the SNP to the gene transcription end site (TES).
ma_samples: Number of samples carrying the minor allele.
ma_count: Total number of minor alleles across individuals.
af: Alternative allele frequency.
p_nominal: Nominal P-value from linear regression.
bhat: Slope of the linear regression.
sehat: Standard error of the bhat.
p_perm: First permutation P-value directly obtained from the permutations with the direct method.
p_beta: Second permutation P-value obtained via beta approximation (this is the one to use for downstream analysis).
molecular_trait_object_id: Molecular trait identifier (gene).
n_traits: Group size in the permutation test.
genomic_inflation: Genomic inflation factor (lambda), quantifying the extent of bulk inflation and the excess false positive rate.
Multiple Testing Corrected Results:#
q_beta: Q-value for p_beta using Storey’s method (qvalue), more conservative than FDR.
q_perm: Q-value for p_perm using Storey’s method (qvalue), more conservative than FDR.
fdr_beta: Adjusted P-value for p_beta using the Benjamini-Hochberg method (FDR).
fdr_perm: Adjusted P-value for p_perm using the Benjamini-Hochberg method (FDR).
p_nominal_threshold: Nominal p-value threshold for variants in the corresponding molecular trait, derived from empirical beta distribution as a result of permutation testing.
Minimal Working Example Steps#
The data can be found on Synapse.
i. Cis TensorQTL Command#
sos run pipeline/TensorQTL.ipynb cis \
--genotype-file output/genotype_by_chrom/protocol_example.genotype.chr21_22.genotype_by_chrom_files.txt \
--phenotype-file output/phenotype_by_chrom/protocol_example.protein.bed.phenotype_by_chrom_files.region_list.txt \
--covariate-file output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
--customized_cis_windows prototype_example/protocol_example/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
--cwd output/cis_association/ \
--MAC 5
ii. Trans TensorQTL Command#
Note: Some protein is not in the customized cis windows list. There we will need to remove them from the analysis by create a region_list. Noted that the region list need to be a actual file. So <()
file is not acceptable.
zcat output/protocol_example.protein.bed.gz | cut -f 1,2,3,4 | grep -v -e ENSG00000163554 \
-e ENSG00000171564 -e ENSG00000171560 -e ENSG00000171557 > output/protocol_example.protein.region_list
The following will take more than 180G of memory to run.
sos run xqtl-protocol/pipeline/TensorQTL.ipynb trans \
--genotype-file output/genotype_by_chrom/protocol_example.genotype.chr21_22.genotype_by_chrom_files.txt \
--phenotype-file output/phenotype_by_chrom/protocol_example.protein.bed.phenotype_by_chrom_files.region_list.txt \
--region-list output/protocol_example.protein.region_list \
--covariate-file output/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.unrelated.plink_qc.prune.pca.Marchenko_PC.gz \
--customized-cis-windows input/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
--cwd output/association/trans/ \
--MAC 5
Troubleshooting#
Step |
Substep |
Problem |
Possible Reason |
Solution |
---|---|---|---|---|
Command Interface#
sos run TensorQTL.ipynb -h
usage: sos run TensorQTL.ipynb [workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
cis
trans
Global Workflow Options:
--cwd output (as path)
Path to the work directory of the analysis.
--phenotype-file VAL (as path, required)
Phenotype file, or a list of phenotype per region.
--genotype-file VAL (as path, required)
A genotype file in PLINK binary format (bed/bam/fam)
format, or a list of genotype per chrom
--covariate-file VAL (as path, required)
Covariate file
--name f"{phenotype_file:bn}_{covariate_file:bn}"
Prefix for the analysis output
--region-list . (as path)
An optional subset of regions of molecular features to
analyze. The last column is the gene names
--region-list-phenotype-column 4 (as int)
--keep-sample . (as path)
Set list of sample to be keep
--interaction ''
FIXME: please document
--customized-cis-windows . (as path)
An optional list documenting the custom cis window for
each region to analyze, with four column, chr, start,
end, region ID (eg gene ID). If this list is not
provided, the default `window` parameter (see below)
will be used.
--phenotype-group . (as path)
The phenotype group file to group molecule_trait into
molecule_trait_object This applies to multiple molecular
events in the same region, such as sQTL analysis.
--chromosome (as list)
The name of phenotype corresponding to gene_id or
gene_name in the region
--MAC 0 (as int)
Minor allele count cutoff
--window 1000000 (as int)
Specify the cis window for the up and downstream radius
to analyze around the region of interest in units of bp
This parameter will be set to zero if
`customized_cis_windows` is provided.
--numThreads 8 (as int)
Number of threads
--job-size 1 (as int)
For cluster jobs, number commands to run per job
--walltime 12h
--mem 16G
--container ''
Container option for software to run the analysis:
docker or singularity
--entrypoint ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
--maf-threshold MAC/(2.0*N)
Minor allele frequency cutoff. It will overwrite minor
allele cutoff. You may consider setting it to higher for
interaction analysis if you have statistical power
concerns
Sections
cis_1:
Workflow Options:
--[no-]skip-nominal-if-exist (default to False)
parse input file lists skip nominal association results
if the files exists already This is false by default
which means to recompute everything This is only
relevant when the `parquet` files for nominal results
exist but not the other files and you want to avoid
computing the nominal results again
--[no-]permutation (default to True)
cis_2:
trans_1:
Workflow Options:
--batch-size 50000 (as int)
--pval-threshold 1.0 (as float)
--[no-]permutation (default to False)
Permutation testing is incorrect when the analysis is
done by chrom
Old Minimal working example#
An MWE is uploaded to google drive. The singularity image (sif) for running this MWE is uploaded to google drive
FIXME: need to update these links.
FIXME: Also need to update the example commands below using our new example dataset.
sos run pipeline/TensorQTL.ipynb cis \
--genotype-file plink_files_list.txt \
--phenotype-file MWE.bed.recipe \
--covariate-file ALL.covariate.pca.BiCV.cov.gz \
--cwd ./output/ \
--MAC 5
sos run pipeline/TensorQTL.ipynb trans \
--genotype-file plink_files_list.txt \
--phenotype-file MWE.bed.recipe \
--covariate-file ALL.covariate.pca.BiCV.cov.gz \
--cwd ./output/ \
--MAC 5 --region-name gene_name
sos run pipeline/TensorQTL.ipynb trans \
--genotype-file plink_files_list.txt \
--phenotype-file MWE.bed.recipe \
--covariate-file /mnt/vast/hpc/csg/snuc_pseudo_bulk/eight_tissue_analysis/output/data_preprocessing/ALL/covariates/ALL.log2cpm.ALL.covariate.pca.resid.PEER.cov.gz \
--cwd ./output/trans_tensorQTL/ \
--region-list /mnt/vast/hpc/csg/snuc_pseudo_bulk/eight_tissue_analysis/reference_data/AD_genes.region_list \
--customized-cis-windows TADB_enhanced_cis.bed \
--MAC 5 --region-name ID
Setup and global parameters#
[global]
# Path to the work directory of the analysis.
parameter: cwd = path('output')
# Phenotype file, or a list of phenotype per region.
parameter: phenotype_file = path
# A genotype file in PLINK binary format (bed/bam/fam) format, or a list of genotype per chrom
parameter: genotype_file = path
# Covariate file
parameter: covariate_file = path
# Prefix for the analysis output
parameter: name = ""
# An optional subset of regions of molecular features to analyze. The last column is the gene names
parameter: region_list = path()
parameter: region_list_phenotype_column = 4
# Set list of sample to be keep
parameter: keep_sample = path()
# FIXME: please document
parameter: interaction = ""
# An optional list documenting the custom cis window for each region to analyze, with four column, chr, start, end, region ID (eg gene ID).
# If this list is not provided, the default `window` parameter (see below) will be used.
parameter: customized_cis_windows = path()
# The phenotype group file to group molecule_trait into molecule_trait_object
# This applies to multiple molecular events in the same region, such as sQTL analysis.
parameter: phenotype_group = path()
# The name of phenotype corresponding to gene_id or gene_name in the region
parameter: chromosome = []
# Minor allele count cutoff
parameter: MAC = 0
# Specify the cis window for the up and downstream radius to analyze around the region of interest in units of bp
# This parameter will be set to zero if `customized_cis_windows` is provided.
parameter: window = 1000000
# Number of threads
parameter: numThreads = 8
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
# Container option for software to run the analysis: docker or singularity
parameter: container = ''
import re
# Use the header of the covariate file to decide the sample size
import pandas as pd
N = len(pd.read_csv(covariate_file, sep = "\t",nrows = 1).columns) - 1
# Minor allele frequency cutoff. It will overwrite minor allele cutoff.
# You may consider setting it to higher for interaction analysis if you have statistical power concerns
parameter: maf_threshold = MAC/(2.0*N)
# Filtering significant trans associations (for trans_2 workflow)
parameter: pvalue_cutoff = ""
parameter: qvalue_cutoff = "0.01" # default cutoff is 0.01
import os
import pandas as pd
def adapt_file_path(file_path, reference_file):
"""
Adapt a single file path based on its existence and a reference file's path.
Args:
- file_path (str): The file path to adapt.
- reference_file (str): File path to use as a reference for adaptation.
Returns:
- str: Adapted file path.
Raises:
- FileNotFoundError: If no valid file path is found.
"""
reference_path = os.path.dirname(reference_file)
# Check if the file exists
if os.path.isfile(file_path):
return file_path
# Check file name without path
file_name = os.path.basename(file_path)
if os.path.isfile(file_name):
return file_name
# Check file name in reference file's directory
file_in_ref_dir = os.path.join(reference_path, file_name)
if os.path.isfile(file_in_ref_dir):
return file_in_ref_dir
# Check original file path prefixed with reference file's directory
file_prefixed = os.path.join(reference_path, file_path)
if os.path.isfile(file_prefixed):
return file_prefixed
# If all checks fail, raise an error
raise FileNotFoundError(f"No valid path found for file: {file_path}")
def adapt_file_path_all(df, column_name, reference_file):
return df[column_name].apply(lambda x: adapt_file_path(x, reference_file))
if (str(genotype_file).endswith("bed") or str(genotype_file).endswith("pgen")) and str(phenotype_file).endswith("bed.gz"):
input_files = [[phenotype_file, genotype_file]]
if len(chromosome) > 0:
input_chroms = [int(x) for x in chromosome]
else:
input_chroms = [0]
else:
import pandas as pd
import os
molecular_pheno_files = pd.read_csv(phenotype_file, sep = "\t")
if "#chr" in molecular_pheno_files.columns:
molecular_pheno_files = molecular_pheno_files.groupby(['#chr','path']).size().reset_index(name='count').drop("count",axis = 1).rename(columns = {"#chr":"#id"})
genotype_files = pd.read_csv(genotype_file,sep = "\t")
genotype_files["#id"] = [x.replace("chr","") for x in genotype_files["#id"].astype(str)] # e.g. remove chr1 to 1
genotype_files["#path"] = genotype_files["#path"].apply(lambda x: adapt_file_path(x, genotype_file))
molecular_pheno_files["#id"] = [x.replace("chr","") for x in molecular_pheno_files["#id"].astype(str)]
input_files = molecular_pheno_files.merge(genotype_files, on = "#id")
# Only keep chromosome specified in --chromosome
if len(chromosome) > 0:
input_files = input_files[input_files['#id'].isin(chromosome)]
input_files = input_files.values.tolist()
input_chroms = [x[0] for x in input_files]
input_files = [x[1:] for x in input_files]
if len(name) == 0:
name = f'{path(input_files[0][0]):bnn}' if len(input_files) == 1 else f'{path(input_files[0][0]):bnnn}'
cis-xQTL association testing#
[cis_1]
# parse input file lists
# skip nominal association results if the files exists already
# This is false by default which means to recompute everything
# This is only relevant when the `parquet` files for nominal results exist but not the other files
# and you want to avoid computing the nominal results again
parameter: skip_nominal_if_exist = False
parameter: permutation = True
# Extract interaction name
var_interaction = interaction
if os.path.isfile(interaction):
interaction_s = pd.read_csv(interaction, sep='\t', index_col=0)
var_interaction = interaction_s.columns[0] # interaction name
test_regional_association = permutation and len(var_interaction) == 0
input: input_files, group_by = len(input_files[0]), group_with = "input_chroms"
output_files = dict([("parquet", f'{cwd:a}/{_input[0]:bnn}{"_%s" % var_interaction if interaction else ""}.cis_qtl_pairs.{"" if input_chroms[_index] == 0 else input_chroms[_index]}.parquet'), # This convention is necessary to match the pattern of map_norminal output
("nominal", f'{cwd:a}/{_input[0]:bnn}{"" if input_chroms[_index] == 0 else "_chr%s" % input_chroms[_index]}{"_%s" % var_interaction if interaction else ""}.cis_qtl.pairs.tsv.gz')])
if test_regional_association:
output_files["regional"] = f'{cwd:a}/{_input[0]:bnn}{"" if input_chroms[_index] == 0 else "_chr%s" % input_chroms[_index]}{"_%s" % var_interaction if interaction else ""}.cis_qtl.regional.tsv.gz'
output: output_files
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output["nominal"]:bnnn}'
python: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout' , container = container
import pandas as pd
import numpy as np
import os
import tensorqtl
from tensorqtl import genotypeio, cis
from scipy.stats import chi2
import multiprocessing as mp
from tqdm import tqdm
## Define paths
plink_prefix_path = $[_input[1]:nar]
expression_bed = $[_input[0]:ar]
covariates_file = "$[covariate_file:a]"
window = $[window]
interaction = "$[interaction]"
## Load Data
phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
phenotype_id = phenotype_pos_df.index.name
## Analyze only the regions listed
if $[region_list.is_file()]:
region = pd.read_csv("$[region_list:a]", comment="#", header=None, sep="\t" )
phenotype_column = 1 if len(region.columns) == 1 else $[region_list_phenotype_column]
keep_region = region.iloc[:,phenotype_column-1].to_list()
phenotype_df = phenotype_df[phenotype_df.index.isin(keep_region)]
phenotype_pos_df = phenotype_pos_df[phenotype_pos_df.index.isin(keep_region)]
covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T
genotype_df, variant_df = genotypeio.load_genotypes(plink_prefix_path, dosages = True)
## use custom sample list to subset the covariates data
if $[keep_sample.is_file()]:
sample_list = pd.read_csv("$[keep_sample:a]", comment="#", header=None, names=["sample_id"], sep="\t")
covariates_df.loc[sample_list.sample_id]
# Read interaction files or extract from covariates file.
var_interaction = interaction
interaction_s = []
if os.path.isfile(interaction):
# update var_interaction and interaction_s
interaction_s = pd.read_csv(interaction, sep='\t', index_col=0)
interaction_s = interaction_s[interaction_s.index.isin(covariates_df.index)]
var_interaction = interaction_s.columns[0] # interaction name
# check if the interaction term in interaction table is in covariates file, if yes and interaction_s not yet loaded then, extract it out from covariates file
if var_interaction in covariates_df.columns:
# only load from covariate if it has not been loaded yet
if len(interaction_s) == 0:
interaction_s = covariates_df[var_interaction].to_frame()
covariates_df = covariates_df.drop(columns=[var_interaction])
if len(interaction) and len(interaction_s) == 0:
raise ValueError(f"Cannot find interaction variable or file {interaction}")
# drop samples that with missing value in iteraction
if len(interaction_s):
interaction_s = interaction_s.dropna()
## Retaining only common samples
phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, covariates_df.index)]
phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, genotype_df.columns)]
if len(interaction_s):
phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, interaction_s.index)]
interaction_s = interaction_s[interaction_s.index.isin(phenotype_df.columns)]
interaction_s = interaction_s.loc[phenotype_df.columns]
covariates_df = covariates_df.transpose()[np.intersect1d(phenotype_df.columns, covariates_df.index)].transpose()
## To simplify things, there should really not be "chr" prefix
phenotype_pos_df.chr = phenotype_pos_df.chr.astype(str).str.replace("chr", "")
variant_df.chrom = variant_df.chrom.astype("str").str.replace("chr", "")
## use custom cis windows list
if $[customized_cis_windows.is_file()]:
cis_list = pd.read_csv("$[customized_cis_windows:a]", comment="#", header=None, names=["chr","start","end",phenotype_id], sep="\t")
cis_list.chr = cis_list.chr.astype(str).str.replace("chr", "") ## Again to simplify things for chr format concordance.
if cis_list[['chr', 'ID']].duplicated().sum() != 0: # if cis_list is not unique using identifier ['#chr', 'ID']
cis_list = cis_list.groupby(['ID', 'chr']).agg({ # use union start-end position and make cis_list unique
'start': 'min',
'end': 'max'
}).reset_index()
cis_list = cis_list[['chr', 'start', 'end', 'ID']]
#cis_list = cis_list.set_index('chr')
phenotype_pos_df = phenotype_pos_df.reset_index() #move the phenotype id index to a new column of the dataframe
phenotype_df = phenotype_df.reset_index()
# Ensure phenotype_pos_df is a subset of cis_list based on ['#chr', 'ID']
original_count = len(phenotype_pos_df)
phenotype_pos_df = phenotype_pos_df[phenotype_pos_df.set_index(['chr', 'ID']).index.isin(cis_list.set_index([cis_list.columns[0],cis_list.columns[3]]).index)]
phenotype_df = phenotype_df[phenotype_df.set_index('ID').index.isin(cis_list.set_index(cis_list.columns[3]).index)]
phenotype_df = phenotype_df.set_index('ID')
removed_count = original_count - len(phenotype_pos_df)
print(f"{removed_count} rows were removed from phenotype_pos_df")
# Merge the dataframes on 'chr' and 'ID', including 'start' and 'end'
phenotype_pos_df = phenotype_pos_df.merge(cis_list[['chr', 'ID', 'start', 'end']],
left_on = ["chr",phenotype_id],
right_on = [cis_list.columns[0],cis_list.columns[3]],
suffixes=('_pheno', ''))
# Function to decide whether to keep or rename columns:
# If 'start' and 'end' values are the same in both dataframes, we keep the original columns without suffixes;
# If they're different, we name columns from cis_list with '_cis' suffixes and keep the name still for columns from phentype_pos_df.
# #in some cases (gene expression for eQTLs) the phenotype_id may be in the cis_list file
def rename_if_different(row, col):
if row[f'{col}'] == row[f'{col}_pheno']:
return row[col]
else:
return pd.Series({f'{col}_pheno': row[f'{col}_pheno'], f'{col}_cis': row[col]})
# Apply the renaming logic
for col in ['start', 'end']:
if f'{col}_pheno' in phenotype_pos_df.columns:# this condition is to not execute when the original phenotype_pos_df don't have start, end but pos column.
temp = phenotype_pos_df.apply(lambda row: rename_if_different(row, col), axis=1)
phenotype_pos_df = phenotype_pos_df.drop(columns=[f'{col}_pheno']).join(temp)
print(f"Dropped columns due to value mismatch: {col}_pheno")
phenotype_pos_df = phenotype_pos_df.set_index(phenotype_id)[["chr","start","end"]] # The final phenotype_pos_df will have three columns(chr, start, end) and index is the phenotype ID
if len(phenotype_df.index) != len(phenotype_pos_df.index):
raise ValueError("cannot uniquely match all the phentoype data in the input to the customized cis windows provided")
window = 0 # In the updated tensorQTL, by default if there is a customized cis window, the actual cis window will be start - window & end + window, so it is necessary to change the window parameter to 0
## Read phenotype group if availble
if $[phenotype_group.is_file()]:
group_s = pd.read_csv($[phenotype_group:r], sep='\t', header=None, index_col=0).squeeze()
else:
group_s = None
## cis-QTL mapping: nominal associations for all variant-phenotype pairs
if not ($[skip_nominal_if_exist] and $[_output["parquet"].is_file()]):
if len(interaction_s):
cis.map_nominal(genotype_df, variant_df,
phenotype_df,
phenotype_pos_df,
$[_output["parquet"]:nnnr],
covariates_df=covariates_df,
interaction_df=interaction_s,
maf_threshold_interaction=$[maf_threshold],
window=window,
group_s=group_s,
run_eigenmt=True)
else:
cis.map_nominal(genotype_df, variant_df,
phenotype_df,
phenotype_pos_df,
$[_output["parquet"]:nnnr],
covariates_df=covariates_df,
window=window,
maf_threshold=$[maf_threshold],
run_eigenmt=$['False' if permutation else 'True'],
group_s=group_s)
## Load the parquet and save it as txt
pairs_df = pd.read_parquet($[_output["parquet"]:r])
## Remove rows whose 'pval_gi' is null for following t_pval_conversion
if len(interaction_s):
pairs_df = pairs_df.dropna(subset=['pval_gi'])
# print general information of parquet
print('Output Information:')
print("This is the file containing the immediate output of TensorQTL's map_nominal function ")
print(os.path.getsize($[_output["parquet"]:r]))
## Adds the group columns to pairs_df, if there is group_s use group_s, else use phenotype_id
if group_s is not None:
pairs_df = pairs_df.merge(pd.DataFrame( {"molecular_trait_object_id": group_s}),left_on = "phenotype_id", right_index = True)
else:
pairs_df["molecular_trait_object_id"] = pairs_df.phenotype_id
## if pos in phenotype_pos_df(start distance and end distance are the same),
## add the column 'end_distance' with the same value as 'start_distance' to avoid mismatch of column and names
if 'end_distance' not in pairs_df.columns:
# Get the position of 'start_distance' column
start_pos = pairs_df.columns.get_loc('start_distance')
# Create a new DataFrame with the same values as start_distance
new_df = pairs_df.copy()
# Insert end_distance column after start_distance
new_df.insert(start_pos + 1, 'end_distance', pairs_df['start_distance'])
pairs_df = new_df
# rename columns
column_map = {'phenotype_id': 'molecular_trait_id'}
if len(interaction_s):
# calculate genomic inflation factor lambda on interaction
lambda_col_interaction = pairs_df.groupby("molecular_trait_object_id").apply(lambda x: chi2.ppf(1. - np.median(x.pval_gi), 1)/chi2.ppf(0.5,1))
column_map.update({
'pval_g': 'pvalue', 'b_g': 'bhat', 'b_g_se': 'se',
'pval_i': f'pvalue_{var_interaction}',
'b_i': f'bhat_{var_interaction}',
'b_i_se': f'sebhat_{var_interaction}',
'pval_gi': f'pvalue_{var_interaction}_interaction',
'b_gi': f'bhat_{var_interaction}_interaction',
'b_gi_se': f'sebhat_{var_interaction}_interaction'
})
else:
column_map.update({'pval_nominal': 'pvalue', 'slope': 'bhat', 'slope_se': 'sebhat'})
if $[customized_cis_windows.is_file()]:
column_map.update({'start_distance': 'cis_window_start_distance', 'end_distance': 'cis_window_end_distance'})
else:
column_map.update({'start_distance': 'tss_distance', 'end_distance': 'tes_distance'})
pairs_df.rename(columns=column_map, inplace=True)
pairs_df["n"] = len(phenotype_df.columns.values)
pairs_df = variant_df.merge(pairs_df, right_on='variant_id', left_index=True)
pairs_df.rename(columns={'a1': 'a2', 'a0': 'a1'}, inplace=True)
# sort the table if chrom and pos is not in ascending order
if not all(pairs_df['pos'].iloc[i] <= pairs_df['pos'].iloc[i+1] for i in range(len(pairs_df)-1)):
pairs_df = pairs_df.sort_values(by=['chrom', 'pos'])
# save file
pairs_df.to_csv($[_output["nominal"]:nr], sep='\t', index = None)
# print general information of pairs_df
print('Output Information:')
print("Output Rows:", len(pairs_df))
print("Output Columns:", pairs_df.columns.tolist())
print("Output Preview:", pairs_df.iloc[1:5, 1:10])
if $[test_regional_association]:
# calculate genomic inflation factor lambda for main variant effect
lambda_col = pairs_df.groupby("molecular_trait_object_id").apply(lambda x: chi2.ppf(1. - np.median(x.pvalue), 1)/chi2.ppf(0.5,1))
cis_df = cis.map_cis(genotype_df,
variant_df,
phenotype_df,
phenotype_pos_df,
covariates_df=covariates_df,
seed=999,
window=window,
maf_threshold = $[maf_threshold],
group_s=group_s)
cis_df.index.name = "molecular_trait_id"
## Add groups columns for eQTL analysis
if "group_id" not in cis_df.columns:
cis_df["group_id"] = cis_df.index
cis_df["group_size"] = 1
cis_df.rename(columns={"group_id": "molecular_trait_object_id", "group_size": "n_traits",
'start_distance': 'tss_distance', 'end_distance': 'tes_distance',
"num_var": "n_variants", "pval_nominal": "p_nominal",
'slope': 'bhat', 'slope_se': 'sebhat',
"pval_true_df": "p_true_df", "pval_perm": "p_perm", "pval_beta": "p_beta"}, inplace = True)
cis_df = cis_df.assign(genomic_inflation = lambda dataframe : dataframe["molecular_trait_object_id"].map(lambda molecular_trait_object_id:lambda_col[molecular_trait_object_id]))
# merge cis_df with variant_df
cis_df = variant_df.merge(cis_df, right_on='variant_id', left_index=True)
cis_df.rename(columns={'a1': 'a2', 'a0': 'a1'}, inplace=True)
# sort the table if chrom and pos is not in ascending order
if not all(cis_df['pos'].iloc[i] <= cis_df['pos'].iloc[i+1] for i in range(len(cis_df)-1)):
cis_df = cis_df.sort_values(by=['chrom', 'pos'])
# save file
cis_df.to_csv(str($[_output["nominal"]:nnnr])+str('.regional.tsv'), sep='\t', index = None)
# print general information of cis_df
print('Output Information:')
print("Output Rows:", len(cis_df))
print("Output Columns:", cis_df.columns.tolist())
print("Output Preview:", cis_df.iloc[0:5, 0:10])
R: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container
library(purrr)
library(tidyr)
library(readr)
library(dplyr)
library(qvalue)
pairs_df = read_delim($[_output["nominal"]:nr,], delim = '\t')
compute_qvalues <- function(pvalues) {
tryCatch({
if(length(pvalues) < 2) {
return(pvalues)
} else {
return(qvalue(pvalues)$qvalues)
}
}, error = function(e) {
message("Too few p-values to calculate qvalue, fall back to BH")
qvalue(pvalues, pi0 = 1)$qvalues
})
}
var_interaction <- "$[interaction]"
# Check if 'interaction' is a file
if (file.exists(var_interaction)) {
# Read the file into 'interaction_s' dataframe
interaction_s <- read.delim(var_interaction, row.names = 1)
# Update 'var_interaction' to the first column name of 'interaction_s'
var_interaction <- names(interaction_s)[1]
}
if (is.null(var_interaction) || var_interaction == "") {
pairs_df = pairs_df %>% group_by(molecular_trait_id) %>% mutate(qvalue = compute_qvalues(pvalue))
} else {
pairs_df = pairs_df %>% group_by(molecular_trait_id) %>% mutate(qvalue_main = compute_qvalues(pvalue), qvalue_interaction = compute_qvalues($["pvalue_%s_interaction" % var_interaction]))
}
pairs_df %>% write_delim($[_output["nominal"]:nr],"\t")
bash: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container
bgzip --compress-level 9 $[_output["nominal"]:n]
tabix -S 1 -s 1 -b 2 -e 2 $[_output["nominal"]]
done_if(not test_regional_association)
bash: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container
bgzip --compress-level 9 $[_output["nominal"]:nnn].regional.tsv
tabix -S 1 -s 1 -b 2 -e 2 $[_output["nominal"]:nnn].regional.tsv.gz
[cis_2]
done_if("regional" not in _input.labels)
input: group_by = "all"
output_file_prefix = name if len(_input["nominal"]) > 1 else f'{_input["nominal"][0]:bnnnn}'
output: f'{cwd}/{output_file_prefix}.cis_qtl_regional_significance.tsv.gz',
f'{cwd}/{output_file_prefix}.cis_qtl_regional_significance.summary.txt'
input_files = [str(x) for x in _input["regional"]]
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
R: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
library("purrr")
library("tidyr")
library("dplyr")
library("readr")
library("qvalue")
emprical_pd = tibble(map(c($[_input["regional"]:r,]), ~read_delim(.x,"\t")))%>%unnest()
emprical_pd["q_beta"] = tryCatch(qvalue(emprical_pd$p_beta)$qvalue, error = function(e){print("Too few pvalue to calculate qvalue, fall back to BH")
qvalue(emprical_pd$p_beta,pi0 = 1 )$qvalue})
emprical_pd["q_perm"] = tryCatch(qvalue(emprical_pd$p_perm)$qvalue, error = function(e){print("Too few pvalue to calculate qvalue, fall back to BH")
qvalue(emprical_pd$p_perm,pi0 = 1 )$qvalue})
emprical_pd["fdr_beta"] = p.adjust(emprical_pd$p_beta,"fdr")
emprical_pd["fdr_perm"] = p.adjust(emprical_pd$p_perm,"fdr")
# Calculate the global nominal p-value threshold based on q_beta at FDR 0.05
if (!all(is.na(emprical_pd$p_beta))) {
lb <- emprical_pd %>%
filter(q_beta <= 0.05) %>%
pull(p_beta) %>%
sort()
ub <- emprical_pd %>%
filter(q_beta > 0.05) %>%
pull(p_beta) %>%
sort()
if (length(lb) > 0) {
lb_val <- tail(lb, 1)
threshold <- if (length(ub) > 0) (lb_val + head(ub, 1)) / 2 else lb_val
message(sprintf("min p-value threshold @ FDR 0.05: %g", threshold))
emprical_pd <- emprical_pd %>%
mutate(p_nominal_threshold = qbeta(threshold, beta_shape1, beta_shape2))
}
}
summary = tibble("fdr_perm_0.05" = sum(emprical_pd["fdr_perm"] < 0.05),
"fdr_beta_0.05" = sum(emprical_pd["fdr_beta"] < 0.05),
"q_perm_0.05" = sum(emprical_pd["q_perm"] < 0.05),
"q_beta_0.05" = sum(emprical_pd["q_beta"] < 0.05),
"fdr_perm_0.01" = sum(emprical_pd["fdr_perm"] < 0.01),
"fdr_beta_0.01" = sum(emprical_pd["fdr_beta"] < 0.01),
"q_perm_0.01" = sum(emprical_pd["q_perm"] < 0.01),
"q_beta_0.01" = sum(emprical_pd["q_beta"] < 0.01) )
emprical_pd%>%write_delim("$[_output[0]]","\t")
summary%>%write_delim("$[_output[1]]","\t")
Trans-xQTL association testing#
For transQTL analysis, if you output all the p-values for many genes (default setting) it is suggested to provide the largest memory and CPU threads available on a compute node. eg 250G and >32 threads.
[trans_1]
input: input_files, group_by = len(input_files[0]), group_with = "input_chroms"
output: nominal = f'{cwd:a}/{_input[0]:bnn}{"" if input_chroms[_index] == 0 else "_%s" % input_chroms[_index]}.trans_qtl.pairs.tsv.gz'
parameter: batch_size = 10000
parameter: pval_threshold = 1.0
# Permutation testing is incorrect when the analysis is done by chrom
parameter: permutation = False
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
python: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container =container
import pandas as pd
import numpy as np
import tensorqtl
from tensorqtl import genotypeio, trans
from scipy.stats import chi2
## Define paths
plink_prefix_path = $[_input[1]:nar]
expression_bed = $[_input[0]:ar]
covariates_file = "$[covariate_file:a]"
window = $[window]
## Loading Data
phenotype_df, phenotype_pos_df = tensorqtl.read_phenotype_bed(expression_bed)
phenotype_id = phenotype_pos_df.index.name
## Analyze only the regions listed
if $[region_list.is_file()]:
region = pd.read_csv("$[region_list:a]", comment="#", header=None,sep="\t")
phenotype_column = 1 if len(region.columns) == 1 else $[region_list_phenotype_column]
keep_region = region.iloc[:,phenotype_column-1].to_list()
phenotype_df = phenotype_df[phenotype_df.index.isin(keep_region)]
phenotype_pos_df = phenotype_pos_df[phenotype_pos_df.index.isin(keep_region)]
## use custom cis windows
if $[customized_cis_windows.is_file()]:
cis_list = pd.read_csv("$[customized_cis_windows:a]", comment="#", header=None, names=["chr","start","end",phenotype_id], sep="\t")
phenotype_pos_df_reset = phenotype_pos_df.reset_index() #move the phenotype id index to a new column of the dataframe
phenotype_pos_df = phenotype_pos_df_reset.merge(cis_list, left_on = ["chr",phenotype_id],right_on = [cis_list.columns[0],cis_list.columns[3]])#in some cases (gene expression for eQTLs) the phenotype_id may be in the cis_list file
if len(phenotype_df.index) - phenotype_pos_df_reset[~phenotype_pos_df_reset[phenotype_id].isin(cis_list[phenotype_id])].shape[0]!= len(phenotype_pos_df.index):
raise ValueError("cannot uniquely match all the phentoype data in the input to the customized cis windows provided")
phenotype_pos_df = phenotype_pos_df.set_index(phenotype_id)[["chr","start","end"]] # The final phenotype_pos_df will have three columns(chr, start, end) and index is the phenotype ID
window = 0 # In the updated tensorQTL, by default if there is a customized cis window, the actual cis window will be start - window & end + window, so it is necessary to change the window parameter to 0
## Retaining only traits in cis_list
if phenotype_pos_df_reset[~phenotype_pos_df_reset[phenotype_id].isin(cis_list[phenotype_id])].shape[0] != 0:
phenotype_df = phenotype_df.loc[phenotype_df.index.isin(cis_list[phenotype_id])]
covariates_df = pd.read_csv(covariates_file, sep='\t', index_col=0).T
## use custom sample list to subset the covariates data
if $[keep_sample.is_file()]:
sample_list = pd.read_csv("$[keep_sample:a]", comment="#", header=None, names=["sample_id"], sep="\t")
covariates_df.loc[sample_list.sample_id]
pr = genotypeio.PlinkReader(plink_prefix_path)
genotype_df = pr.load_genotypes()
variant_df = pr.bim.set_index('snp')[['chrom', 'pos','a0','a1']]
## Retaining only common samples
phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, covariates_df.index)]
phenotype_df = phenotype_df[np.intersect1d(phenotype_df.columns, genotype_df.columns)]
covariates_df = covariates_df.transpose()[np.intersect1d(phenotype_df.columns, covariates_df.index)].transpose()
## To simplify things, there should really not be "chr" prefix
phenotype_pos_df.chr = [x.replace("chr","") for x in phenotype_pos_df.chr]
variant_df.chrom = [x.replace("chr","") for x in variant_df.chrom]
## Trans analysis
trans_df = trans.map_trans(genotype_df,
phenotype_df,
covariates_df,
batch_size=$[batch_size],
return_sparse=True,
return_r2 = True,
pval_threshold=$[pval_threshold],
maf_threshold=$[maf_threshold])
## Filter out cis signal, again if customized cis windows are used, the windows is [start-win,end + win] where win = 0, else it is [start - win, start + win]
trans_df = trans.filter_cis(trans_df, phenotype_pos_df, variant_df, window=window)
## Permutation
if $['True' if permutation else 'False']:
perm_df = trans.map_permutations(genotype_df, covariates_df, batch_size=$[batch_size],
maf_threshold=$[maf_threshold])
trans.apply_permutations(perm_df,trans_df)
## Output
trans_df.rename(columns={"phenotype_id": "molecular_trait_id",
"pval": "pvalue",
"b": "bhat", "b_se": "sebhat"}, inplace=True)
trans_df["n"] = len(phenotype_df.columns.values)
trans_df = variant_df.merge(trans_df, right_on='variant_id', left_index=True)
trans_df.rename(columns={'a1': 'a2', 'a0': 'a1'}, inplace=True)
# genomic inflation factor
lambda_col = trans_df.groupby("molecular_trait_id").apply(lambda x: chi2.ppf(1. - np.median(x.pvalue), 1)/chi2.ppf(0.5,1))
# Convert the Series to a DataFrame
lambda_col = lambda_col.reset_index()
lambda_col.columns = ['molecular_trait_id', 'genomic_inflation_lambda']
lambda_col.to_csv("$[_output:nnn].genomic_inflation.tsv.gz", sep='\t', index = None, compression={'method': 'gzip', 'compresslevel': 9})
# sort the table if chrom and pos is not in ascending order
if not all(trans_df['pos'].iloc[i] <= trans_df['pos'].iloc[i+1] for i in range(len(trans_df)-1)):
trans_df = trans_df.sort_values(by=['chrom', 'pos'])
# print general information of trans_df
print('Output Information:')
print("Output Rows:", len(trans_df))
print("Output Columns:", trans_df.columns.tolist())
print("Output Preview:", trans_df.iloc[0:5, 0:10])
# save output
trans_df.to_csv($[_output["nominal"]:nr], sep='\t', index = None)
R: expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container =container
if ($['FALSE' if permutation and pval_threshold < 1.0 else 'TRUE']){
library(purrr)
library(tidyr)
library(readr)
library(dplyr)
library(qvalue)
# calculate qvalue
pairs_df = read_delim($[_output["nominal"]:nr,], delim = '\t')
pairs_df = pairs_df %>% group_by(molecular_trait_id) %>% mutate(qvalue = qvalue(pvalue)$qvalues)
pairs_df %>% write_delim($[_output["nominal"]:nr],"\t")
}
bash: expand= "$[ ]", stderr = f'{_output["nominal"]:nnn}.stderr', stdout = f'{_output["nominal"]:nnn}.stdout', container = container
bgzip --compress-level 9 $[_output["nominal"]:n]
tabix -S 1 -s 1 -b 2 -e 2 $[_output["nominal"]]
[trans_2]
# This workflow filters significant trans-QTL associations based on p-value or q-value cutoffs
input: group_by = "all"
output: f'{cwd}/{name}_{("qvalue_%" % qvalue_cutoff) if qvalue_cutoff else ("pvalue_%s" % pvalue_cutoff)}_trans_pairs.tsv'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand="$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container
#!/bin/bash
set -e
home_dir="$[cwd]"
pvalue_cutoff="$[pvalue_cutoff]"
qvalue_cutoff="$[qvalue_cutoff]"
first_file=$(ls "${home_dir}"/*trans_qtl.pairs.tsv.gz | head -n 1)
# Determine cutoff value
if [ -n "$pvalue_cutoff" ] && [ -n "$qvalue_cutoff" ]; then
echo "Both p-value and q-value cutoffs provided, using q-value cutoff"
cutoff_value="$qvalue_cutoff"
cutoff_col=$(zcat "$first_file" | awk -F '\t' -v col='qvalue' 'NR==1{for (i=1; i<=NF; i++) if ($i==col) {print i;exit}}')
elif [ -n "$pvalue_cutoff" ]; then
cutoff_value="$pvalue_cutoff"
cutoff_col=$(zcat "$first_file" | awk -F '\t' -v col='pvalue' 'NR==1{for (i=1; i<=NF; i++) if ($i==col) {print i;exit}}')
elif [ -n "$qvalue_cutoff" ]; then
cutoff_value="$qvalue_cutoff"
cutoff_col=$(zcat "$first_file" | awk -F '\t' -v col='qvalue' 'NR==1{for (i=1; i<=NF; i++) if ($i==col) {print i;exit}}')
else
echo "No cutoff provided, using default q-value cutoff of 0.01"
cutoff_value="0.01"
cutoff_col=$(zcat "$first_file" | awk -F '\t' -v col='qvalue' 'NR==1{for (i=1; i<=NF; i++) if ($i==col) {print i;exit}}')
fi
echo "Using column $cutoff_col with cutoff value $cutoff_value"
output_file="$[_output[0]]"
mkdir -p $(dirname "$output_file")
zcat "$first_file" | head -n 1 > "$output_file"
for gz_file in "${home_dir}"/*trans_qtl.pairs.tsv.gz; do
echo "Processing $gz_file..."
# Append significant data from each .gz file to the collective file
zcat "$gz_file" | awk -v cutoff="$cutoff_value" -v col="$cutoff_col" 'NR == 1 {next} NR > 1 && $col < cutoff' >> "$output_file"
done
echo "Total significant associations found: $(wc -l < "$output_file") (including header)"