QTL Association Testing#
This is an introductory, run-through notebook for the QTL association testing section of the xQTL protocol. If you are new to the protocol, start here: it explains what this section does, what inputs it needs, and walks you through each analysis on the included toy dataset so you can reproduce results end to end before applying the pipeline to your own data.
The heavy lifting is done by the worker notebook TensorQTL.ipynb. This section notebook simply orchestrates it for the three common analyses.
Prerequisites#
This section requires by-chromosome genotypes from genotype_preprocessing, phenotype matrices from the molecular phenotype steps, and covariates from covariate_preprocessing.
Description#
QTL association testing measures the statistical association between genetic variants and molecular phenotypes (e.g. gene expression). This section runs three flavours of that test, all through TensorQTL:
cis — each phenotype against nearby variants (within a window around the gene). This is the most common QTL scan.
trans — phenotypes against variants elsewhere in the genome (here, a chosen chromosome), optionally limited to a set of genes of interest.
interaction — a cis scan that additionally tests whether the genotype effect is modified by a covariate (e.g. sex).
Each analysis is an independent call to the same worker; you can run only the ones you need.
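To make the idea of a single association test concrete, the sketch below regresses one phenotype on one variant's genotype dosage with ordinary least squares and reports the slope, its standard error and the t statistic. It is a toy illustration only: the function name `simple_association` and all numbers are made up, covariates are ignored, and it is not TensorQTL's implementation, which also adjusts for covariates and runs many tests in batch.

```python
import math

def simple_association(dosage, phenotype):
    """OLS of phenotype on genotype dosage (0/1/2); returns (slope, se, t)."""
    n = len(dosage)
    mean_g = sum(dosage) / n
    mean_y = sum(phenotype) / n
    sxx = sum((g - mean_g) ** 2 for g in dosage)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(dosage, phenotype))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_g
    # Residual sum of squares around the fitted line.
    rss = sum((y - intercept - slope * g) ** 2 for g, y in zip(dosage, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se, slope / se

# Made-up toy data: six samples, expression rising with alt-allele count.
dosage = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]
slope, se, t = simple_association(dosage, expression)
print(f"slope={slope:.3f} se={se:.3f} t={t:.2f}")
```

A nominal p-value would then come from comparing the t statistic against a t distribution with the appropriate degrees of freedom; that step is left out here to keep the sketch dependency-free.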
How it fits in the pipeline#
This section sits after the data-preprocessing sections. Its inputs are produced by:
genotype_preprocessing → by-chromosome PLINK genotypes and the file list,
phenotype_preprocessing → by-chromosome molecular phenotype bed.gz files and the file list,
covariate_preprocessing → the covariate matrix (known covariates + hidden factors).
The outputs of this section (cis/trans/interaction summary statistics) feed the downstream association postprocessing and fine-mapping / integration sections.
Input#
Make sure the preprocessing sections have been run so the following toy inputs exist. The commands below assume you launch them from the toy-example root directory.
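| Role | File | Produced by |
|---|---|---|
| Genotypes (by chrom) | output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.genotype_by_chrom_files.txt | genotype_preprocessing |
| Phenotypes (by chrom) | output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.txt | phenotype_preprocessing |
| Covariates | output/covariate/*Marchenko_PC.gz | covariate_preprocessing |
| Genes of interest (trans) | data/combined_AD_genes.csv | provided with the toy data |
| Cis-window definitions (optional) | reference_data/TAD/TADB_enhanced_cis.bed | provided reference data |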
Timing: Runtime varies by dataset size and compute resources. For the toy chr22 minimal working example (MWE) dataset, most steps complete in under 10 minutes on a standard HPC node.
You can confirm the inputs are present by listing them:
ls -la output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.genotype_by_chrom_files.txt \
output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.txt \
output/covariate/*Marchenko_PC.gz \
data/combined_AD_genes.csv reference_data/TAD/TADB_enhanced_cis.bed
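If you prefer a check that names exactly which inputs are missing, a small Python helper can do the same job as the listing. This is a minimal sketch: `missing_inputs` is a hypothetical helper, not part of the protocol, and the paths are copied from the listing command above.

```python
import glob
import os

def missing_inputs(patterns, root="."):
    """Return the patterns (plain paths or globs) that match no file under root."""
    return [p for p in patterns if not glob.glob(os.path.join(root, p))]

# Paths copied from the listing command; run from the toy-example root directory.
REQUIRED = [
    "output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.genotype_by_chrom_files.txt",
    "output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.txt",
    "output/covariate/*Marchenko_PC.gz",
    "data/combined_AD_genes.csv",
    "reference_data/TAD/TADB_enhanced_cis.bed",
]

for pattern in missing_inputs(REQUIRED):
    print("missing:", pattern)
```

An empty output means every required input was found.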
Steps#
Each analysis below is a single command. They are independent — read the short explanation, then run the cell. On the toy data each takes roughly 1–2 minutes.
1. cis-QTL scan#
This is the standard scan: every molecular phenotype is tested against the variants within its cis window (1 Mb around the TSS by default). Genotype and phenotype lists are matched by chromosome (the toy data covers chr22). --MAC 5 is a relaxed minor-allele-count cutoff appropriate for the small toy sample. Results land in output/tensorqtl_cis/.
sos run pipeline/TensorQTL.ipynb cis \
--genotype-file output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.genotype_by_chrom_files.txt \
--phenotype-file output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.txt \
--covariate-file output/covariate/protocol_example.rnaseq.bed.protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \
--cwd output/tensorqtl_cis --name protocol_example --MAC 5 --numThreads 2
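The two filters described for the cis scan, the cis window and the minor-allele-count cutoff, can be sketched as follows. This is an illustration under stated assumptions, not the worker's code: the window is taken to be symmetric (1 Mb on either side of the TSS), the cutoff is treated as inclusive, and the function names and variant data are made up.

```python
def minor_allele_count(dosages):
    """MAC from 0/1/2 alt-allele dosages: the count of the rarer allele."""
    alt = sum(dosages)
    total = 2 * len(dosages)  # two alleles per diploid sample
    return min(alt, total - alt)

def variants_in_cis_window(variants, tss, window=1_000_000, min_mac=5):
    """Keep variant IDs within +/- window bp of the TSS that pass the MAC cutoff."""
    kept = []
    for var_id, pos, dosages in variants:
        # Assumption: symmetric window and inclusive MAC threshold.
        if abs(pos - tss) <= window and minor_allele_count(dosages) >= min_mac:
            kept.append(var_id)
    return kept

# Made-up variants on one chromosome: (id, position, dosages for 6 samples).
variants = [
    ("v_near_common", 20_500_000, [0, 1, 1, 2, 1, 0]),  # MAC 5, 0.5 Mb from TSS
    ("v_near_rare",   20_100_000, [0, 0, 1, 0, 0, 0]),  # MAC 1, fails the cutoff
    ("v_far_common",  25_000_000, [1, 1, 1, 1, 1, 1]),  # MAC 6, 5 Mb from TSS
]
print(variants_in_cis_window(variants, tss=20_000_000))
```

With only a handful of samples, a count-based cutoff such as MAC is easier to reason about than a frequency threshold, which is why it suits the toy data.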
2. trans-QTL scan#
A trans scan tests phenotypes against variants on a chosen genotype chromosome (--trans-geno-chromosome 22). Because genome-wide trans testing is expensive, here we restrict the phenotypes to the genes listed in --region-list (gene name in column 4). Results land in output/tensorqtl_trans/.
sos run pipeline/TensorQTL.ipynb trans \
--genotype-file output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.genotype_by_chrom_files.txt \
--phenotype-file output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.txt \
--covariate-file output/covariate/protocol_example.rnaseq.bed.protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \
--cwd output/tensorqtl_trans --name protocol_example --MAC 5 --numThreads 2 \
--trans-geno-chromosome 22 --region-list data/combined_AD_genes.csv --region-list-phenotype-column 4
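Conceptually, restricting the trans scan to a gene list means reading one column of the list and keeping only the matching phenotypes. The sketch below assumes a comma-separated file with a header row and a 1-based column index; the actual parsing in the worker notebook is not shown in this section, and the file contents, `genes_of_interest` helper and gene names are made up.

```python
import csv
import io

def genes_of_interest(handle, column=4, has_header=True):
    """Collect the values in a 1-based column of a comma-separated file."""
    rows = csv.reader(handle)
    if has_header:
        next(rows, None)  # skip the header line
    return {row[column - 1] for row in rows if len(row) >= column}

# Made-up stand-in for a gene list file; the real layout may differ.
toy_csv = io.StringIO(
    "chr,start,end,gene_id\n"
    "chr22,100,200,GENE_A\n"
    "chr22,300,400,GENE_B\n"
)
wanted = genes_of_interest(toy_csv)

# Only phenotypes named in the list would be carried into the trans scan.
phenotype_ids = ["GENE_A", "GENE_C", "GENE_B", "GENE_D"]
tested = [pid for pid in phenotype_ids if pid in wanted]
print(tested)
```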
3. interaction (iQTL) scan#
An interaction scan is a cis scan that also tests whether the variant effect depends on a covariate. Here we use --interaction msex (the sex column in the toy covariate file; in the full protocol this column is named sex). --no-permutation skips the permutation step (permutation is not meaningful for a single interaction test on toy data), and --maf-threshold 0.05 sets the MAF cutoff. Results land in output/tensorqtl_int/.
sos run pipeline/TensorQTL.ipynb cis \
--genotype-file output/genotype_by_chrom/protocol_example.genotype.merged.plink_qc.genotype_by_chrom_files.txt \
--phenotype-file output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.txt \
--covariate-file output/covariate/protocol_example.rnaseq.bed.protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \
--cwd output/tensorqtl_int --name protocol_example --MAC 5 --numThreads 2 \
--interaction msex --maf-threshold 0.05 --no-permutation
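For a binary covariate such as sex, the interaction term has a simple reading: in the model y ~ genotype + group + genotype × group, the interaction coefficient equals the genotype slope in one group minus the slope in the other. The sketch below uses that identity on made-up data. It ignores all other covariates, so it illustrates the concept rather than reproducing TensorQTL's estimates, and the function names are hypothetical.

```python
def slope(dosage, phenotype):
    """Least-squares slope of phenotype on genotype dosage."""
    n = len(dosage)
    mg = sum(dosage) / n
    my = sum(phenotype) / n
    sxx = sum((g - mg) ** 2 for g in dosage)
    sxy = sum((g - mg) * (y - my) for g, y in zip(dosage, phenotype))
    return sxy / sxx

def interaction_effect(dosage, phenotype, group):
    """Genotype-by-group interaction for a 0/1 covariate.

    With a binary group and no other covariates, the interaction
    coefficient equals slope(group 1) - slope(group 0).
    """
    split = {0: ([], []), 1: ([], [])}
    for g, y, k in zip(dosage, phenotype, group):
        split[k][0].append(g)
        split[k][1].append(y)
    return slope(*split[1]) - slope(*split[0])

# Made-up data: the per-allele effect is 0.2 in group 0 and 1.0 in group 1.
dosage    = [0, 1, 2, 0, 1, 2]
group     = [0, 0, 0, 1, 1, 1]
phenotype = [1.0, 1.2, 1.4, 1.0, 2.0, 3.0]
print(round(interaction_effect(dosage, phenotype, group), 3))
```

An interaction effect near zero would mean the variant acts similarly in both groups; a clearly non-zero value is what an iQTL scan is looking for.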
Output#
After running the cells above you will have three result folders:
| Folder | Key files | Meaning |
|---|---|---|
| output/tensorqtl_cis/ | | cis nominal + region-level statistics |
| output/tensorqtl_trans/ | | trans nominal statistics |
| output/tensorqtl_int/ | | interaction (iQTL) nominal statistics |
The nominal pairs files contain one row per variant–phenotype pair with effect size, standard error and p-value; the regional-significance files summarise the strongest signal per gene.
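The relationship between the two file types can be illustrated by reducing nominal rows to the strongest signal per gene. This is a minimal sketch on made-up rows: the column names (`phenotype_id`, `variant_id`, `beta`, `se`, `pval`) are hypothetical placeholders, and the real regional-significance files may report additional, adjusted statistics rather than a plain minimum p-value.

```python
def top_signal_per_gene(pairs):
    """Return {phenotype_id: row}, keeping the smallest-p-value row per phenotype."""
    best = {}
    for row in pairs:
        gene = row["phenotype_id"]
        if gene not in best or row["pval"] < best[gene]["pval"]:
            best[gene] = row
    return best

# Made-up nominal rows: one per variant-phenotype pair.
pairs = [
    {"phenotype_id": "GENE_A", "variant_id": "v1", "beta": 0.30, "se": 0.10, "pval": 2.7e-3},
    {"phenotype_id": "GENE_A", "variant_id": "v2", "beta": 0.55, "se": 0.09, "pval": 1.0e-9},
    {"phenotype_id": "GENE_B", "variant_id": "v3", "beta": -0.10, "se": 0.12, "pval": 0.40},
]
for gene, row in top_signal_per_gene(pairs).items():
    print(gene, row["variant_id"], row["pval"])
```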
Next steps#
Proceed to association postprocessing to filter, calibrate and summarise these statistics, and then to the fine-mapping / multi-omics integration sections. For full parameter documentation of any analysis, open the worker notebook directly or print its help text:
sos run pipeline/TensorQTL.ipynb -h
Anticipated Results#
The cis analysis produces .cis_qtl_pairs.chr*.parquet files with nominal p-values and a .cis_qtl.txt.gz file with significant results. For the toy chr22 dataset, expect hundreds of nominal associations.