Hierarchical Multiple Testing#
This is a runnable, toy-data example of the qtl_association_testing post-processing step. It performs hierarchical multiple-testing correction on the per-gene cis-QTL association results produced by TensorQTL, then archives the bulky intermediate files.
Description#
Given the per-gene association output from a cis-QTL scan, this pipeline implements a three-step hierarchical procedure: (1) local adjustment of the p-values of all cis-variants within each gene, (2) global adjustment of the minimum adjusted p-value across genes, and (3) identification of significant xQTL whose locally adjusted p-value falls below the threshold. It also reorganises the intermediate TensorQTL files into an archive folder for book-keeping or deletion.
The same default workflow handles three flavours of upstream results: the genetic (cis) effect (shown below), the interaction effect (run on *.cis_qtl_top_assoc.txt.gz with --maf-cutoff 0 --cis-window 0 and interaction p/q-value patterns), and quantile QTL. This example demonstrates the genetic-effect flavour on toy data. (cf. Huang et al. 2018)
Timing: ~1-2 min on typical compute infrastructure.
Input Files#
Role |
Toy file |
Produced by |
|---|---|---|
Per-gene cis association |
|
TensorQTL cis scan |
Full variant-level pairs |
|
TensorQTL cis scan |
Gene coordinates |
|
reference |
Output Files#
Written under --output-dir (significance results) and --archive-dir (moved intermediates):
File |
Meaning |
|---|---|
|
Consolidated multiple-testing results object |
|
Per-gene regional summary after correction |
|
Significant variant-level QTL (Bonferroni / BH / permutation) |
|
Significant gene-level events |
|
Bulky TensorQTL intermediates moved out of the working dir |
On the 6-sample toy data the correction runs cleanly and produces the consolidated object and summary; no events pass the genome-wide significance threshold, which is expected at this tiny scale.
Steps#
Run the hierarchical correction on the toy cis results. The MAF / cis-window options are non-zero, so the program recomputes the variant counts from the full pairs file before applying the Bonferroni correction. On this toy data the distance columns in the TensorQTL output are named tss_distance and tes_distance, so we point --tss-dist-col / --tes-dist-col at them.
sos run pipeline/qtl_association_postprocessing.ipynb default \
--cwd output/tensorqtl_cis \
--gene-coordinates input/reference_data/look_up_gene_id.tsv \
--sub-dir . \
--tss-dist-col tss_distance --tes-dist-col tes_distance \
--maf-cutoff 0.01 --cis-window 1000000 \
--regional-pattern "*.cis_qtl.regional.tsv.gz$" \
--output-dir output/hierarchical_multi_test/output \
--archive-dir output/hierarchical_multi_test/archive --enable-archive True \
--pecotmr-path ../pecotmr \
-s force
Command interface#
Show all workflow options and their defaults:
sos run pipeline/qtl_association_postprocessing.ipynb -h
Workflow implementation#
The cells below are the unmodified SoS workflow definition that the example above calls.
Anticipated Results#
The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.
[global]
parameter: cwd = path(".")
#path to pecotmr repo
parameter: pecotmr_path = path
parameter: gene_coordinates = path
parameter: output_dir = path
parameter: archive_dir = path
parameter: sub_dir = path
parameter: maf_cutoff = 0.01
parameter: cis_window = 1000000
parameter: tss_dist_col = "start_distance"
parameter: tes_dist_col = "end_distance"
parameter: af_col = "af"
parameter: molecular_id_col = "molecular_trait_object_id"
# This is for selecting the subset of data to process on protential signals
# assuming we drop those above this threshold
# This might lead to underestimates in qvalue method since qvalue < 0.05 may contain pvalue > 0.05
parameter: pvalue_cutoff = 0.05
# This is used for both event and variant level significance filter
parameter: fdr_threshold = 0.05
# eg "pvalue" for pvalue, "pvalue_.*_interaction" for interaction
parameter: pvalue_pattern = "^pvalue$"
# eg "qvalue" for pvalue, "qvalue_interaction" for interaction
parameter: qvalue_pattern = "^qvalue$"
# eg "*.cis_qtl.regional.tsv.gz$" for genetic effect via our pipeline TensorQTL.ipynb
# "*.cis_qtl_top_assoc.txt.gz$" for interaction genetic effect
parameter: regional_pattern = "NULL"
parameter: qtl_pattern = "*.cis_qtl.pairs.tsv.gz$"
parameter: n_variants_suffix = "cis_n_variants_stats.tsv.gz"
parameter: enable_archive = False
parameter: additional_pvalue_cols = ""
work_dir = f"{cwd:a}/{sub_dir}"
if sub_dir == path("."):
output_dir = f"{output_dir:a}/{cwd:b}"
archive_dir = f"{archive_dir:a}/{cwd:b}"
work_dir = f"{cwd:a}"
else:
output_dir = f"{output_dir:a}/{cwd:b}/{sub_dir}"
archive_dir = f"{archive_dir:a}/{cwd:b}/{sub_dir}"
[default]
output: f"{output_dir}/{cwd:b}_multiple_testing_consolidated.rds"
task: trunk_workers = 1, tags = f'tensorqtl_postprocessing_{_output:n}'
R: expand = "${ }"
params <- list()
params$workdir <- "${work_dir}"
params$maf_cutoff <- ${maf_cutoff}
params$cis_window <- ${cis_window}
params$pvalue_cutoff <- ${pvalue_cutoff}
params$fdr_threshold <- ${fdr_threshold}
params$gene_coordinates <- "${gene_coordinates:a}"
params$output_dir <- "${output_dir}"
params$archive_dir <- "${archive_dir}"
params$regional_pattern <- "${regional_pattern}"
params$n_variants_suffix <- "${n_variants_suffix}"
params$qtl_pattern <- "${qtl_pattern}"
params$pvalue_pattern <- "${pvalue_pattern}"
params$qvalue_pattern <- "${qvalue_pattern}"
params$start_distance_col <- "${tss_dist_col}"
params$end_distance_col <- "${tes_dist_col}"
params$af_col <- "${af_col}"
params$molecular_id_col <- "${molecular_id_col}"
enable_archive_val <- "${enable_archive}"
if (enable_archive_val %in% c("True", "TRUE", "true")) {
params$enable_archive <- TRUE
} else {
params$enable_archive <- FALSE
}
message(sprintf("Archive setting - input: '%s', converted: %s", enable_archive_val, params$enable_archive))
params$additional_pvalue_cols <- "${additional_pvalue_cols}"
convert_null_strings <- function(params) {
if (is.list(params)) {
# Apply the function to each element in the list
result <- lapply(params, convert_null_strings)
return(result)
} else {
# For non-list elements, check if it's the string "NULL"
if (is.character(params) && params == "NULL") {
return(NULL)
} else {
return(params)
}
}
}
params <- convert_null_strings(params)
source("${pecotmr_path}/inst/code/tensorqtl_postprocessor.R")
results <- hierarchical_multiple_testing_correction(params)
write_results(results, params$output_dir, params$workdir, to_cwd = "regional")
if (params$enable_archive) {
archive_files(params)
}
saveRDS(results, ${_output:r})