Hierarchical Multiple Testing#

This is a runnable, toy-data example of the qtl_association_testing post-processing step. It performs hierarchical multiple-testing correction on the per-gene cis-QTL association results produced by TensorQTL, then archives the bulky intermediate files.

Description#

Given the per-gene association output from a cis-QTL scan, this pipeline implements a three-step hierarchical procedure: (1) local adjustment of the p-values of all cis-variants within each gene, (2) global adjustment of the minimum adjusted p-value across genes, and (3) identification of significant xQTL whose locally adjusted p-value falls below the threshold. It also reorganises the intermediate TensorQTL files into an archive folder for book-keeping or deletion.

The same default workflow handles three flavours of upstream results: the genetic (cis) effect (shown below), the interaction effect (run on *.cis_qtl_top_assoc.txt.gz with --maf-cutoff 0 --cis-window 0 and interaction p/q-value patterns), and quantile QTL. This example demonstrates the genetic-effect flavour on toy data. (cf. Huang et al. 2018)

Timing: ~1-2 min on typical compute infrastructure.

Input Files#

Role

Toy file

Produced by

Per-gene cis association

output/tensorqtl_cis/*.cis_qtl.regional.tsv.gz

TensorQTL cis scan

Full variant-level pairs

output/tensorqtl_cis/*.cis_qtl.pairs.tsv.gz

TensorQTL cis scan

Gene coordinates

input/reference_data/look_up_gene_id.tsv (chr, start, end, gene_id)

reference

Output Files#

Written under --output-dir (significance results) and --archive-dir (moved intermediates):

File

Meaning

*_multiple_testing_consolidated.rds

Consolidated multiple-testing results object

*.cis_regional.summary.txt

Per-gene regional summary after correction

*significant_qtl*.tsv.gz

Significant variant-level QTL (Bonferroni / BH / permutation)

*significant_events*.tsv.gz

Significant gene-level events

archive/.../*.parquet

Bulky TensorQTL intermediates moved out of the working dir

On the 6-sample toy data the correction runs cleanly and produces the consolidated object and summary; no events pass the genome-wide significance threshold, which is expected at this tiny scale.

Steps#

Run the hierarchical correction on the toy cis results. The MAF / cis-window options are non-zero, so the program recomputes the variant counts from the full pairs file before applying the Bonferroni correction. On this toy data the distance columns in the TensorQTL output are named tss_distance and tes_distance, so we point --tss-dist-col / --tes-dist-col at them.

sos run pipeline/qtl_association_postprocessing.ipynb default \
    --cwd output/tensorqtl_cis \
    --gene-coordinates input/reference_data/look_up_gene_id.tsv \
    --sub-dir . \
    --tss-dist-col tss_distance --tes-dist-col tes_distance \
    --maf-cutoff 0.01 --cis-window 1000000 \
    --regional-pattern "*.cis_qtl.regional.tsv.gz$" \
    --output-dir output/hierarchical_multi_test/output \
    --archive-dir output/hierarchical_multi_test/archive --enable-archive True \
    --pecotmr-path ../pecotmr \
    -s force

Command interface#

Show all workflow options and their defaults:

sos run pipeline/qtl_association_postprocessing.ipynb -h

Workflow implementation#

The cells below are the unmodified SoS workflow definition that the example above calls.

Anticipated Results#

The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.

[global]
parameter: cwd = path(".")
#path to pecotmr repo
parameter: pecotmr_path = path
parameter: gene_coordinates = path
parameter: output_dir = path
parameter: archive_dir = path
parameter: sub_dir = path
parameter: maf_cutoff = 0.01
parameter: cis_window = 1000000
parameter: tss_dist_col = "start_distance"
parameter: tes_dist_col = "end_distance"
parameter: af_col = "af"
parameter: molecular_id_col = "molecular_trait_object_id"
# This is for selecting the subset of data to process on protential signals 
# assuming we drop those above this threshold
# This might lead to underestimates in qvalue method since qvalue < 0.05 may contain pvalue > 0.05
parameter: pvalue_cutoff = 0.05
# This is used for both event and variant level significance filter
parameter: fdr_threshold = 0.05
# eg "pvalue" for pvalue, "pvalue_.*_interaction" for interaction
parameter: pvalue_pattern = "^pvalue$"
# eg "qvalue" for pvalue, "qvalue_interaction" for interaction
parameter: qvalue_pattern = "^qvalue$"
# eg "*.cis_qtl.regional.tsv.gz$" for genetic effect via our pipeline TensorQTL.ipynb
# "*.cis_qtl_top_assoc.txt.gz$" for interaction genetic effect
parameter: regional_pattern = "NULL"
parameter: qtl_pattern = "*.cis_qtl.pairs.tsv.gz$"
parameter: n_variants_suffix = "cis_n_variants_stats.tsv.gz"
parameter: enable_archive = False
parameter: additional_pvalue_cols = ""

work_dir = f"{cwd:a}/{sub_dir}"
if sub_dir == path("."):
    output_dir = f"{output_dir:a}/{cwd:b}"
    archive_dir = f"{archive_dir:a}/{cwd:b}"
    work_dir = f"{cwd:a}"
else:
    output_dir = f"{output_dir:a}/{cwd:b}/{sub_dir}"
    archive_dir = f"{archive_dir:a}/{cwd:b}/{sub_dir}"
[default]
output: f"{output_dir}/{cwd:b}_multiple_testing_consolidated.rds"
task: trunk_workers = 1, tags = f'tensorqtl_postprocessing_{_output:n}'
R: expand = "${ }"

    params <- list()
    params$workdir           <- "${work_dir}"
    params$maf_cutoff        <- ${maf_cutoff}
    params$cis_window        <- ${cis_window}
    params$pvalue_cutoff     <- ${pvalue_cutoff}
    params$fdr_threshold     <- ${fdr_threshold}
    params$gene_coordinates  <- "${gene_coordinates:a}"
    params$output_dir        <- "${output_dir}"
    params$archive_dir       <- "${archive_dir}"
    params$regional_pattern  <- "${regional_pattern}"
    params$n_variants_suffix <- "${n_variants_suffix}"
    params$qtl_pattern       <- "${qtl_pattern}"
    params$pvalue_pattern    <- "${pvalue_pattern}"
    params$qvalue_pattern    <- "${qvalue_pattern}"
    params$start_distance_col <- "${tss_dist_col}"
    params$end_distance_col   <- "${tes_dist_col}"
    params$af_col <- "${af_col}"
    params$molecular_id_col <- "${molecular_id_col}"
    enable_archive_val <- "${enable_archive}"
    if (enable_archive_val %in% c("True", "TRUE", "true")) {
      params$enable_archive <- TRUE
    } else {
      params$enable_archive <- FALSE
    }
    message(sprintf("Archive setting - input: '%s', converted: %s", enable_archive_val, params$enable_archive))    
    params$additional_pvalue_cols <- "${additional_pvalue_cols}"    
    convert_null_strings <- function(params) {
      if (is.list(params)) {
        # Apply the function to each element in the list
        result <- lapply(params, convert_null_strings)
        return(result)
      } else {
        # For non-list elements, check if it's the string "NULL"
        if (is.character(params) && params == "NULL") {
          return(NULL)
        } else {
          return(params)
        }
      }
    }

    params <- convert_null_strings(params)
    source("${pecotmr_path}/inst/code/tensorqtl_postprocessor.R")
    results <- hierarchical_multiple_testing_correction(params)
    write_results(results, params$output_dir, params$workdir, to_cwd = "regional")
    if (params$enable_archive) {
      archive_files(params)
    }
    saveRDS(results, ${_output:r})