Hierarchical Multiple Testing#

Description [FIXME]#

This protocol implements a three-step procedure:

  1. Local adjustment: p-values of all cis-variants adjusted within each gene

  2. Global adjustment: minimum adjusted p-values from Step 1 further adjusted across all genes

  3. Global informed identification of significant xQTL: xQTL with locally adjusted p-value below the threshold

It also reorganizes intermediate files produced in tensorQTL analysis to specified archive folder for book-keeping or deletion, to save space.

cf. Huang et al. 2018

Input#

  1. Output from TensorQTL in the form of *.cis_qtl.regional.tsv.gz files

  2. A gene coordinates tsv file with chr, start, end and gene_id columns

Output#

  1. *significant_qtl.filtered_bonferroni_BH_adjusted.tsv.gz

  2. *significant_qtl.original_bonferroni_BH_adjusted.tsv.gz

  3. *significant_qtl.permutation_adjusted.tsv.gz

  4. *significant_qtl.q_beta_adjusted_events_qvalue.tsv.gz

  5. *significant_events.filtered_bonferroni_BH_adjusted.tsv.gz

  6. *significant_events.original_bonferroni_BH_adjusted.tsv.gz

  7. *significant_events.permutation_adjusted.tsv.gz

  8. *summary.txt

  9. *multiple_testing_consolidated.rds

Minimal Working Example Steps#

i. Genetic Effect#

With

  1. permutation testing

  2. recounted number of variants limited to MAF and cis-windows for Bonferroni correction

We recommend reassessing significance of Bonferroni correction if previous analysis use too large of window or contain many low frequency variants. Below we reassess with MAF > 0.01 and cis-window size 1000000 up and downstreams of TSS/TES.

When maf_cutoff and cis_window are not zero, the program will first compute the number of variants after filtering by these metric and write files with n_variants_suffix in the same folder as QTL data, then use those numbers to create filtered list of variants for Bonferroni adjusted p-value.

sos run pipeline/qtl_association_postprocessing.ipynb default \
    --cwd output/tensorqtl_cis \
    --gene-coordinates data/look_up_gene_id.tsv \
    --sub-dir . \
    --tss_dist_col tss_distance \
    --tes_dist_col tes_distance \
    --pecotmr-path pecotmr \
    --maf-cutoff 0.01 --cis-window 1000000 \
    --regional-pattern "*.cis_qtl.regional.tsv.gz$" \
    --output-dir output/heirarchical_multi_test/output \
    --archive-dir output/heirarchical_multi_test/archive \
    -s force

ii. Interaction Effect#

With EigenMT method to get us the effective number of tests, it is recommended setting maf_cutoff and cis_window both to zero to tell the program to not recompute counts for Bonferroni correction,

The above approach seems recommended by tensorqtl package because it is their default option for interaction QTL analysis. We will also adopt it as a recommended practice.

To keep using variant count based Bonferroni correct, simply remove --regional-pattern,

sos run ~/GIT/xqtl-protocol/code/association_scan/qtl_association_postprocessing.ipynb default \
    --gene-coordinates look_up_gene_id.tsv \
    --cwd ~/Downloads/snuc_DeJager_Ast_tensorQTL_MWE --sub-dir "interaction/msex" \
    --maf-cutoff 0 --cis-window 0 \
    --pecotmr-path /path/to/pecotmr \
    --regional-pattern "*.cis_qtl_top_assoc.txt.gz$" \
    --pvalue-pattern "pvalue_.*_interaction" \
    --qvalue-pattern "qvalue_interaction" \
    --output-dir ~/output --archive-dir ~/archive -s force \
    --fdr-threshold 0.3 # set it large for illustration purpose

iii. Quantile QTL#

Here we summarize the significant QTL using both original variants as well as additionally considering MAF>0.05 and cis-window 1Mb around a gene, for Bonferroni adjusted p-value,

sos run ~/GIT/xqtl-protocol/code/association_scan/qtl_association_postprocessing.ipynb default \
    --gene-coordinates look_up_gene_id.tsv \
    --cwd ~/Downloads/QR_ROSMAP_Ast_mega_quantile_tensorQTL_MWE --sub-dir "." \
    --maf-cutoff 0.05 --cis-window 1000000 \
    --pecotmr-path /path/to/pecotmr \
    --af-col maf --tss-dist-col feature_tss --tes-dist-col feature_tes \
    --qvalue-pattern "^qvalue$" \
    --output-dir ~/output --archive-dir ~/archive -s force

Command interface#

sos run qtl_association_postprocessing.ipynb -h
usage: sos run /restricted/projectnb/xqtl/xqtl_protocol/xqtl-protocol/code/association_scan/qtl_association_postprocessing.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  default

Global Workflow Options:
  --cwd . (as path)
  --pecotmr-path VAL (as path, required)
                        path to pecotmr repo
  --gene-coordinates VAL (as path, required)
  --output-dir VAL (as path, required)
  --archive-dir VAL (as path, required)
  --sub-dir VAL (as path, required)
  --maf-cutoff 0.01 (as float)
  --cis-window 1000000 (as int)
  --tss-dist-col 'start_distance'
  --tes-dist-col 'end_distance'
  --af-col af
  --molecular-id-col 'molecular_trait_object_id'
  --pvalue-cutoff 0.05 (as float)
                        This is for selecting the subset of data to process on protential signals assuming we drop those above this threshold This might lead to underestimates in
                        qvalue method since qvalue < 0.05 may contain pvalue > 0.05
  --fdr-threshold 0.05 (as float)
                        This is used for both event and variant level significance filter
  --pvalue-pattern '^pvalue$'
                        eg "pvalue" for pvalue, "pvalue_.*_interaction" for interaction
  --qvalue-pattern '^qvalue$'
                        eg "qvalue" for pvalue, "qvalue_interaction" for interaction
  --regional-pattern NULL
                        eg "*.cis_qtl.regional.tsv.gz$" for genetic effect via our pipeline TensorQTL.ipynb "*.cis_qtl_top_assoc.txt.gz$" for interaction genetic effect
  --qtl-pattern '*.cis_qtl.pairs.tsv.gz$'
  --n-variants-suffix 'cis_n_variants_stats.tsv.gz'

Sections
  default:

Workflow implementation#

[global]
parameter: cwd = path(".")
#path to pecotmr repo
parameter: pecotmr_path = path
parameter: gene_coordinates = path
parameter: output_dir = path
parameter: archive_dir = path
parameter: sub_dir = path
parameter: maf_cutoff = 0.01
parameter: cis_window = 1000000
parameter: tss_dist_col = "start_distance"
parameter: tes_dist_col = "end_distance"
parameter: af_col = "af"
parameter: molecular_id_col = "molecular_trait_object_id"
# This is for selecting the subset of data to process on protential signals 
# assuming we drop those above this threshold
# This might lead to underestimates in qvalue method since qvalue < 0.05 may contain pvalue > 0.05
parameter: pvalue_cutoff = 0.05
# This is used for both event and variant level significance filter
parameter: fdr_threshold = 0.05
# eg "pvalue" for pvalue, "pvalue_.*_interaction" for interaction
parameter: pvalue_pattern = "^pvalue$"
# eg "qvalue" for pvalue, "qvalue_interaction" for interaction
parameter: qvalue_pattern = "^qvalue$"
# eg "*.cis_qtl.regional.tsv.gz$" for genetic effect via our pipeline TensorQTL.ipynb
# "*.cis_qtl_top_assoc.txt.gz$" for interaction genetic effect
parameter: regional_pattern = "NULL"
parameter: qtl_pattern = "*.cis_qtl.pairs.tsv.gz$"
parameter: n_variants_suffix = "cis_n_variants_stats.tsv.gz"

work_dir = f"{cwd:a}/{sub_dir}"
if sub_dir == path("."):
    output_dir = f"{output_dir:a}/{cwd:b}"
    archive_dir = f"{archive_dir:a}/{cwd:b}"
    work_dir = f"{cwd:a}"
else:
    output_dir = f"{output_dir:a}/{cwd:b}/{sub_dir}"
    archive_dir = f"{archive_dir:a}/{cwd:b}/{sub_dir}"
[default]
output: f"{output_dir}/{cwd:b}_multiple_testing_consolidated.rds"
task: trunk_workers = 1, tags = f'tensorqtl_postprocessing_{_output:n}'
R: expand = "${ }"

    params <- list()
    params$workdir           <- "${work_dir}"
    params$maf_cutoff        <- ${maf_cutoff}
    params$cis_window        <- ${cis_window}
    params$pvalue_cutoff     <- ${pvalue_cutoff}
    params$fdr_threshold     <- ${fdr_threshold}
    params$gene_coordinates  <- "${gene_coordinates:a}"
    params$output_dir        <- "${output_dir}"
    params$archive_dir       <- "${archive_dir}"
    params$regional_pattern  <- "${regional_pattern}"
    params$n_variants_suffix <- "${n_variants_suffix}"
    params$qtl_pattern       <- "${qtl_pattern}"
    params$pvalue_pattern    <- "${pvalue_pattern}"
    params$qvalue_pattern    <- "${qvalue_pattern}"
    params$start_distance_col <- "${tss_dist_col}"
    params$end_distance_col   <- "${tes_dist_col}"
    params$af_col <- "${af_col}"
    params$molecular_id_col <- "${molecular_id_col}"
    convert_null_strings <- function(params) {
      if (is.list(params)) {
        # Apply the function to each element in the list
        result <- lapply(params, convert_null_strings)
        return(result)
      } else {
        # For non-list elements, check if it's the string "NULL"
        if (is.character(params) && params == "NULL") {
          return(NULL)
        } else {
          return(params)
        }
      }
    }

    params <- convert_null_strings(params)
    source("${pecotmr_path}/inst/code/tensorqtl_postprocessor.R")
    results <- hierarchical_multiple_testing_correction(params)
    write_results(results, params$output_dir, params$workdir, to_cwd = "regional")
    archive_files(params)
    saveRDS(results, ${_output:r})