Skip to contents

Overview

This vignette picks up where the TWAS weights vignette leaves off: a TWASWeights S4 object with per-gene per-variant weights already exists. The remaining task is to combine those weights with GWAS summary statistics to obtain a per-gene TWAS Z-score.

pecotmr exposes this inference at three layers:

Layer Entry point When to use
File-based pipeline twas_pipeline() Production: per-gene weights database + tabix-indexed GWAS sumstats + LD sketch genotypes; outputs per-gene-per-context-per-study TWAS results with imputability gating, method fallback, and optional MR.
Per-gene inference twas_analysis() In-memory: applies many weight methods to one gene’s GWAS sumstats and combines per-method p-values (ACAT / HMP / Fisher / Stouffer / etc.) into an omnibus test.
Single-method math twas_z(), twas_joint_z() When you already have aligned weights, z, and an LD sketch. Convenient for custom workflows or method comparisons.

The TWAS Z-score formula

Per single weight vector w and GWAS z-score vector z, the TWAS test statistic is

z_\text{TWAS} \;=\; \frac{w^\top z}{\sqrt{w^\top R\, w}}

where R is the LD matrix at the weight variants. Under the null, z_TWAS follows a standard normal.

Eigenvalue / eigenvector reparameterisation

pecotmr evaluates the denominator from the full SVD of the standardised LD sketch genotype matrix rather than materialising R. If X = U D V^\top is the (untruncated) SVD of the standardised sketch, then

R \;=\; \frac{X^\top X}{n_\text{sketch}-1} \;=\; V\, \Lambda\, V^\top, \qquad \Lambda_i \;=\; \frac{D_i^{2}}{n_\text{sketch}-1},

and

w^\top R\, w \;=\; \sum_{i=1}^{r} \Lambda_i \,\bigl(V^\top w\bigr)_i^{2}, \qquad r \;=\; \mathrm{rank}(X).

The sum runs over every nonzero singular value — no top-K truncation. The two expressions are algebraically equivalent (and numerically agree to floating-point precision when both are computed), so this is a reparameterisation, not an approximation.

We use this form for two practical reasons:

  • The SVD is computed once per gene region and reused across every method × condition × study evaluation. Each subsequent denominator costs O(r p) for V^\top w plus O(r) for the weighted sum, versus O(p^2) to multiply against an explicit R.
  • When the LD sketch has n_\text{sketch} \le p, R is rank-deficient (rank r, with n_\text{sketch} \ge r). Working in the r-dimensional eigenbasis represents exactly the information the sketch carries and avoids forming a singular p \times p matrix.

In practice this means twas_z() and twas_joint_z() consume (V, D, n_\text{sketch}) directly from the per-gene LD sketch — R is never formed at any step of the pipeline.

Single-method inference: twas_z()

Given aligned weights, GWAS z-scores, and the LD sketch components, twas_z() returns the per-gene Z and two-sided p-value:

# Inputs from the upstream pipelines (twas-weights.html and earlier):
#   w        : numeric vector, one weight per variant for one gene/method
#   gwas_z   : numeric vector of GWAS z-scores, aligned to the same variants
#   V, D, n  : right singular vectors, singular values, sample size of the
#              LD sketch for the gene region

res <- twas_z(weights = w, z = gwas_z,
              V = V, D = D, n_sketch = n)
res$z       # TWAS z-score
res$pval    # two-sided p-value

V, D, and n_sketch come from the LD sketch the pipeline already loads for each gene (see “The file-based pipeline” below); you do not need to compute them yourself in the normal workflow.

Combining multiple methods per gene: twas_analysis()

When the gene’s TWASWeights carries weights for several methods (the default for both twas_weights_pipeline() and twas_weights_sumstat_pipeline()), twas_analysis() applies each method’s weights to the GWAS and optionally combines the per-method p-values into an omnibus test:

# weights_matrix: variants x methods (one column per weight method)
# gwas_sumstats_db: data.frame with variant_id, z, ... for the GWAS
# V, D, n_sketch, ld_variant_ids: LD sketch components for the gene region

res <- twas_analysis(
  weights_matrix       = weights_matrix,
  gwas_sumstats_db     = gwas_sumstats_db,
  extract_variants_objs = rownames(weights_matrix),
  V = V, D = D, n_sketch = n_sketch, ld_variant_ids = ld_variant_ids,
  combine_method       = "acat",
  combine_if_no_cv     = TRUE
)

The output is a list keyed by method, each entry containing z and pval. When combine_if_no_cv = TRUE and at least two methods produce valid p-values, an omnibus entry is appended using the chosen combine method:

combine_method Combination
"acat" (default) Cauchy combination
"hmp" Harmonic mean p-value
"fisher" Fisher’s method
"stouffer" Stouffer’s Z
"invchisq" Inverse chi-square
"gbj" Generalized Berk-Jones
"aspu" Adaptive sum of powered scores
"gates" GATES gene-based test

For summary-statistics TWAS where no cross-validation performance is available to gate methods, combine_if_no_cv = TRUE is the standard default. For individual-level TWAS where CV performance has selected a best method per gene (twas_pipeline() handles this automatically), the omnibus is generally not needed.

Multi-condition joint test: twas_joint_z()

For per-condition weights (the output of twas_multivariate_weights_pipeline() or twas_multivariate_weights_sumstat_pipeline()), twas_joint_z() computes a per-condition TWAS Z plus a cross-condition joint test using any of the same p-value combination methods as twas_analysis():

# weights: variants x conditions matrix
# gwas_z:  GWAS z-scores aligned to the rows of `weights`
# V, D_svd, n_sketch: LD sketch SVD for the gene region

joint <- twas_joint_z(
  weights        = weights,
  z              = gwas_z,
  V              = V,
  D_svd          = D_svd,
  n_sketch       = n_sketch,
  combine_method = "acat"   # or "hmp", "fisher", "stouffer", "invchisq",
                            # "gbj", "aspu", "gates"
)

joint$Z                          # per-condition z + p-value
joint$combined$method            # the combine method used
joint$combined$pval              # joint cross-condition p-value

All eight methods listed in the twas_analysis() table above are accepted. The cross-condition correlation matrix induced by the weights and the LD sketch is computed internally and passed to whichever method needs it (Fisher, Stouffer, inverse chi-square, GBJ, aSPU, GATES); ACAT and HMP work on the p-values directly.

The file-based pipeline: twas_pipeline()

For production use you generally have:

  • A weights database produced by twas_weights_pipeline() (or the summary-statistics analog), saved as one entry per gene-region.
  • Tabix-indexed GWAS sumstats files, one per study.
  • An LD metadata TSV pointing at LD sketch genotype files (one per-chromosome entry, format-agnostic — the loader resolves the underlying genotype representation).

twas_pipeline() consumes all three and produces a per-gene-per-context-per-study TWAS results table, plus an optional MR analysis layer.

Input formats

gwas_meta_file — a TSV with these columns:

Column Meaning
study_id Identifier for the GWAS study
chrom Integer chromosome the file covers (one row per chromosome per study)
file_path Path to the tabix-indexed sumstats file
column_mapping_file Path to the YAML column-name map (see RSS QC vignette)

ld_meta_file_path — TSV pointing at the per-gene LD sketch genotype source. The pipeline loads an LD sketch on the fly for each gene’s window via load_ld_sketch(); only the sketch path is used — the full LD matrix is never materialised.

twas_weights_data — a named list where each entry is a TWASWeights S4 object (one per gene-region). The list keys are typically gene IDs (ENSG...); each entry stores all per-method weight matrices for that gene’s variants. This is the output you carry over from the TWAS weights vignette.

Call shape

# TODO(data): replace with real weights database, GWAS meta, LD meta
weights_db    <- list(
  ENSG00000179403 = "<TWASWeights S4 object from twas_weights_pipeline>",
  ENSG00000269699 = "<...>"
)
gwas_meta_file <- "<path/to/gwas_meta.tsv>"
ld_meta_path   <- "<path/to/ld_meta.tsv>"
region_block   <- "<chrom_start_end>"  # e.g. "1_1000000_2000000"

twas_res <- twas_pipeline(
  twas_weights_data       = weights_db,
  ld_meta_file_path       = ld_meta_path,
  gwas_meta_file          = gwas_meta_file,
  region_block            = region_block,
  ld_reference_sample_size = 17000,         # e.g. ADSP R4
  rsq_cutoff              = 0.01,            # imputability gate (R^2)
  rsq_pval_cutoff         = 0.05,            # imputability gate (p-value)
  mr_method               = "susie",         # MR using SuSiE weights
  mr_pval_cutoff          = 0.05
)

names(twas_res)         # "twas_result", "twas_data", "mr_result"
head(twas_res$twas_result[, c("molecular_id", "context", "gwas_study",
                              "method", "twas_z", "twas_pval",
                              "is_imputable", "is_selected_method")])

The pipeline internally:

  1. Loads the per-gene LD sketch for the region via load_ld_sketch() and computes its SVD once; only V, D, and the sketch sample size are carried forward — no p × p correlation matrix is ever formed.
  2. Calls harmonize_twas() to align the weights and GWAS to the LD sketch (allele QC, sign-flips, intersection on variant_id).
  3. Calls twas_analysis() per gene to apply weights to the harmonised GWAS — using the LD sketch SVD path through twas_z() and including any omnibus combination.
  4. Applies imputability gating (rsq_cutoff, rsq_pval_cutoff) and method fallback (if the selected method’s z-score is NA / non-finite, falls back to the next-best by CV R²).
  5. Optionally runs an MR layer using mr_method’s weights.

Harmonization steps

The harmonization layer can also be called directly:

  • harmonize_gwas(gwas_file, query_region, ld_variants, col_to_flip, match_min_prop, column_file_path, comment_string) — tabix-loads one GWAS file for a region, standardises columns, runs allele QC against ld_variants, returns the harmonised data.frame.
  • harmonize_twas(twas_weights_data, ld_meta_file_path, gwas_meta_file, ld_reference_sample_size, ..., impute_missing, impute_opts) — loops over genes and studies, loads each LD sketch, calls harmonize_gwas() per study, then aligns weights to the harmonised GWAS via match_ref_panel().

The intersection at the harmonization stage is by default between LD-sketch variants and GWAS variants and weight variants. Weight variants that have LD-sketch neighbours but no observed GWAS hit get dropped silently — unless you turn on the option below.

RAISS imputation of missing GWAS variants

When the GWAS panel is narrower than the LD sketch — common when the GWAS was imputed with a smaller reference, or when allele QC trims strand-ambiguous variants — some weight variants get dropped at the weight-vs-GWAS intersection in harmonize_twas(). Setting impute_missing = TRUE calls RAISS to impute GWAS z-scores for LD-sketch variants that are present in the LD sketch but absent from the GWAS:

twas_res <- twas_pipeline(
  twas_weights_data       = weights_db,
  ld_meta_file_path       = ld_meta_path,
  gwas_meta_file          = gwas_meta_file,
  region_block            = region_block,
  ld_reference_sample_size = 17000,
  impute_missing          = TRUE,                # widen GWAS via RAISS
  impute_opts             = list(rcond = 0.01,
                                 R2_threshold = 0.6,
                                 minimum_ld = 5,
                                 lamb = 0.01)
)

RAISS imputes summary statistics only — never weights. Imputed variants with R² < R2_threshold are dropped. Weight variants that have no LD-sketch neighbour at all are warned about and dropped; RAISS cannot impute without neighbours.

Output table

twas_pipeline() returns a list with:

  • twas_result — data.frame, one row per gene-context-study-method combination, with columns:

    Column Meaning
    chr Chromosome
    molecular_id Gene ID
    context Condition (tissue, cell type) name
    gwas_study GWAS study ID from gwas_meta_file
    method Weight method (e.g. susie, enet, lassosum_rss)
    is_imputable Whether the gene passed the imputability gate
    is_selected_method TRUE for the gene-context’s chosen method
    rsq_cv Cross-validated R² (individual-level path only)
    pval_cv p-value for rsq_cv
    twas_z TWAS Z-score for this method
    twas_pval TWAS p-value
    type Molecular trait type (e.g. expression, splicing)
    block LD region identifier
  • twas_data — list of harmonised gene-context inputs in the shape expected by cTWAS::ctwas_sumstats(). Only populated when output_twas_data = TRUE.

  • mr_result — MR analysis output from the chosen mr_method and coverage cutoff, when MR is enabled.

When combine_if_no_cv = TRUE (used in summary-statistics TWAS where CV performance is not available), the per-method rows are augmented with an omnibus row containing the ACAT-combined p-value.

Imputability gating and method fallback

Two gates filter the per-gene-context output:

  • Imputability: a gene-context is kept only if at least one method has rsq_cv ≥ rsq_cutoff and pval_cv < rsq_pval_cutoff. This is the individual-level CV-based gate. Summary-statistics TWAS bypasses this gate (is_imputable = TRUE for all genes; no CV performance available).
  • Selected method: within an imputable gene-context, the method with the highest rsq_cv becomes is_selected_method = TRUE. If that method’s twas_z is NA or non-finite, the pipeline falls back to the next-best method with a valid z-score, surfacing the swap as a TWAS method fallback message.

Summary

Function Use it when
twas_z() One method × one gene; you already have aligned weights, z, and the LD sketch SVD (V, D, n_sketch).
twas_joint_z() Multi-condition weights for one gene; want per-condition TWAS plus a joint test combined via any of acat, hmp, fisher, stouffer, invchisq, gbj, aspu, gates.
twas_analysis() Multi-method weights for one gene; want per-method TWAS plus an omnibus combination.
twas_pipeline() Production: weight database + tabix GWAS + LD sketch metadata → per-gene-per-context-per-study results.
harmonize_twas() / harmonize_gwas() Lower-level harmonization layer used by twas_pipeline(). Pass impute_missing = TRUE to widen GWAS via RAISS.
## R version 4.5.3 (2026-03-11)
## Platform: x86_64-conda-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /home/runner/work/pecotmr/pecotmr/.pixi/envs/default/lib/libopenblasp-r0.3.33.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] pecotmr_0.5.3
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.3.0                   coloc_5.2.3                
##   [3] bitops_1.0-9                gridExtra_2.3              
##   [5] rlang_1.2.0                 magrittr_2.0.5             
##   [7] otel_0.2.0                  matrixStats_1.5.0          
##   [9] susieR_0.16.4               compiler_4.5.3             
##  [11] RSQLite_3.53.1              GenomicFeatures_1.62.0     
##  [13] colocboost_1.0.9            png_0.1-9                  
##  [15] systemfonts_1.3.2           vctrs_0.7.3                
##  [17] quadprog_1.5-8              stringr_1.6.0              
##  [19] pkgconfig_2.0.3             crayon_1.5.3               
##  [21] fastmap_1.2.0               XVector_0.50.0             
##  [23] Rsamtools_2.26.0            rmarkdown_2.31             
##  [25] tzdb_0.5.0                  UCSC.utils_1.6.1           
##  [27] ragg_1.5.2                  purrr_1.2.2                
##  [29] bit_4.6.0                   xfun_0.57                  
##  [31] Rfast_2.1.5.2               cachem_1.1.0               
##  [33] cigarillo_1.0.0             GenomeInfoDb_1.46.2        
##  [35] jsonlite_2.0.0              blob_1.3.0                 
##  [37] reshape_0.8.10              DelayedArray_0.36.0        
##  [39] tictoc_1.2.1                BiocParallel_1.44.0        
##  [41] irlba_2.3.7                 parallel_4.5.3             
##  [43] R6_2.6.1                    VariantAnnotation_1.56.0   
##  [45] bslib_0.11.0                stringi_1.8.7              
##  [47] RColorBrewer_1.1-3          rtracklayer_1.70.1         
##  [49] GenomicRanges_1.62.1        jquerylib_0.1.4            
##  [51] Rcpp_1.1.1-1.1              Seqinfo_1.0.0              
##  [53] SummarizedExperiment_1.40.0 knitr_1.51                 
##  [55] R.utils_2.13.0              readr_2.2.0                
##  [57] IRanges_2.44.0              Matrix_1.7-5               
##  [59] tidyselect_1.2.1            viridis_0.6.5              
##  [61] abind_1.4-8                 yaml_2.3.12                
##  [63] codetools_0.2-20            curl_7.1.0                 
##  [65] plyr_1.8.9                  lattice_0.22-9             
##  [67] tibble_3.3.1                Biobase_2.70.0             
##  [69] KEGGREST_1.50.0             S7_0.2.2                   
##  [71] evaluate_1.0.5              desc_1.4.3                 
##  [73] RcppParallel_5.1.11-2       Biostrings_2.78.0          
##  [75] pillar_1.11.1               MatrixGenerics_1.22.0      
##  [77] stats4_4.5.3                ieugwasr_1.1.0             
##  [79] generics_0.1.4              vroom_1.7.1                
##  [81] RCurl_1.98-1.19             hms_1.1.4                  
##  [83] S4Vectors_0.48.0            ggplot2_4.0.3              
##  [85] scales_1.4.0                glue_1.8.1                 
##  [87] tools_4.5.3                 MungeSumstats_1.18.1       
##  [89] BiocIO_1.20.0               data.table_1.17.8          
##  [91] BSgenome_1.78.0             GenomicAlignments_1.46.0   
##  [93] fs_2.1.0                    XML_3.99-0.22              
##  [95] grid_4.5.3                  tidyr_1.3.2                
##  [97] AnnotationDbi_1.72.0        restfulr_0.0.16            
##  [99] cli_3.6.6                   zigg_0.0.2                 
## [101] textshaping_1.0.5           viridisLite_0.4.3          
## [103] mixsqp_0.3-54               S4Arrays_1.10.1            
## [105] dplyr_1.2.1                 gtable_0.3.6               
## [107] R.methodsS3_1.8.2           sass_0.4.10                
## [109] digest_0.6.39               BiocGenerics_0.56.0        
## [111] SparseArray_1.10.8          rjson_0.2.23               
## [113] htmlwidgets_1.6.4           farver_2.1.2               
## [115] memoise_2.0.1               htmltools_0.5.9            
## [117] pkgdown_2.2.0               R.oo_1.27.1                
## [119] lifecycle_1.0.5             httr_1.4.8                 
## [121] bit64_4.8.2                 MASS_7.3-65