Inferring TWAS Z-Scores from GWAS Summary Statistics
pecotmr authors
2026-06-11
Source:vignettes/twas-zscore.Rmd
twas-zscore.RmdOverview
This vignette picks up where the TWAS
weights vignette leaves off: a TWASWeights S4 object
with per-gene per-variant weights already exists. The remaining task is
to combine those weights with GWAS summary statistics to obtain a
per-gene TWAS Z-score.
pecotmr exposes this inference at three layers:
| Layer | Entry point | When to use |
|---|---|---|
| File-based pipeline | twas_pipeline() |
Production: per-gene weights database + tabix-indexed GWAS sumstats + LD sketch genotypes; outputs per-gene-per-context-per-study TWAS results with imputability gating, method fallback, and optional MR. |
| Per-gene inference | twas_analysis() |
In-memory: applies many weight methods to one gene’s GWAS sumstats and combines per-method p-values (ACAT / HMP / Fisher / Stouffer / etc.) into an omnibus test. |
| Single-method math |
twas_z(), twas_joint_z()
|
When you already have aligned weights, z,
and an LD sketch. Convenient for custom workflows or method
comparisons. |
The TWAS Z-score formula
Per single weight vector w and GWAS z-score vector
z, the TWAS test statistic is
z_\text{TWAS} \;=\; \frac{w^\top z}{\sqrt{w^\top R\, w}}
where R is the LD matrix at the weight variants. Under
the null, z_TWAS follows a standard normal.
Eigenvalue / eigenvector reparameterisation
pecotmr evaluates the denominator from the full
SVD of the standardised LD sketch genotype matrix rather than
materialising R. If X = U D V^\top is the
(untruncated) SVD of the standardised sketch, then
R \;=\; \frac{X^\top X}{n_\text{sketch}-1} \;=\; V\, \Lambda\, V^\top, \qquad \Lambda_i \;=\; \frac{D_i^{2}}{n_\text{sketch}-1},
and
w^\top R\, w \;=\; \sum_{i=1}^{r} \Lambda_i \,\bigl(V^\top w\bigr)_i^{2}, \qquad r \;=\; \mathrm{rank}(X).
The sum runs over every nonzero singular value — no top-K truncation. The two expressions are algebraically equivalent (and numerically agree to floating-point precision when both are computed), so this is a reparameterisation, not an approximation.
We use this form for two practical reasons:
- The SVD is computed once per gene region and reused
across every method × condition × study evaluation. Each subsequent
denominator costs
O(r p)forV^\top wplusO(r)for the weighted sum, versusO(p^2)to multiply against an explicitR. - When the LD sketch has
n_\text{sketch} \le p,Ris rank-deficient (rankr, withn_\text{sketch} \ge r). Working in ther-dimensional eigenbasis represents exactly the information the sketch carries and avoids forming a singularp \times pmatrix.
In practice this means twas_z() and
twas_joint_z() consume (V, D, n_\text{sketch})
directly from the per-gene LD sketch — R is never formed at
any step of the pipeline.
Single-method inference: twas_z()
Given aligned weights, GWAS z-scores, and the LD sketch components,
twas_z() returns the per-gene Z and two-sided p-value:
# Inputs from the upstream pipelines (twas-weights.html and earlier):
# w : numeric vector, one weight per variant for one gene/method
# gwas_z : numeric vector of GWAS z-scores, aligned to the same variants
# V, D, n : right singular vectors, singular values, sample size of the
# LD sketch for the gene region
res <- twas_z(weights = w, z = gwas_z,
V = V, D = D, n_sketch = n)
res$z # TWAS z-score
res$pval # two-sided p-valueV, D, and n_sketch come from
the LD sketch the pipeline already loads for each gene (see “The
file-based pipeline” below); you do not need to compute them yourself in
the normal workflow.
Combining multiple methods per gene:
twas_analysis()
When the gene’s TWASWeights carries weights for several
methods (the default for both twas_weights_pipeline() and
twas_weights_sumstat_pipeline()),
twas_analysis() applies each method’s weights to the GWAS
and optionally combines the per-method p-values into an omnibus
test:
# weights_matrix: variants x methods (one column per weight method)
# gwas_sumstats_db: data.frame with variant_id, z, ... for the GWAS
# V, D, n_sketch, ld_variant_ids: LD sketch components for the gene region
res <- twas_analysis(
weights_matrix = weights_matrix,
gwas_sumstats_db = gwas_sumstats_db,
extract_variants_objs = rownames(weights_matrix),
V = V, D = D, n_sketch = n_sketch, ld_variant_ids = ld_variant_ids,
combine_method = "acat",
combine_if_no_cv = TRUE
)The output is a list keyed by method, each entry containing
z and pval. When
combine_if_no_cv = TRUE and at least two methods produce
valid p-values, an omnibus entry is appended using the
chosen combine method:
combine_method |
Combination |
|---|---|
"acat" (default) |
Cauchy combination |
"hmp" |
Harmonic mean p-value |
"fisher" |
Fisher’s method |
"stouffer" |
Stouffer’s Z |
"invchisq" |
Inverse chi-square |
"gbj" |
Generalized Berk-Jones |
"aspu" |
Adaptive sum of powered scores |
"gates" |
GATES gene-based test |
For summary-statistics TWAS where no cross-validation performance is
available to gate methods, combine_if_no_cv = TRUE is the
standard default. For individual-level TWAS where CV performance has
selected a best method per gene (twas_pipeline() handles
this automatically), the omnibus is generally not needed.
Multi-condition joint test: twas_joint_z()
For per-condition weights (the output of
twas_multivariate_weights_pipeline() or
twas_multivariate_weights_sumstat_pipeline()),
twas_joint_z() computes a per-condition TWAS Z plus a
cross-condition joint test using any of the same p-value combination
methods as twas_analysis():
# weights: variants x conditions matrix
# gwas_z: GWAS z-scores aligned to the rows of `weights`
# V, D_svd, n_sketch: LD sketch SVD for the gene region
joint <- twas_joint_z(
weights = weights,
z = gwas_z,
V = V,
D_svd = D_svd,
n_sketch = n_sketch,
combine_method = "acat" # or "hmp", "fisher", "stouffer", "invchisq",
# "gbj", "aspu", "gates"
)
joint$Z # per-condition z + p-value
joint$combined$method # the combine method used
joint$combined$pval # joint cross-condition p-valueAll eight methods listed in the twas_analysis() table
above are accepted. The cross-condition correlation matrix induced by
the weights and the LD sketch is computed internally and passed to
whichever method needs it (Fisher, Stouffer, inverse chi-square, GBJ,
aSPU, GATES); ACAT and HMP work on the p-values directly.
The file-based pipeline: twas_pipeline()
For production use you generally have:
- A weights database produced by
twas_weights_pipeline()(or the summary-statistics analog), saved as one entry per gene-region. - Tabix-indexed GWAS sumstats files, one per study.
- An LD metadata TSV pointing at LD sketch genotype files (one per-chromosome entry, format-agnostic — the loader resolves the underlying genotype representation).
twas_pipeline() consumes all three and produces a
per-gene-per-context-per-study TWAS results table, plus an optional MR
analysis layer.
Input formats
gwas_meta_file — a TSV with these
columns:
| Column | Meaning |
|---|---|
study_id |
Identifier for the GWAS study |
chrom |
Integer chromosome the file covers (one row per chromosome per study) |
file_path |
Path to the tabix-indexed sumstats file |
column_mapping_file |
Path to the YAML column-name map (see RSS QC vignette) |
ld_meta_file_path — TSV pointing at the
per-gene LD sketch genotype source. The pipeline loads an LD sketch on
the fly for each gene’s window via load_ld_sketch(); only
the sketch path is used — the full LD matrix is never materialised.
twas_weights_data — a named list where
each entry is a TWASWeights S4 object (one per
gene-region). The list keys are typically gene IDs
(ENSG...); each entry stores all per-method weight matrices
for that gene’s variants. This is the output you carry over from the TWAS weights vignette.
Call shape
# TODO(data): replace with real weights database, GWAS meta, LD meta
weights_db <- list(
ENSG00000179403 = "<TWASWeights S4 object from twas_weights_pipeline>",
ENSG00000269699 = "<...>"
)
gwas_meta_file <- "<path/to/gwas_meta.tsv>"
ld_meta_path <- "<path/to/ld_meta.tsv>"
region_block <- "<chrom_start_end>" # e.g. "1_1000000_2000000"
twas_res <- twas_pipeline(
twas_weights_data = weights_db,
ld_meta_file_path = ld_meta_path,
gwas_meta_file = gwas_meta_file,
region_block = region_block,
ld_reference_sample_size = 17000, # e.g. ADSP R4
rsq_cutoff = 0.01, # imputability gate (R^2)
rsq_pval_cutoff = 0.05, # imputability gate (p-value)
mr_method = "susie", # MR using SuSiE weights
mr_pval_cutoff = 0.05
)
names(twas_res) # "twas_result", "twas_data", "mr_result"
head(twas_res$twas_result[, c("molecular_id", "context", "gwas_study",
"method", "twas_z", "twas_pval",
"is_imputable", "is_selected_method")])The pipeline internally:
- Loads the per-gene LD sketch for the region via
load_ld_sketch()and computes its SVD once; onlyV,D, and the sketch sample size are carried forward — nop × pcorrelation matrix is ever formed. - Calls
harmonize_twas()to align the weights and GWAS to the LD sketch (allele QC, sign-flips, intersection onvariant_id). - Calls
twas_analysis()per gene to apply weights to the harmonised GWAS — using the LD sketch SVD path throughtwas_z()and including any omnibus combination. - Applies imputability gating (
rsq_cutoff,rsq_pval_cutoff) and method fallback (if the selected method’s z-score is NA / non-finite, falls back to the next-best by CV R²). - Optionally runs an MR layer using
mr_method’s weights.
Harmonization steps
The harmonization layer can also be called directly:
-
harmonize_gwas(gwas_file, query_region, ld_variants, col_to_flip, match_min_prop, column_file_path, comment_string)— tabix-loads one GWAS file for a region, standardises columns, runs allele QC againstld_variants, returns the harmonised data.frame. -
harmonize_twas(twas_weights_data, ld_meta_file_path, gwas_meta_file, ld_reference_sample_size, ..., impute_missing, impute_opts)— loops over genes and studies, loads each LD sketch, callsharmonize_gwas()per study, then aligns weights to the harmonised GWAS viamatch_ref_panel().
The intersection at the harmonization stage is by default between LD-sketch variants and GWAS variants and weight variants. Weight variants that have LD-sketch neighbours but no observed GWAS hit get dropped silently — unless you turn on the option below.
RAISS imputation of missing GWAS variants
When the GWAS panel is narrower than the LD sketch — common when the
GWAS was imputed with a smaller reference, or when allele QC trims
strand-ambiguous variants — some weight variants get dropped at the
weight-vs-GWAS intersection in harmonize_twas(). Setting
impute_missing = TRUE calls RAISS to impute GWAS z-scores
for LD-sketch variants that are present in the LD sketch but absent from
the GWAS:
twas_res <- twas_pipeline(
twas_weights_data = weights_db,
ld_meta_file_path = ld_meta_path,
gwas_meta_file = gwas_meta_file,
region_block = region_block,
ld_reference_sample_size = 17000,
impute_missing = TRUE, # widen GWAS via RAISS
impute_opts = list(rcond = 0.01,
R2_threshold = 0.6,
minimum_ld = 5,
lamb = 0.01)
)RAISS imputes summary statistics only — never
weights. Imputed variants with R² < R2_threshold are
dropped. Weight variants that have no LD-sketch neighbour at all are
warned about and dropped; RAISS cannot impute without neighbours.
Output table
twas_pipeline() returns a list with:
-
twas_result— data.frame, one row per gene-context-study-method combination, with columns:Column Meaning chrChromosome molecular_idGene ID contextCondition (tissue, cell type) name gwas_studyGWAS study ID from gwas_meta_filemethodWeight method (e.g. susie,enet,lassosum_rss)is_imputableWhether the gene passed the imputability gate is_selected_methodTRUE for the gene-context’s chosen method rsq_cvCross-validated R² (individual-level path only) pval_cvp-value for rsq_cvtwas_zTWAS Z-score for this method twas_pvalTWAS p-value typeMolecular trait type (e.g. expression,splicing)blockLD region identifier twas_data— list of harmonised gene-context inputs in the shape expected bycTWAS::ctwas_sumstats(). Only populated whenoutput_twas_data = TRUE.mr_result— MR analysis output from the chosenmr_methodand coverage cutoff, when MR is enabled.
When combine_if_no_cv = TRUE (used in summary-statistics
TWAS where CV performance is not available), the per-method rows are
augmented with an omnibus row containing the ACAT-combined
p-value.
Imputability gating and method fallback
Two gates filter the per-gene-context output:
-
Imputability: a gene-context is kept only if
at least one method has
rsq_cv ≥ rsq_cutoffandpval_cv < rsq_pval_cutoff. This is the individual-level CV-based gate. Summary-statistics TWAS bypasses this gate (is_imputable = TRUEfor all genes; no CV performance available). -
Selected method: within an imputable gene-context,
the method with the highest
rsq_cvbecomesis_selected_method = TRUE. If that method’stwas_zis NA or non-finite, the pipeline falls back to the next-best method with a valid z-score, surfacing the swap as aTWAS method fallbackmessage.
Summary
| Function | Use it when |
|---|---|
twas_z() |
One method × one gene; you already have aligned
weights, z, and the LD sketch SVD
(V, D, n_sketch). |
twas_joint_z() |
Multi-condition weights for one gene; want per-condition TWAS plus a
joint test combined via any of acat, hmp,
fisher, stouffer, invchisq,
gbj, aspu, gates. |
twas_analysis() |
Multi-method weights for one gene; want per-method TWAS plus an omnibus combination. |
twas_pipeline() |
Production: weight database + tabix GWAS + LD sketch metadata → per-gene-per-context-per-study results. |
harmonize_twas() / harmonize_gwas()
|
Lower-level harmonization layer used by
twas_pipeline(). Pass impute_missing = TRUE to
widen GWAS via RAISS. |
## R version 4.5.3 (2026-03-11)
## Platform: x86_64-conda-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS/LAPACK: /home/runner/work/pecotmr/pecotmr/.pixi/envs/default/lib/libopenblasp-r0.3.33.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] pecotmr_0.5.3
##
## loaded via a namespace (and not attached):
## [1] DBI_1.3.0 coloc_5.2.3
## [3] bitops_1.0-9 gridExtra_2.3
## [5] rlang_1.2.0 magrittr_2.0.5
## [7] otel_0.2.0 matrixStats_1.5.0
## [9] susieR_0.16.4 compiler_4.5.3
## [11] RSQLite_3.53.1 GenomicFeatures_1.62.0
## [13] colocboost_1.0.9 png_0.1-9
## [15] systemfonts_1.3.2 vctrs_0.7.3
## [17] quadprog_1.5-8 stringr_1.6.0
## [19] pkgconfig_2.0.3 crayon_1.5.3
## [21] fastmap_1.2.0 XVector_0.50.0
## [23] Rsamtools_2.26.0 rmarkdown_2.31
## [25] tzdb_0.5.0 UCSC.utils_1.6.1
## [27] ragg_1.5.2 purrr_1.2.2
## [29] bit_4.6.0 xfun_0.57
## [31] Rfast_2.1.5.2 cachem_1.1.0
## [33] cigarillo_1.0.0 GenomeInfoDb_1.46.2
## [35] jsonlite_2.0.0 blob_1.3.0
## [37] reshape_0.8.10 DelayedArray_0.36.0
## [39] tictoc_1.2.1 BiocParallel_1.44.0
## [41] irlba_2.3.7 parallel_4.5.3
## [43] R6_2.6.1 VariantAnnotation_1.56.0
## [45] bslib_0.11.0 stringi_1.8.7
## [47] RColorBrewer_1.1-3 rtracklayer_1.70.1
## [49] GenomicRanges_1.62.1 jquerylib_0.1.4
## [51] Rcpp_1.1.1-1.1 Seqinfo_1.0.0
## [53] SummarizedExperiment_1.40.0 knitr_1.51
## [55] R.utils_2.13.0 readr_2.2.0
## [57] IRanges_2.44.0 Matrix_1.7-5
## [59] tidyselect_1.2.1 viridis_0.6.5
## [61] abind_1.4-8 yaml_2.3.12
## [63] codetools_0.2-20 curl_7.1.0
## [65] plyr_1.8.9 lattice_0.22-9
## [67] tibble_3.3.1 Biobase_2.70.0
## [69] KEGGREST_1.50.0 S7_0.2.2
## [71] evaluate_1.0.5 desc_1.4.3
## [73] RcppParallel_5.1.11-2 Biostrings_2.78.0
## [75] pillar_1.11.1 MatrixGenerics_1.22.0
## [77] stats4_4.5.3 ieugwasr_1.1.0
## [79] generics_0.1.4 vroom_1.7.1
## [81] RCurl_1.98-1.19 hms_1.1.4
## [83] S4Vectors_0.48.0 ggplot2_4.0.3
## [85] scales_1.4.0 glue_1.8.1
## [87] tools_4.5.3 MungeSumstats_1.18.1
## [89] BiocIO_1.20.0 data.table_1.17.8
## [91] BSgenome_1.78.0 GenomicAlignments_1.46.0
## [93] fs_2.1.0 XML_3.99-0.22
## [95] grid_4.5.3 tidyr_1.3.2
## [97] AnnotationDbi_1.72.0 restfulr_0.0.16
## [99] cli_3.6.6 zigg_0.0.2
## [101] textshaping_1.0.5 viridisLite_0.4.3
## [103] mixsqp_0.3-54 S4Arrays_1.10.1
## [105] dplyr_1.2.1 gtable_0.3.6
## [107] R.methodsS3_1.8.2 sass_0.4.10
## [109] digest_0.6.39 BiocGenerics_0.56.0
## [111] SparseArray_1.10.8 rjson_0.2.23
## [113] htmlwidgets_1.6.4 farver_2.1.2
## [115] memoise_2.0.1 htmltools_0.5.9
## [117] pkgdown_2.2.0 R.oo_1.27.1
## [119] lifecycle_1.0.5 httr_1.4.8
## [121] bit64_4.8.2 MASS_7.3-65