Illustration of xQTL protocol#

This notebook illustrates the computational protocols available from this repository for the detection and analysis of molecular QTLs (xQTLs). A minimal toy dataset consisting of 49 de-identified samples is used for the analysis.

Analysis#

Please visit the homepage of the protocol website for general background on this resource, in particular the How to use the resource section. To perform a complete analysis from molecular phenotype quantification to xQTL discovery, conduct your analysis in the order listed below. Each link contains a mini-protocol for a specific task. All commands documented in each mini-protocol should be executed in a command line environment.

Molecular Phenotype Quantification#

Molecular phenotype data are required for QTL mapping. Our pipeline supports bulk RNA-Seq, methylation and splicing phenotypes. Several reference data files are required before molecular phenotypes can be quantified in samples. These include, but are not limited to, reference genomes, gene annotations, variant annotations, linkage disequilibrium data and topologically associating domains. Gene expression is quantified with either RNA-SeQC for gene-level counts or RSEM for transcript-level counts. Alternative splicing events are quantified with leafcutter2, which identifies alternatively excised introns. DNA methylation is quantified using SeSAMe. Each of these molecular phenotypes then undergoes phenotype-specific quality control and normalization.

Data Pre-Processing#

Preprocessing of genotype data begins with the application of variant filters using bcftools. VCF files are then converted to PLINK format so that kinship analyses can be performed to identify unrelated individuals. Genetic principal components are then generated for the unrelated samples, and genotype files are formatted for downstream QTL analysis.
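
The step of identifying unrelated individuals from kinship estimates can be illustrated with a minimal sketch. This is not the pipeline's implementation: the function name `select_unrelated`, the greedy strategy, the toy kinship values and the 0.0884 cutoff are all assumptions chosen for illustration, and the threshold actually used by the protocol is not stated here.

```python
def select_unrelated(samples, kinship, threshold):
    """Greedily drop samples until no remaining pair has kinship > threshold.

    `kinship` maps (sample_a, sample_b) tuples to a kinship coefficient.
    """
    related = {s: set() for s in samples}
    for (a, b), phi in kinship.items():
        if phi > threshold:
            related[a].add(b)
            related[b].add(a)

    kept = set(samples)
    while kept:
        # Pick the sample with the most remaining relatives
        # (iterating in sorted order makes tie-breaking deterministic).
        worst = max(sorted(kept), key=lambda s: len(related[s] & kept))
        if not (related[worst] & kept):
            break  # no related pairs are left
        kept.remove(worst)
    return sorted(kept)


# Hypothetical toy input: B is closely related to both A and C.
toy_samples = ["A", "B", "C", "D"]
toy_kinship = {("A", "B"): 0.25, ("B", "C"): 0.25, ("C", "D"): 0.01}
unrelated = select_unrelated(toy_samples, toy_kinship, threshold=0.0884)
```

Removing the most connected sample first tends to retain more samples than dropping one member of each related pair arbitrarily, although a greedy pass is not guaranteed to find the largest possible unrelated set.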

Preprocessing of phenotypic data begins with annotation of features, if required. Missing entries may then be imputed using a variety of methods included in the pipeline. Last, the phenotypes are formatted for later generation of quantitative trait loci.
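
To make the imputation step concrete, here is a minimal sketch of one simple strategy: filling each feature's missing entries with that feature's observed mean. The pipeline offers several imputation methods that are not detailed on this page, so this should be read as an illustrative stand-in. The function name `impute_feature_means` and the toy matrix are hypothetical.

```python
from statistics import mean


def impute_feature_means(matrix):
    """Replace None entries in each row (feature) with that row's observed mean."""
    imputed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        if not observed:
            # A feature with no observed values cannot be imputed; keep it unchanged.
            imputed.append(list(row))
            continue
        fill = mean(observed)
        imputed.append([fill if v is None else v for v in row])
    return imputed


# Rows are features (e.g. genes), columns are samples; None marks a missing entry.
toy = [
    [1.0, None, 3.0],
    [None, 4.0, 4.0],
]
completed = impute_feature_means(toy)
```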

Preprocessing of covariates begins with the merging of phenotypic data with previously generated genetic principal components. The merged data is then used to calculate hidden factors which will later be used as additional covariates.
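
The merging step described above amounts to joining two sample-level tables on sample ID. The sketch below assumes both tables are held as dictionaries keyed by sample ID. The function name `merge_covariates`, the column names and the values are hypothetical, and the hidden factor calculation that follows in the pipeline is not shown.

```python
def merge_covariates(sample_table, genetic_pcs):
    """Inner-join two {sample_id: {column: value}} tables on sample ID."""
    shared = sorted(set(sample_table) & set(genetic_pcs))
    return {s: {**sample_table[s], **genetic_pcs[s]} for s in shared}


# Hypothetical toy tables; S3 has no genetic PCs and is therefore dropped.
sample_table = {
    "S1": {"age": 70, "sex": 1},
    "S2": {"age": 65, "sex": 0},
    "S3": {"age": 80, "sex": 1},
}
genetic_pcs = {
    "S1": {"PC1": 0.12, "PC2": -0.03},
    "S2": {"PC1": -0.40, "PC2": 0.21},
}
merged = merge_covariates(sample_table, genetic_pcs)
```

An inner join is used so that every retained sample has a complete set of covariates.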

QTL Association Analysis#

QTL association analysis is conducted with TensorQTL. We support both cis and trans analysis, optionally with interaction terms. Hierarchical multiple testing may then be applied to the results to adjust p-values.
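
As a rough illustration of what adjusting p-values involves, the sketch below implements the standard single-level Benjamini-Hochberg procedure. It is not the hierarchical procedure used in the protocol, which applies corrections at more than one level. The function name `benjamini_hochberg` and the example p-values are assumptions made for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four tests.
pvals = [0.01, 0.04, 0.03, 0.20]
qvals = benjamini_hochberg(pvals)
```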

Integrative Analysis#

We include methods to conduct TWAS in our pipeline to identify genes associated with complex traits.

Our pipeline includes multiple methods for fine-mapping of QTLs. Univariate fine-mapping and TWAS with SuSiE generates TWAS weights and credible sets using SuSiE. Regression with summary statistics allows for the inclusion of summary statistics from GWAS in SuSiE fine-mapping. Univariate fine-mapping of functional data uses epigenomic data to fine-map with fSuSiE.

We also include methods for colocalization analysis. This starts with the generation of prior probabilities, followed by pairwise colocalization analysis of xQTL and GWAS fine-mapping results to identify shared causal variants. We also include an alternative method, colocboost, to identify shared genetic variants influencing multiple molecular traits.

We utilize an excess of overlap method to evaluate the enrichment of significant variants within specific genomic annotations. Pathway enrichment analysis identifies biological pathways that are statistically overrepresented in a given gene set, giving information on potential biological functions, disease relevance, or regulatory mechanisms associated with the gene set. Stratified LD Score Regression (S-LDSC) is used to quantify the contribution of different genomic functional annotations to the heritability of complex traits and assess their statistical significance. By integrating GWAS summary statistics with genome annotations, S-LDSC distinguishes true polygenic signals from confounding effects.
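
The idea of a pathway being statistically overrepresented in a gene set can be sketched with a hypergeometric tail probability, which is one common way to frame such a test. Whether the protocol uses this exact test is not stated here, so treat this as an assumption. The function name `overrepresentation_pvalue` and the toy counts are hypothetical.

```python
from math import comb


def overrepresentation_pvalue(n_background, n_pathway, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_background genes, of which n_pathway belong to the pathway."""
    total = comb(n_background, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy counts: 20 background genes, 5 in the pathway,
# 4 genes in the gene set, 3 of which fall in the pathway.
p_enrich = overrepresentation_pvalue(20, 5, 4, 3)
```

A small p-value here means that an overlap at least this large would rarely arise if the gene set were drawn at random from the background.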

Data#

For record keeping: preparation of the demo dataset is documented on this page, which is a private repository accessible to FunGen-xQTL analysis working group members.

For the protocols listed on this page, download the required input data from Synapse.

  • To download the data, first create a user account on Synapse Login. Your username and password will be required when downloading.

  • Downloading requires installing the Synapse API clients. Type pip install synapseclient in a terminal or Command Prompt to install the Python package. Details are listed on this page.

  • Each folder, at any level, has a unique Synapse ID, which allows you to download only specific folders or files rather than the entire folder.

To download the test data for the section “Bulk RNA-seq molecular phenotype quantification”, use the following Python code:

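import synapseclient
import synapseutils
# Log in to Synapse; replace the placeholders with your own credentials.
# Recent synapseclient releases may instead require a personal access token,
# e.g. syn.login(authToken="your personal access token").
syn = synapseclient.Synapse()
syn.login("your username on synapse.org","your password on synapse.org")
# Recursively download the folder with this Synapse ID into the current directory.
files = synapseutils.syncFromSynapse(syn, 'syn53174239', path="./")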

To download the test data for the section “xQTL association analysis”, use the following Python code:

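import synapseclient
import synapseutils
# Log in to Synapse; replace the placeholders with your own credentials.
# Recent synapseclient releases may instead require a personal access token,
# e.g. syn.login(authToken="your personal access token").
syn = synapseclient.Synapse()
syn.login("your username on synapse.org","your password on synapse.org")
# Recursively download the folder with this Synapse ID into the current directory.
files = synapseutils.syncFromSynapse(syn, 'syn52369482', path="./")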

Software environment: use Singularity containers#

Analyses documented on this website are best performed using the containers we provide, through either Singularity (recommended) or Docker, via the --container option pointing to a container image file. For example, --container oras://ghcr.io/statfungen/tensorqtl_apptainer:latest uses a Singularity image to perform QTL association mapping with TensorQTL. If you drop the --container option, the analysis will rely on software installed on your computer.

Troubleshooting#

If you run into errors relating to R libraries while using the --container option, you may need to unset your local R library paths before running the sos commands. For example, this error:

Error in dyn.load(file, DLLpath = DLLPath, ...):
unable to load shared object '$PATH/R/x86_64-pc-linux-gnu-library/4.2/stringi/libs/stringi.so':
libicui18n.so.63: cannot open shared object file: No such file or directory

may be fixed by running the following before the sos commands:

export R_LIBS=""
export R_LIBS_USER=""
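
When a pipeline command is launched from Python rather than typed into a shell, the same effect can be achieved by passing a modified environment to the child process. The sketch below is one possible way to do this. The helper name `run_without_local_r_libs` is hypothetical, and a small probe command stands in for the actual sos command line, which is not reproduced here.

```python
import os
import subprocess
import sys


def run_without_local_r_libs(command):
    """Run `command` with R_LIBS and R_LIBS_USER set to empty strings."""
    env = dict(os.environ, R_LIBS="", R_LIBS_USER="")
    return subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )


# Stand-in command: print both variables as the child process sees them.
# In practice `command` would be the sos command line, given as a list of arguments.
probe = [
    sys.executable,
    "-c",
    "import os; print(repr(os.environ.get('R_LIBS')), "
    "repr(os.environ.get('R_LIBS_USER')))",
]
result = run_without_local_r_libs(probe)
```

Because only the child's environment is changed, the variables in the calling session are left untouched.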

Analyses on High Performance Computing clusters#

The protocol example shown above performs the analysis on a desktop workstation, as a demonstration. Typically, the analyses should be performed in HPC cluster environments. This can be achieved via SoS Remote Tasks on configured host computers. We provide this toy example for running a SoS pipeline in a typical HPC cluster environment. First-time users are encouraged to try it out to help set up the computational environment necessary to run the analyses in this protocol.