Getting Started#
The FunGen-xQTL protocol is a reproducible, end-to-end pipeline for molecular quantitative trait loci (QTL) analysis — from raw genotypes and phenotypes through discovery, fine-mapping, and integration with GWAS.
This page is a guided on-ramp. A minimal toy dataset of 49 de-identified samples is used throughout the examples so you can try every pipeline end-to-end before running on real data. In about an hour you’ll install the environment, clone the repo, download the demo dataset, and run your first cis-QTL scan.
See also
New to the consortium? Start with How to use the resource on the homepage for the big-picture background, then come back here to set up.
Before You Start#
You’ll need a Linux or macOS shell. Windows users: install WSL2 first, then follow the Linux path.
| Requirement | Minimum | Recommended |
|---|---|---|
| Disk space | 10 GB (minimal install) | 40 GB (full bioinformatics stack) |
| Memory | 16 GB | 50 GB+ on HPC for the installer |
| Network | GitHub, conda-forge, synapse.org | Same |
| Git | Any recent version | 2.30+ |
Tip
On HPC — make sure you have access to a compute node with at least 50 GB of memory for the pixi installation step (Step 2). Login nodes often kill large installs. See Step 2 for details.
Step 1. Install SoS in a Conda Environment#
The protocol’s pipelines are written as SoS (Script of Scripts) workflows. First, create a dedicated conda environment and install SoS along with its language modules. Full installation reference: SoS Conda installation guide.
If you don’t have conda yet, install Miniforge (recommended) or Anaconda.
# Create and activate a new environment
conda create -n sos python=3.12 -y
conda activate sos
# Install the full SoS suite
conda install -c conda-forge \
sos sos-pbs sos-notebook jupyterlab-sos sos-papermill \
sos-bash sos-python sos-r
# Register the SoS kernel with Jupyter
python -m sos_notebook.install
Verify:
sos --version
jupyter kernelspec list # should include 'sos'
Tip
Make sure you always conda activate sos before running any pipeline commands.
Step 2. Install the xQTL Software Stack with pixi#
Next, install the bioinformatics and data-science packages the protocol depends on using pixi via the StatFunGen/pixi-setup installer.
On HPC systems, your home directory likely has a storage quota that won’t fit the full install. Temporarily point $HOME to a path with enough space, and add pixi to your $PATH:
# Point HOME to a location with enough disk space
export HOME="/your_pixi_install_path"
# Add pixi to your path
export PATH="/your_pixi_install_path/.pixi/bin:$PATH"
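Before launching the installer it is worth sanity-checking that both variables point where you intend, since a failed 35 GB install is slow to notice. The snippet below is a minimal sketch; /tmp/pixi_demo stands in for your real large-quota path:

```shell
# Stand-in for your real large-quota install path
PIXI_PREFIX="/tmp/pixi_demo"
mkdir -p "${PIXI_PREFIX}/.pixi/bin"

export HOME="${PIXI_PREFIX}"
export PATH="${PIXI_PREFIX}/.pixi/bin:${PATH}"

# Confirm the redirection took effect before running the installer
echo "HOME=${HOME}"
case ":${PATH}:" in
  *":${PIXI_PREFIX}/.pixi/bin:"*) echo "PATH OK" ;;
  *) echo "PATH misconfigured" ;;
esac
```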
Then run the installer:
curl -fsSL https://raw.githubusercontent.com/StatFunGen/pixi-setup/refs/heads/main/pixi-setup.sh | bash
On a laptop or workstation you can skip the HOME/PATH exports and just run the curl command — the installer will prompt you to choose an install path and type interactively.
The installer will prompt you for two things:
1. Installation path — where pixi stores environments and packages.
| Setting | When to use |
|---|---|
| Default (home directory) | Laptops and workstations with plenty of home-directory space |
| Custom path | HPC systems with strict home-directory quotas |
2. Installation type
| Type | Size | Files | Includes |
|---|---|---|---|
| 1. minimal | ~5 GB | ~100k | CLI tools, Python data-science stack, JupyterLab, base R (tidyverse, devtools, IRkernel) |
| 2. full | ~35 GB | ~350k | Everything above, plus samtools, bcftools, plink2, GATK4, STAR, Seurat, tensorQTL, Bioconductor |
Choose minimal for xQTL runs with pre-processed inputs; choose full if you’ll also do upstream QC, alignment, or single-cell work.
Activate and verify:
source ~/.bashrc # or ~/.zshrc on macOS
pixi --version
Warning
On HPC, run the installer from a compute node with at least 50 GB of memory, not the login node. The install process can be memory-intensive and may be killed on login nodes:
srun --mem=50G --pty bash # SLURM
bsub -Is -M 50000 -n 4 bash # LSF
Step 3. Clone the Protocol#
git clone https://github.com/StatFunGen/xqtl-protocol.git
cd xqtl-protocol
Note
What’s in the repo
| Folder | Contents |
|---|---|
| pipeline/ | The SoS workflows you'll run |
| code/ | Notebook-based documentation (this page lives here) |
| data/ | Small example inputs and configuration templates |
| website/ | JupyterBook sources for statfungen.github.io/xqtl-protocol |
Step 4. Download the Demo Data#
Preparation of the demo dataset is documented on this page (a private repository accessible to FunGen-xQTL working group members). The data itself lives on Synapse.
Create a free account on synapse.org — username and password are required to download.
Install the Synapse API client into pixi’s python environment:
pixi global install -c conda-forge --environment python synapseclient
Alternatively, pip install synapseclient; see the Synapse install docs for details.
Every folder at each level of the Synapse project has its own unique ID, so you can download just the subset you need.
Bulk RNA-seq molecular phenotype quantification — test data:
import synapseclient
import synapseutils
syn = synapseclient.Synapse()
syn.login("your username on synapse.org", "your password on synapse.org")
files = synapseutils.syncFromSynapse(syn, 'syn53174239', path="./")
xQTL association analysis — test data:
import synapseclient
import synapseutils
syn = synapseclient.Synapse()
syn.login("your username on synapse.org", "your password on synapse.org")
files = synapseutils.syncFromSynapse(syn, 'syn52369482', path="./")
Step 5. Run Your First Workflow#
Confirm SoS can see the pipelines:
sos run pipeline/1_xqtl_association.ipynb -h
You should see a list of workflow options. Now run a minimal cis-QTL scan using the demo data you just downloaded:
sos run pipeline/TensorQTL.ipynb cis \
--genotype-file data/example/genotype.bed \
--phenotype-file data/example/phenotype.bed.gz \
--covariate-file data/example/covariates.tsv \
--cwd output/demo_tensorqtl
Results land in output/demo_tensorqtl/. You now have a working environment and a known-good reference run to compare against when you bring in your own data.
Tip
Every pipeline supports -h and --help, and SoS prints the exact shell commands it runs under the hood — a great way to learn what’s happening and to debug failures.
Analysis#
Please visit the homepage of the protocol website for the general background on this resource, in particular the How to use the resource section. To perform a complete analysis from molecular phenotype quantification to xQTL discovery, conduct your analysis in the order listed below. Each link contains a mini-protocol for a specific task, and all commands should be executed from the command line.
Important
Minimum Working Example — new users, start here.
Every module ships a minimal test dataset (prefixed with MWE) under Synapse syn36416559. To go end-to-end on the demo data, run these five pipelines in order and skip everything else on the first pass:
1. reference_data.ipynb — prepare standardized reference files
2. bulk_expression.ipynb — quantify gene expression
3. genotype_preprocessing.ipynb → phenotype_preprocessing.ipynb → covariate_preprocessing.ipynb — QC and normalization
4. qtl_association_testing.ipynb — cis-QTL with TensorQTL
5. mnm_miniprotocol.ipynb — fine-mapping + TWAS with SuSiE
Once this pass completes, branch out to the additional modules below based on what your project needs.
1. Reference Data#
Multiple reference data files are required before molecular phenotypes can be quantified: reference genomes, gene annotations, variant annotations, linkage disequilibrium (LD) reference data, and topologically associating domains (TADs).
Reference data — overview and required input files ⭐ MWE
Reference data preparation — downloading and standardizing reference files
Generalized TAD boundaries — topologically associating domain annotations
LD reference pruning — pruned LD reference panels
RSS LD sketching — LD matrix sketches for summary-statistics methods
2. Molecular Phenotype Quantification#
Quantified molecular phenotypes are the inputs to QTL discovery. We support bulk RNA-seq expression, alternative splicing, and DNA methylation phenotypes. Gene expression is quantified with either RNA-SeQC (gene-level counts) or RSEM (transcript-level counts). Alternative splicing events are quantified with leafcutter2, which identifies alternatively excised introns. DNA methylation is quantified with SeSAMe. Each phenotype then undergoes phenotype-specific quality control and normalization.
Gene expression (RNA-seq) — RNA-SeQC or RSEM ⭐ MWE
Alternative splicing — leafcutter2
DNA methylation — SeSAMe
3. Data Pre-Processing#
Genotype preprocessing begins with variant-level filters applied with bcftools. VCF files are then converted to plink format so that kinship analysis can identify unrelated individuals; genetic principal components are computed for the unrelated samples, and genotype files are formatted for QTL analysis. Phenotype preprocessing begins with annotation of features, followed by imputation of missing entries and formatting. Covariate preprocessing merges known covariates with genetic principal components, then computes hidden factors from the phenotype data to use as additional covariates.
4. QTL Association Testing#
QTL association analysis is conducted with TensorQTL. We include options for cis or trans analysis, with options to include interaction terms. Hierarchical multiple testing may then be applied to adjust p-values.
QTL association testing ⭐ MWE
TensorQTL — cis/trans scans with optional interaction terms
Quantile regression QTL & TWAS — non-linear genotype-phenotype effects
Association post-processing — hierarchical multiple testing correction
5. Multivariate Mixture Model#
For multi-context or multi-tissue analyses, we provide a multivariate mixture model framework based on MASH. This learns a data-driven mixture prior across contexts and estimates effect sizes and posterior probabilities for sharing of eQTLs across tissues.
Multivariate mixture vignette — overview and walkthrough
Mixture prior estimation (MASH) — learn data-driven covariance matrices
MASH model fitting — fit the model and compute posterior summaries
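For orientation, the MASH model treats the estimated effects of variant $j$ across $R$ contexts as noisy observations of true effects drawn from a mixture of zero-centered multivariate normals; the covariance matrices $U_k$ (learned in the mixture prior estimation step) encode sharing patterns, and $\omega_l$ is a scaling grid. Notation below follows the original MASH paper (Urbut et al.), not this pipeline's code:

```latex
\hat{\beta}_j \mid \beta_j \sim N_R\!\left(\beta_j, \hat{S}_j\right),
\qquad
\beta_j \sim \sum_{k=1}^{K} \sum_{l=1}^{L} \pi_{k,l}\, N_R\!\left(0,\; \omega_l U_k\right)
```

The fitted mixture weights $\pi_{k,l}$ capture how often effects are shared, context-specific, or null, and posterior summaries from the fitted model give per-context effect estimates and measures of sharing.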
6. Multiomics Regression Models (Fine-mapping)#
Our pipeline includes multiple methods for fine-mapping of QTLs. Univariate fine-mapping and TWAS with SuSiE yield TWAS weights and credible sets. Regression with summary statistics allows GWAS summary statistics to be included in SuSiE fine-mapping. Univariate fine-mapping of functional data uses epigenomic annotations with fSuSiE.
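As a sketch of the underlying model (standard SuSiE notation from Wang et al., not pipeline-specific), SuSiE writes the genetic effect as a sum of $L$ "single effects", each of which selects exactly one variant:

```latex
y = X\beta + e, \qquad
\beta = \sum_{l=1}^{L} \beta_l, \qquad
\beta_l = \gamma_l\, b_l, \qquad
\gamma_l \sim \mathrm{Mult}(1, \pi), \quad
b_l \sim N(0, \sigma_{0l}^2)
```

Here $\gamma_l$ is an indicator vector over variants, so each single effect contributes one credible set, and the posterior mean of $\beta$ provides the TWAS weights.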
7. GWAS Integration#
We include methods for colocalization analysis, starting with the generation of prior probabilities followed by pairwise colocalization of xQTL and GWAS fine-mapping results to identify shared causal variants. We also include TWAS and cTWAS to identify genes associated with complex traits.
Colocalization (SuSiE-enloc) — pairwise xQTL-GWAS colocalization
TWAS & cTWAS — genes associated with complex traits
ColocBoost — shared-variant discovery across molecular traits
8. Enrichment and Validation#
We utilize an excess of overlap method to evaluate the enrichment of significant variants within specific genomic annotations. Pathway enrichment analysis identifies biological pathways that are statistically overrepresented in a given gene set. Stratified LD Score Regression (S-LDSC) quantifies the contribution of genomic functional annotations to heritability of complex traits. By integrating GWAS summary statistics with genome annotations, S-LDSC distinguishes true polygenic signals from confounding effects.
Excess-of-overlap enrichment — variant enrichment in genomic annotations
Gene set enrichment (GSEA) — overrepresented biological pathways
GREGOR — annotation-based enrichment for regulatory variants
Stratified LD Score Regression — heritability partitioning by annotation
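For orientation, S-LDSC models the expected GWAS chi-square statistic of variant $j$ as a linear function of its LD scores with respect to each annotation category $C$ (standard notation from Finucane et al., not pipeline-specific):

```latex
E\!\left[\chi_j^2\right] = N \sum_{C} \tau_C\, \ell(j, C) + N a + 1,
\qquad
\ell(j, C) = \sum_{k} a_C(k)\, r_{jk}^2
```

where $N$ is the GWAS sample size, $a_C(k)$ is the annotation value of variant $k$, $\tau_C$ is the per-variant contribution of category $C$ to heritability, and $a$ absorbs confounding such as population stratification; the fitted $\tau_C$ underlie the heritability-enrichment estimates.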
9. xQTL Modifier Score (EMS)#
The xQTL modifier score framework trains a per-variant model for prioritizing regulatory variants.
EMS training — fit the model using functional annotation features
EMS prediction — score new variants
10. Command Generator#
eQTL analysis command generator — produce full pipeline commands from a single configuration file
Software Environment#
Every protocol on this site runs inside the pixi environment configured in Steps 1-2. Once pixi and SoS are installed, each example “just works” — no per-pipeline container, no manual dependency wrangling.
Need something extra? Install it into the right pixi environment:
# Python package (into the shared python env)
pixi global install -c conda-forge --environment python <package>
# R package (into the r-base env)
pixi global install -c conda-forge --environment r-base r-<package>
# Standalone bioinformatics CLI tool
pixi global install -c bioconda <tool>
Troubleshooting#
Warning
R library conflicts. If you see an error like
Error in dyn.load(file, DLLpath = DLLPath, ...):
unable to load shared object '$PATH/R/x86_64-pc-linux-gnu-library/4.2/stringi/libs/stringi.so':
libicui18n.so.63: cannot open shared object file: No such file or directory
your system R libraries are being picked up alongside the pixi ones. Unset them before running the pipeline:
export R_LIBS=""
export R_LIBS_USER=""
pixi: command not found — open a new terminal, or re-source your shell rc file (source ~/.bashrc on Linux/HPC, source ~/.zshrc on macOS).
Installer killed on HPC — you’re on a login node. Request a compute node with ≥ 50 GB memory and re-run.
sos: command not found — Step 1 didn’t complete. Re-run the conda install command for SoS.
ModuleNotFoundError during a pipeline — install the missing package into pixi’s python env with the command above.
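When several of these symptoms show up at once, a quick audit of what is actually on your PATH usually pinpoints the culprit. This sketch uses only shell built-ins and standard utilities:

```shell
# Report where each tool the pipelines rely on resolves from
for tool in python R sos pixi; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool -> $(command -v "$tool")"
  else
    echo "$tool -> NOT FOUND (install it, or re-source your shell rc file)"
  fi
done
```

If a tool resolves to a system location rather than your conda or pixi environment, activate the environment (or fix your PATH exports) before re-running the pipeline.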
Still stuck? Open an issue with the command you ran and the full error output.
Analyses on High Performance Computing Clusters#
The demo on this page runs on a desktop workstation. Production analyses typically run on an HPC cluster, and SoS supports this natively via SoS Remote Tasks on configured host computers.
We provide a toy example for running SoS pipelines on a typical HPC cluster environment — first-time users are encouraged to work through it before launching real jobs. It covers the host and task configuration you'll reuse for every subsequent pipeline, and it's scheduler-agnostic (SLURM, LSF, SGE, and PBS/Torque all work).
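To give a taste of what that host configuration looks like, below is a minimal, illustrative ~/.sos/hosts.yml for a SLURM cluster. Hostnames, paths, queue names, and resource values are placeholders you must adapt, and the authoritative list of keys and template variables is in the SoS remote-task documentation:

```yaml
localhost: workstation
hosts:
  workstation:
    address: localhost
  my_cluster:                       # placeholder alias for your cluster
    address: login.cluster.example.edu
    queue_type: pbs                 # SoS's PBS-style engine also drives SLURM
    max_running_jobs: 50
    submit_cmd: sbatch {job_file}
    submit_cmd_output: "Submitted batch job {job_id}"
    status_cmd: squeue --job {job_id}
    kill_cmd: scancel {job_id}
    job_template: |
      #!/bin/bash
      #SBATCH --job-name={job_name}
      #SBATCH --time={walltime}
      #SBATCH --cpus-per-task={cores}
      #SBATCH --mem=32G
      #SBATCH --output={cur_dir}/{job_name}.out
      #SBATCH --error={cur_dir}/{job_name}.err
      cd {cur_dir}
      {command}
```

With a host configured, pipeline runs gain `-q my_cluster` (plus task options such as walltime and cores) and SoS handles submission, monitoring, and result collection for you.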