Alternative polyadenylation

Description

This mini-protocol turns aligned RNA-seq reads into an analysis-ready alternative polyadenylation (APA) phenotype matrix for apaQTL analysis. It chains two pipeline modules, reached through their pipeline/ symlinks. First, APA calling builds a 3’UTR reference, converts transcriptome BAM files to per-base coverage, and runs DaPars2 to quantify, per chromosome, a matrix of Percentage of Distal polyA site Usage Index (PDUI) values. Second, post-APA imputation and QC fills in missing PDUI values, applies quantile normalization, and optionally renames sample columns, so that the downstream covariate, association, and fine-mapping steps receive a complete matrix. Follow the steps in order; each is a single command on the toy data.
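
For orientation, the PDUI value that DaPars-style methods report for one gene in one sample is usually defined as the share of that gene's transcripts that use the distal polyA site. The symbols $w_L$ and $w_S$ (estimated abundance of the long and short 3’UTR isoforms) are introduced here for illustration and do not appear in the pipeline's options:

$$
\mathrm{PDUI} = \frac{w_L}{w_L + w_S}, \qquad 0 \le \mathrm{PDUI} \le 1 .
$$

As a worked example under these assumptions, $w_L = 30$ and $w_S = 10$ give $\mathrm{PDUI} = 30 / (30 + 10) = 0.75$, meaning three quarters of the transcripts carry the long 3’UTR. Values near 1 indicate mostly distal site usage; values near 0 indicate mostly proximal site usage.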

Input

| File | Description |
| --- | --- |
| Transcriptome BAM files | Per-sample RNA-seq alignments to the transcriptome, collected under output/rnaseq/bam. |
| Reference GTF (--hg-gtf) | Gene annotation used to derive the 3’UTR reference regions (e.g. chr22.gtf on the toy data). |
| Sample match table | Optional tab-delimited table mapping internal IDs to final sample names, used by the rename step. |
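
To make the role of the sample match table concrete, here is a minimal shell sketch of a header rename. It is not the pipeline's rename implementation. It assumes a two-column, tab-delimited match table (internal ID, then final sample name) with no header row and at least one entry; the function name rename_header is hypothetical.

```shell
# Hypothetical helper, not part of the pipeline: rewrite the header row of a
# tab-delimited matrix using a match table of "internal ID<TAB>final name".
# Assumes the match table is non-empty and has no header row.
rename_header() {
    match_table=$1
    matrix=$2
    awk -F'\t' -v OFS='\t' '
        NR == FNR { name[$1] = $2; next }      # first file: load ID -> name map
        FNR == 1 {                             # header row of the matrix
            for (i = 1; i <= NF; i++)
                if ($i in name) $i = name[$i]  # unmapped columns stay as-is
        }
        { print }                              # data rows pass through unchanged
    ' "$match_table" "$matrix"
}
```

Columns whose names are absent from the match table, such as a gene identifier column, are left untouched, so only sample columns change.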

Steps

i. Generate the 3’UTR reference regions from a GTF annotation:

Timing: Runtime varies with dataset size and compute resources. On the toy chr22 minimal working example (MWE) dataset, most steps complete in under 10 minutes on a standard HPC node.

sos run pipeline/apa_calling.ipynb UTR_reference \
    --cwd output/apa \
    --hg-gtf output/apa/chr22.gtf

ii. Convert the transcriptome BAM files into per-base coverage (wig) and read-depth (flagstat) files:

sos run pipeline/apa_calling.ipynb bam2tools \
    --cwd output/apa \
    --bam-dir output/rnaseq/bam

iii. Compile the DaPars2 sample configuration and mapping files:

sos run pipeline/apa_calling.ipynb APAconfig \
    --cwd output/apa \
    --bfile output/apa/wig \
    --annotation output/apa/chr22_3UTR.bed

iv. Use DaPars2 to quantify APA events (PDUI matrix per chromosome):

sos run pipeline/apa_calling.ipynb APAmain \
    --cwd output/apa \
    --chrlist chr22 \
    --dapars-path code/molecular_phenotypes/calling/apa/Dapars2_Multi_Sample.py

v. Impute missing values and run quality control on the PDUI matrix:

sos run pipeline/apa_impute.ipynb APAimpute \
    --cwd output/apa \
    --chrlist chr22
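
To gauge how much the imputation step has to fill in, it can help to count missing cells per row in the pre-imputation matrix. The sketch below assumes a tab-delimited matrix with one header row, a row identifier in column 1, and missing values written as NA; the protocol text does not confirm these conventions, and count_missing is a hypothetical name.

```shell
# Hypothetical helper: for each data row of a tab-delimited matrix, print
# the row ID, the number of NA cells, and the number of cells examined.
# Assumes: one header row, row ID in column 1, missing values encoded as "NA".
count_missing() {
    awk -F'\t' -v OFS='\t' '
        NR == 1 { next }                 # skip the header row
        {
            n = 0
            for (i = 2; i <= NF; i++)    # every column after the row ID
                if ($i == "NA") n++
            print $1, n, NF - 1
        }
    ' "$1"
}
```

Rows where nearly every cell is NA carry little information even after imputation, so a quick look at this summary is a reasonable sanity check before moving on.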

vi. Optionally rename the sample columns of the imputed PDUI matrix using a match table:

sos run pipeline/apa_impute.ipynb APArename \
    --cwd output/apa \
    --chrlist chr22 \
    --match input/covariate/protocol_example.apa_matchtable.txt

Output

| File | Description |
| --- | --- |
| *_3UTR.bed | 3’UTR reference regions extracted from the annotation, used by DaPars2. |
| Coverage (wig) and flagstat files | Per-base read coverage and read-depth summaries derived from the transcriptome BAMs. |
| PDUI matrix (per chromosome) | DaPars2 quantification of distal polyA site usage, samples in columns. |
| Imputed PDUI matrix | The PDUI matrix after missing-value imputation and quantile normalization; this is the APA phenotype table used for apaQTL analysis. |

Anticipated Results

Every command above writes its results under the working directory given by --cwd, which is output/apa, the output/ subdirectory for this workflow. After each step, verify success by checking that the corresponding files listed in the Output section exist and are non-empty.
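
The existence and non-empty check described under Anticipated Results can be scripted. This is a minimal sketch rather than part of the pipeline; check_nonempty is a hypothetical helper that takes explicit file paths, so it makes no assumption about how the workflow names its outputs.

```shell
# Hypothetical helper: report each path as ok, EMPTY, or MISSING, and
# return non-zero if any path is missing or has zero size.
check_nonempty() {
    rc=0
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "MISSING $f"; rc=1
        elif [ ! -s "$f" ]; then
            echo "EMPTY $f"; rc=1
        else
            echo "ok $f"
        fi
    done
    return "$rc"
}
```

For example, running check_nonempty output/apa/chr22_3UTR.bed confirms that the 3’UTR reference passed to the APAconfig step is in place before that step runs.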