Methylation Data Preprocessing
Description
This mini-protocol turns raw DNA-methylation array data into an analysis-ready methylation phenotype matrix for methylation-QTL (mQTL) analysis. It chains two pipeline modules. First, methylation calling reads the raw Illumina IDAT intensity files with the SeSAMe package and computes a normalized methylation level (M-value) for every probe in every sample. Second, because array data inevitably contain probes that fail in some samples, a missing-value imputation step fills the resulting gaps so that downstream tools receive a complete matrix. Follow the two steps in order; each is a single command on the toy data.
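For readers unfamiliar with the term, the M-value mentioned above is conventionally defined as the base-2 logit of the beta value (the methylated fraction of signal at a probe). The formulas below are that textbook definition, not a statement of exactly what the pipeline computes; any offset, masking, or clipping it applies is not described in this section. The symbols $I_{\mathrm{meth}}$ and $I_{\mathrm{unmeth}}$ (methylated and unmethylated signal intensities) are notation introduced here for illustration.

```latex
\beta = \frac{I_{\mathrm{meth}}}{I_{\mathrm{meth}} + I_{\mathrm{unmeth}}},
\qquad
M = \log_2\!\left(\frac{\beta}{1-\beta}\right)
  = \log_2\!\left(\frac{I_{\mathrm{meth}}}{I_{\mathrm{unmeth}}}\right)
```

Under this definition a probe with $\beta = 0.8$ has $M = \log_2(0.8/0.2) = \log_2 4 = 2$, a half-methylated probe ($\beta = 0.5$) has $M = 0$, and $\beta = 0.2$ gives $M = -2$, so M-values are symmetric around zero and unbounded.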
Input
| File | Description |
|---|---|
| IDAT files | Raw Illumina methylation-array intensity files, one pair (Grn/Red) per sample, collected under a single directory. |
| Sample sheet ( | Tab-delimited sheet describing each sample and pointing to its IDAT files. The |
| Container ( | Singularity image providing SeSAMe and its dependencies. |
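Because each sample needs both a Grn and a Red file, it can help to confirm the IDAT directory is complete before starting a long run. The sketch below is not part of the pipeline; it assumes the common Illumina naming scheme `<prefix>_Grn.idat` / `<prefix>_Red.idat` with uncompressed files, and the function name `check_idat_pairs` is hypothetical.

```shell
# Report IDAT files whose Grn/Red partner is missing.
# Assumption: files are named <prefix>_Grn.idat and <prefix>_Red.idat.
check_idat_pairs() {
    local dir="$1" missing=0 f partner
    for f in "$dir"/*_Grn.idat "$dir"/*_Red.idat; do
        [ -e "$f" ] || continue              # skip a glob that matched nothing
        case "$f" in
            *_Grn.idat) partner="${f%_Grn.idat}_Red.idat" ;;
            *)          partner="${f%_Red.idat}_Grn.idat" ;;
        esac
        if [ ! -e "$partner" ]; then
            echo "unpaired: $(basename "$f")"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]                     # exit status 0 only if every file is paired
}
```

Call it as `check_idat_pairs path/to/idat_dir`, where the path is a placeholder for your own IDAT directory; a non-zero exit status means at least one sample is incomplete.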
Steps
i. Methylation calling
Call methylation levels from the collection of IDAT files with SeSAMe. The sesame workflow reads the sample sheet, locates each sample’s IDAT pair, and writes a per-probe methylation (M-value) matrix as a bed.gz phenotype file under the directory given by --cwd. The first command below runs locally; the second submits the same job to an HPC queue by adding -q csg -c csg2.yml. On the toy dataset this step takes roughly 26 minutes and uses up to about 15 GB of memory.
sos run pipeline/methylation_calling.ipynb sesame \
--sample-sheet input_data/Methylation/xqtl_protocol_data_arrayMethylation_covariates.tsv \
--container containers/methylation.sif --sample_sheet_header_rows 0 --cwd output_rerun/methylation/
sos run pipeline/methylation_calling.ipynb sesame \
--sample-sheet input_data/Methylation/xqtl_protocol_data_arrayMethylation_covariates.tsv \
--container containers/methylation.sif --sample_sheet_header_rows 0 --cwd output_rerun/methylation/ -q csg -c csg2.yml -J 1 &
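Once the run finishes, a quick look at the matrix dimensions is a cheap sanity check that every sample in the sheet was processed. The sketch below is an illustration, not a pipeline step. It assumes the bed.gz file is tab-delimited with one header line and four leading annotation columns (#chr, start, end, ID) followed by one column per sample; that layout and the function name `bed_dims` are assumptions.

```shell
# Print the number of probes (rows) and samples (columns) in a phenotype bed.gz.
# Assumption: one header line, then four annotation columns before the samples.
bed_dims() {
    gzip -dc "$1" | awk -F'\t' '
        NR == 1 { samples = NF - 4; next }   # header line: count sample columns
        { probes++ }                          # every remaining line is one probe
        END { printf "probes=%d samples=%d\n", probes, samples }'
}
```

Running `bed_dims` on the `.sesame.M.bed.gz` file written by this step should report a sample count equal to the number of rows in the sample sheet.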
ii. Missing-value imputation
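Array methylation matrices typically contain probes with missing (NA) values in some samples. The bed_filter_na workflow fills these gaps with soft-impute, producing a complete matrix that the downstream covariate, association, and fine-mapping steps can consume. Pass the M-value matrix from step i as --phenoFile; the command below reads it from output/methylation/, so if step i was run with --cwd output_rerun/methylation/ as shown above, point --phenoFile at that directory instead. On the toy dataset this step takes about 2.5 minutes and uses up to about 8 GB of memory.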
sos run pipeline/phenotype_imputation.ipynb bed_filter_na \
--phenoFile output/methylation/xqtl_protocol_data_arrayMethylation_covariates.sesame.M.bed.gz \
--cwd ./output/methylation/
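To confirm that imputation did what was intended, one option is to count the NA cells in the matrix before and after this step. The following sketch is not part of the pipeline; it assumes a tab-delimited bed.gz with one header line, four annotation columns, and missing values written literally as NA. The function name `count_na` is hypothetical.

```shell
# Count cells equal to "NA" in the sample columns of a phenotype bed.gz.
# Assumption: sample values start at column 5 and missing values are the string NA.
count_na() {
    gzip -dc "$1" | awk -F'\t' '
        NR > 1 { for (i = 5; i <= NF; i++) if ($i == "NA") n++ }
        END { print n + 0 }'                  # n + 0 prints 0 when nothing was counted
}
```

Applied to the step i matrix this typically gives a positive count; applied to the imputed file written by this step it should print 0.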
Output
| File | Description |
|---|---|
| | Per-probe methylation (M-value) matrix produced by the calling step, samples in columns and probes in rows. |
| Imputed | The same matrix after soft-impute, with no remaining missing values; this is the methylation phenotype table used for mQTL analysis. |
Anticipated Results
The pipeline produces a complete, normalized DNA-methylation phenotype matrix (*.sesame.M.bed.gz, after imputation) covering all probes across all samples in the sample sheet. This matrix is the methylation equivalent of the expression bed.gz files and serves as the phenotype input to the covariate-preprocessing and QTL-association sections of the protocol.