snRNA-seq Preprocessing#
Description#
This pipeline preprocesses single-nuclei RNA-seq data in two stages. First it performs quality control with SCTK and Seurat: removing low-quality cells by mitochondrial content, total counts (nUMI) and detected genes (nFeature), filtering ambient RNA contamination with decontX, and removing doublets with a user-selected method (scds, scrublet, doubletFinder, or doubletCells). It then annotates cell types on the QC-filtered data using SingleR against a reference, transferring cluster-majority labels to the full Seurat object so the result is ready for pseudobulk aggregation.
Input Files#
File |
Description |
|---|---|
|
CellRanger output directory (contains the per-sample counts folder with |
|
Sample barcode-to-patient ID mapping (columns: |
|
SingleR reference SummarizedExperiment object (used in Step 2) |
Steps#
Step 1. QC filtering (sctk_qc)#
Runs SCTK/Seurat QC on the CellRanger counts: it drops cells failing the mitochondrial-percent, minimum-count (nUMI) and minimum-gene (nFeature) thresholds, removes ambient RNA with decontX, and removes doublets with the chosen method (default scds). The sample-meta CSV maps cell barcodes to individual IDs. The result is a QC-filtered, clustered Seurat object.
sos run pipeline/snRNAseq_preprocessing.ipynb sctk_qc \
--input-dir input/snrnaseq/protocol_example.snrnaseq.cellranger \
--output-dir output/snrna_seq \
--sample-meta input/snrnaseq/protocol_example.snrnaseq.id_mapping.csv
Step 2. Cell type annotation (cell_annotation)#
Annotates cell types on the QC-filtered object using SingleR against the reference SummarizedExperiment. Cells are downsampled per cluster (default 1000) for the SingleR call, then a cluster-majority label is assigned when at least cluster_threshold (default 0.90) of the cluster agrees, and transferred back to the full Seurat object.
sos run pipeline/snRNAseq_preprocessing.ipynb cell_annotation \
--sctk-rds output/snrna_seq/SCTK_results/filtered_seuratobj.rds \
--output-dir output/snrna_seq \
--seurat-ref input/snrnaseq/protocol_example.snrnaseq.seurat_ref_SE.rds
Output Files#
File |
Description |
|---|---|
|
QC-filtered, clustered Seurat object (from Step 1) |
|
Per-sample SCTK QC metrics |
|
Cell/gene counts at each QC step |
|
Downsampled Seurat object with per-cell SingleR labels (from Step 2) |
|
Full Seurat object with cluster-majority cell type labels (from Step 2) |
Anticipated Results#
Step 1 yields a QC-filtered Seurat object plus QC tables summarizing how many cells and genes survive each filtering step. Step 2 adds cell type labels, producing a celltype-labeled Seurat object ready for downstream pseudobulk aggregation.
[global]
parameter: cwd = path("output")
parameter: job_size = 1
parameter: walltime = "5h"
parameter: mem = "16G"
parameter: numThreads = 8
parameter: container = ""
import re
from sos.utils import expand_size
) if container else ""
cwd = path(f'{cwd:a}')
[sctk_qc]
parameter: input_dir = str
parameter: output_dir = str
parameter: sample_meta = str
parameter: max_mito_percent = 5
parameter: min_ncounts = 250
parameter: min_ngenes = 500
parameter: doublet_method = 'scds' # options: 'scds', 'scrublet', 'doubletFinder', 'doubletCells', 'none'
parameter: scds_call = 'Singlet'
parameter: scrublet_threshold = 0.25
parameter: ambient_method = 'decontX' # options: 'decontX', 'none'
parameter: max_decontXsc = 0.5
import os
input: []
output: f'{output_dir}/SCTK_results/filtered_seuratobj.rds',
f'{output_dir}/QC_table/SCTK_QC_table.csv'
task: trunk_workers = 1, trunk_size = 1, walltime = '36:00:00', mem = '256G', cores = 16
R: expand = "${ }", stdout = f'{_output[0]:n}.stdout', stderr = f'{_output[0]:n}.stderr'
options(future.globals.maxSize = 8 * 1024^3)
library(future)
plan(multicore, workers = 16)
library(Seurat)
library(singleCellTK)
library(tidyverse)
library(SummarizedExperiment)
library(scuttle)
library(dplyr)
library(SCOPfunctions)
input_dir <- "${input_dir}"
qc_table_out <- "${_output[1]}"
rds_out <- "${_output[0]}"
doublet_method <- "${doublet_method}"
ambient_method <- "${ambient_method}"
# Validate method choices
valid_doublet <- c("scds", "scrublet", "doubletFinder", "doubletCells", "none")
valid_ambient <- c("decontX", "none")
if (!doublet_method %in% valid_doublet)
stop("Invalid doublet_method '", doublet_method, "'. Choose from: ", paste(valid_doublet, collapse = ", "))
if (!ambient_method %in% valid_ambient)
stop("Invalid ambient_method '", ambient_method, "'. Choose from: ", paste(valid_ambient, collapse = ", "))
message("Doublet detection method : ", doublet_method)
message("Ambient RNA removal method: ", ambient_method)
# ── Create output directories ─────────────────────────────────
dir.create(dirname(rds_out), showWarnings = FALSE, recursive = TRUE)
dir.create(dirname(qc_table_out), showWarnings = FALSE, recursive = TRUE)
# ── QC summary table ──────────────────────────────────────────
qc_summary <- data.frame(Step = character(),
Cells = numeric(), Genes = numeric(),
stringsAsFactors = FALSE)
update_summary <- function(step, obj) {
qc_summary <<- rbind(qc_summary, data.frame(
Step = step,
Cells = ncol(obj), Genes = nrow(obj)))
}
# ── Sample metadata ───────────────────────────────────────────
samplelist <- read.csv("${sample_meta}")
samplelist$column_name <- paste0(samplelist$libraryBatch, "-counts_", samplelist$cellBarcode)
samplelist <- samplelist[, c("individualID", "column_name")]
colnames(samplelist) <- c("sample", "column_name")
# ── Helper: convert SCE to Seurat ─────────────────────────────
convert_to_seurat <- function(sce) {
obj <- CreateSeuratObject(counts = counts(sce),
meta.data = as.data.frame(colData(sce)),
min.cells = ncol(sce) * 0.005)
update_summary("Removing Genes expressed in < 0.5% of cells", obj)
return(obj)
}
# ── Helper: assign sample IDs, convert back to SCE ────────────
process_sce <- function(seurat_obj, samplelist) {
seurat_obj <- seurat_obj[, colnames(seurat_obj) %in% samplelist$column_name]
update_summary("Removing Cells with no associated Patient ID", seurat_obj)
seurat_obj@meta.data$sample <- NULL
seurat_obj@meta.data <- seurat_obj@meta.data %>%
inner_join(samplelist, by = "column_name")
seurat_obj@meta.data <- seurat_obj@meta.data[
!duplicated(seurat_obj@meta.data$column_name), ]
rownames(seurat_obj@meta.data) <- seurat_obj@meta.data$column_name
return(as.SingleCellExperiment(seurat_obj))
}
# ── Helper: build QC algorithms list from chosen methods ──────
# Doublet detection methods (runCellQC YAML names):
# "scds" → cxds_bcds_hybrid (cxds + bcds hybrid score)
# "scrublet" → scrublet
# "doubletFinder" → doubletFinder
# "doubletCells" → doubletCells (scran-based)
# "none" → skip doublet detection
# Ambient RNA methods:
# "decontX" → decontX
# "none" → skip ambient RNA removal
build_algorithms <- function(doublet_method, ambient_method) {
algos <- "QCMetrics"
if (doublet_method == "scds") algos <- c(algos, "cxds_bcds_hybrid")
if (doublet_method == "scrublet") algos <- c(algos, "scrublet")
if (doublet_method == "doubletFinder") algos <- c(algos, "doubletFinder")
if (doublet_method == "doubletCells") algos <- c(algos, "doubletCells")
if (ambient_method == "decontX") algos <- c(algos, "decontX")
return(algos)
}
# ── Helper: build filter strings from chosen methods ──────────
build_filters <- function(doublet_method, ambient_method) {
filters <- c(
"mito_percent < ${max_mito_percent}",
"total > ${min_ncounts}",
"detected > ${min_ngenes}"
)
if (doublet_method == "scds") {
filters <- c(filters, "scds_hybrid_call=='${scds_call}'")
} else if (doublet_method == "scrublet") {
filters <- c(filters, "scrublet_score < ${scrublet_threshold}")
} else if (doublet_method == "doubletFinder") {
filters <- c(filters, "doubletFinder_doublet_label == 'Singlet'")
} else if (doublet_method == "doubletCells") {
filters <- c(filters, "scDblFinder_class == 'singlet'")
}
if (ambient_method == "decontX") {
filters <- c(filters, "decontX_contamination < ${max_decontXsc}")
}
return(filters)
}
# ── Helper: run SCTK-QC, filter, cluster ─────────────────────
process_sce_to_seurat <- function(sce) {
algos <- build_algorithms(doublet_method, ambient_method)
message("Running QC algorithms: ", paste(algos, collapse = ", "))
sce <- runCellQC(sce, sample = NULL,
algorithms = algos,
mitoRef = "human",
mitoIDType = "ensembl",
mitoGeneLocation = "rownames",
seed = 12345)
# ── Write per-sample QC table ─────────────────────────────
sce_temp <- sampleSummaryStats(sce, sample = colData(sce)$sample, simple = FALSE)
sst <- getSampleSummaryStatsTable(sce_temp, statsName = "qc_table")
write.csv(sst, qc_table_out)
rm(sce_temp)
# ── Apply filters ─────────────────────────────────────────
qc_filters <- build_filters(doublet_method, ambient_method)
message("Applying filters:")
for (f in qc_filters) message(" ", f)
sce <- subsetSCECols(sce, colData = qc_filters)
update_summary("SCTK-QC Results", sce)
# ── Convert to Seurat and cluster ─────────────────────────
seurat_obj <- CreateSeuratObject(counts = counts(sce),
meta.data = as.data.frame(colData(sce)),
min.cells = ncol(sce) * 0.005)
update_summary("Final Seurat Object", seurat_obj)
seurat_obj <- NormalizeData(seurat_obj, normalization.method = "LogNormalize", scale.factor = 10000)
seurat_obj <- FindVariableFeatures(seurat_obj)
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj)
seurat_obj <- RunUMAP(seurat_obj, dims = 1:10, seed.use = 1)
seurat_obj <- FindNeighbors(seurat_obj)
seurat_obj <- FindClusters(seurat_obj)
return(seurat_obj)
}
# ── 1. Import ─────────────────────────────────────────────────
sce <- importCellRanger(cellRangerDirs = input_dir)
update_summary("Raw Data", sce)
# ── 2. Remove fake mapping genes ──────────────────────────────
fake_genes <- grep("^(1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|X|Y)_",
rownames(sce), value = TRUE)
sce <- sce[setdiff(rownames(sce), fake_genes), ]
update_summary("Removing FMGs", sce)
# ── 3. Convert → filter → QC → cluster ───────────────────────
seurat_obj <- convert_to_seurat(sce); rm(sce); gc()
processed_sce <- process_sce(seurat_obj, samplelist); rm(seurat_obj); gc()
processed_seurat <- process_sce_to_seurat(processed_sce)
# ── 4. Save RDS ───────────────────────────────────────────────
saveRDS(processed_seurat, rds_out)
message("Saved: ", rds_out)
rm(processed_seurat); gc()
qc_csv <- file.path(dirname(rds_out), "QC_summary.csv")
write.csv(qc_summary, qc_csv, row.names = FALSE)
message("QC summary saved: ", qc_csv)
[cell_annotation]
parameter: sctk_rds = str
parameter: output_dir = str
parameter: seurat_ref = str
parameter: downsample_n = 1000
parameter: cluster_threshold = 0.90
import os
input: []
output: f'{output_dir}/celltyped_seuratobj.rds'
task: trunk_workers = 1, trunk_size = 1, walltime = '36:00:00', mem = '256G', cores = 16
R: expand = "${ }", stdout = f'{_output:n}.stdout', stderr = f'{_output:n}.stderr'
options(future.globals.maxSize = 16 * 1024^3)
library(future)
plan(multicore, workers = 16)
library(Seurat)
library(singleCellTK)
library(tidyverse)
library(SummarizedExperiment)
library(scuttle)
library(SingleR)
library(org.Hs.eg.db)
library(dplyr)
sctk_file <- "${sctk_rds}"
output_dir <- "${output_dir}"
downsample_n <- ${downsample_n}
cluster_threshold <- ${cluster_threshold}
anno_file <- file.path(output_dir, "annotated_seuratobj.rds")
out_file <- "${_output}"
# ── Create output directory ───────────────────────────────────
dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
get_most_common_label <- function(cell_types, threshold = cluster_threshold) {
label_table <- table(cell_types)
most_common <- max(label_table)
if ((most_common / sum(label_table)) >= threshold) {
return(names(which.max(label_table)))
} else {
return(NA)
}
}
# 1. Load and downsample
seuratObj <- readRDS(sctk_file)
seuratObj <- subset(x = seuratObj, downsample = downsample_n)
# 2. Extract counts, strip Ensembl version suffixes
sce_count <- GetAssayData(seuratObj[["RNA"]], layer = "counts")
rownames(sce_count) <- gsub("\\..*$", "", rownames(sce_count))
rm(seuratObj); gc()
# 3. Ensembl to HGNC symbol conversion
IDs <- select(org.Hs.eg.db,
keys = rownames(sce_count),
columns = c("ENSEMBL", "SYMBOL"),
keytype = "ENSEMBL")
colnames(IDs) <- c("ensembl_gene_id", "hgnc_symbol")
IDs <- IDs[!is.na(IDs$hgnc_symbol) & IDs$hgnc_symbol != "", ]
IDs <- IDs[!duplicated(IDs$hgnc_symbol), ]
IDs <- IDs[!duplicated(IDs$ensembl_gene_id), ]
sce_count <- sce_count[rownames(sce_count) %in% IDs$ensembl_gene_id, ]
rownames(IDs) <- IDs$ensembl_gene_id
rownames(sce_count) <- IDs[rownames(sce_count), "hgnc_symbol"]
rm(IDs); gc()
# 4. Load reference, intersect, free reference immediately after
seurat_ref_SE <- readRDS("${seurat_ref}")
common_genes <- intersect(rownames(sce_count), rownames(seurat_ref_SE))
ref_subset <- seurat_ref_SE[common_genes, ]
rm(seurat_ref_SE); gc()
sce_count <- sce_count[common_genes, ]
sce_SE <- SummarizedExperiment(assays = list(counts = sce_count))
sce_SE <- logNormCounts(sce_SE)
rm(sce_count); gc()
# 5. SingleR annotation
singleR_res <- SingleR(test = sce_SE,
ref = ref_subset,
labels = ref_subset$ref_label)
rm(sce_SE, ref_subset); gc()
# 6. Add labels to downsampled object
seuratObj <- readRDS(sctk_file)
seuratObj <- subset(x = seuratObj, downsample = downsample_n)
anno_df <- data.frame(celltype = singleR_res$labels,
column_name = rownames(singleR_res),
stringsAsFactors = FALSE)
seuratObj@meta.data <- seuratObj@meta.data %>%
inner_join(anno_df, by = "column_name")
rownames(seuratObj@meta.data) <- seuratObj@meta.data$column_name
saveRDS(seuratObj, anno_file)
message("Saved annotated (downsampled): ", anno_file)
# 7. Transfer labels to full object
full_obj <- readRDS(sctk_file)
full_obj@meta.data$cell <- rownames(full_obj@meta.data)
singleR_data <- data.frame(
celltype = seuratObj@meta.data$celltype,
seurat_clusters = seuratObj@meta.data$seurat_clusters
)
rm(seuratObj); gc()
most_common_labels <- singleR_data %>%
group_by(seurat_clusters) %>%
summarise(most_common_label = get_most_common_label(celltype), .groups = "drop")
merged_meta <- merge(full_obj@meta.data, most_common_labels,
by = "seurat_clusters", all.x = TRUE)
ordered_meta <- merged_meta[match(colnames(full_obj@assays$RNA), merged_meta$cell), ]
full_obj@meta.data <- ordered_meta
rownames(full_obj@meta.data) <- full_obj@meta.data$cell
full_obj@meta.data$celltype <- full_obj@meta.data$most_common_label
Idents(full_obj) <- full_obj@meta.data$celltype
saveRDS(full_obj, out_file)
message("Saved celltyped: ", out_file)
rm(full_obj, singleR_res, anno_df, singleR_data,
most_common_labels, merged_meta, ordered_meta); gc()