Biomarker discovery

The biomarker pipeline supports translational workflows: feature selection from expression data, reduction to a predictive gene panel, and cross-validated classifier performance.

Inputs

Expression matrix in AnnData (cells or samples as observations)
Class labels in obs (default column name: condition; configurable per run)
At least two classes, each with enough samples to be represented in every cross-validation fold
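
The label requirements above can be checked before a run. This is a minimal sketch, not the pipeline's own validation: `check_labels` and `min_per_class` are hypothetical names, and the threshold of 3 is an assumption, since no minimum is stated here. It accepts any sequence of labels, for example the values of `adata.obs["condition"]`.

```python
from collections import Counter

def check_labels(labels, min_per_class=3):
    """Return per-class counts, raising if the labels cannot support classification.

    `min_per_class` is an assumed threshold used for illustration only.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError(f"need at least two classes, found {len(counts)}")
    too_small = {c: n for c, n in counts.items() if n < min_per_class}
    if too_small:
        raise ValueError(f"classes below {min_per_class} samples: {too_small}")
    return dict(counts)

# Toy labels standing in for adata.obs["condition"].
counts = check_labels(["case", "case", "case", "control", "control", "control"])
print(counts)
```

A sensible value for the threshold depends on the number of CV folds: with stratified k-fold, each class needs at least k samples to appear in every fold.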

Pipeline stages

Optional WGCNA

Weighted gene co-expression network analysis (WGCNA) runs on bulk-style matrices. On single-cell data it is skipped automatically, because module structure is unreliable at cell-level sparsity.

Feature selection

Method	Description
mRMR	Minimum redundancy, maximum relevance; the input gene set is capped in size for performance
Random forest	Importance ranking; used automatically for large gene sets
Combined	Merges rankings from multiple methods
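
One way to merge rankings from several methods is mean-rank aggregation. The sketch below is an assumption about how such a merge could work, not a description of the Combined method's actual rule. `combine_rankings` is a hypothetical helper, the gene names are placeholders, and the penalty rank for genes missing from a list is an illustrative choice.

```python
def combine_rankings(rankings):
    """Merge ranked gene lists (best first) by mean rank; lower is better.

    A gene absent from a list receives that list's length + 1 as its rank.
    Ties are broken alphabetically so the output is deterministic.
    """
    genes = set().union(*rankings)
    mean_rank = {}
    for gene in genes:
        ranks = []
        for ranking in rankings:
            position = {g: i + 1 for i, g in enumerate(ranking)}
            ranks.append(position.get(gene, len(ranking) + 1))
        mean_rank[gene] = sum(ranks) / len(ranks)
    return sorted(mean_rank.items(), key=lambda item: (item[1], item[0]))

# Placeholder rankings from two methods.
mrmr_ranking = ["G1", "G2", "G3", "G4"]
rf_ranking = ["G2", "G1", "G4", "G5"]

combined = combine_rankings([mrmr_ranking, rf_ranking])
for gene, score in combined:
    print(gene, score)
```

Mean rank depends only on ordering, so it avoids comparing raw scores that sit on different scales, such as mRMR scores and random forest importances.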

Classification

Trains a classifier (SVM, k-NN, or random forest) with cross-validation and reports accuracy, F1, and confusion matrix summaries.
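
A cross-validated evaluation of this kind can be sketched with scikit-learn. The data here are synthetic, and the fold count, linear kernel, and macro-averaged F1 are assumptions made for the example; the pipeline's actual settings are not specified in this section.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a samples x selected-genes matrix and its class labels.
X, y = make_classification(n_samples=60, n_features=20, n_informative=5, random_state=0)

# Scaling sits inside the pipeline so it is refit on each training fold only.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Per-fold accuracy and macro F1.
scores = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "f1_macro"])

# Out-of-fold predictions give one confusion matrix over all samples.
y_pred = cross_val_predict(model, X, y, cv=cv)
cm = confusion_matrix(y, y_pred)

print("accuracy per fold:", np.round(scores["test_accuracy"], 3))
print("macro F1 per fold:", np.round(scores["test_f1_macro"], 3))
print("confusion matrix (rows = true, columns = predicted):")
print(cm)
```

Swapping `SVC` for `KNeighborsClassifier` or `RandomForestClassifier` leaves the rest of the sketch unchanged. Note that if genes are selected on the full dataset before this step, the CV scores are optimistic; placing the selection step inside the pipeline keeps it within each training fold.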

Results

Analyze → Biomarker Results shows:

Ranked selected genes with scores
Classifier performance metrics across CV folds
Links from Runs and Interpret

KnowSeq alignment

The current implementation covers feature selection and ML classification. Not yet implemented:

Coverage-style multi-class DEG extraction across all pairwise comparisons
Consistency selection across resampling
Disease evidence retrieval from external databases

Tips

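Prefer pseudobulk or sample-level aggregation when biological replicates matter; treating each cell as an independent sample inflates performance estimates, because cells from the same sample can land in both training and test folds.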
Check class balance in Explore before running; severe imbalance affects CV metrics.
Use enrichment on the selected gene list in a follow-up DE/enrichment run for pathway context.
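
The pseudobulk aggregation recommended in the first tip can be sketched with NumPy as follows. `pseudobulk` is a hypothetical helper, not a pipeline function, and it assumes a dense count matrix; a sparse `adata.X` would need to be converted or aggregated with sparse operations instead.

```python
import numpy as np

def pseudobulk(counts, sample_ids, labels):
    """Sum cell-level counts per sample and keep one class label per sample.

    counts: (n_cells, n_genes) array; sample_ids and labels: length n_cells.
    """
    counts = np.asarray(counts)
    sample_ids = np.asarray(sample_ids)
    labels = np.asarray(labels)
    samples, inverse = np.unique(sample_ids, return_inverse=True)
    bulk = np.zeros((len(samples), counts.shape[1]), dtype=counts.dtype)
    np.add.at(bulk, inverse, counts)  # add each cell's row into its sample's row
    sample_labels = []
    for s in samples:
        found = np.unique(labels[sample_ids == s])
        if len(found) != 1:
            raise ValueError(f"sample {s!r} has mixed labels: {found.tolist()}")
        sample_labels.append(found[0])
    return samples, bulk, np.array(sample_labels)

# Four toy cells from two samples, two genes each.
cell_counts = [[1, 0], [2, 1], [0, 3], [4, 4]]
cell_samples = ["s1", "s1", "s2", "s2"]
cell_labels = ["case", "case", "control", "control"]

samples, bulk, sample_labels = pseudobulk(cell_counts, cell_samples, cell_labels)
print(samples, bulk, sample_labels)
```

After aggregation each observation is one biological sample, so cross-validation splits fall between samples rather than between cells of the same sample.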
