Biomarker discovery
The biomarker pipeline supports translational workflows: selecting features from expression data, reducing them to a predictive gene panel, and evaluating classifier performance with cross-validation.
Inputs
- Expression matrix in AnnData (cells or samples as observations)
- Class labels in `obs` (default column name `condition`, configurable per run)
- At least two classes with sufficient samples per class
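A quick pre-flight check on the labels can catch input problems before a run. The sketch below is illustrative only: `check_labels` and the `min_per_class` threshold are hypothetical names rather than part of the pipeline, and what counts as "sufficient" samples is an assumption. With AnnData, the labels would typically come from `adata.obs["condition"]`.

```python
from collections import Counter

def check_labels(labels, min_per_class=3):
    """Return per-class counts, raising if the label set is unusable.

    `min_per_class` is an assumed threshold; choose one that suits the
    number of cross-validation folds you plan to use.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError(f"need at least two classes, found {len(counts)}")
    too_small = {c: n for c, n in counts.items() if n < min_per_class}
    if too_small:
        raise ValueError(f"classes below {min_per_class} samples: {too_small}")
    return dict(counts)

# Toy labels standing in for adata.obs["condition"]
labels = ["tumor"] * 5 + ["normal"] * 4
counts = check_labels(labels)
print(counts)
```

A class that falls below the threshold raises a `ValueError` naming the offending classes, which is easier to act on than a failure deep inside cross-validation.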
Pipeline stages
Optional WGCNA
Co-expression network analysis runs on bulk-style matrices. On single-cell data WGCNA is skipped automatically because module structure is unreliable at cell-level sparsity.
Feature selection
| Method | Description |
|---|---|
| mRMR | Minimum redundancy, maximum relevance; input size is capped for performance |
| Random forest | Importance ranking; used automatically for large gene sets |
| Combined | Merges rankings from multiple methods |
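How the Combined method merges rankings is not specified here. One common approach is mean-rank aggregation, sketched below as an assumption rather than a description of the pipeline's behavior. The function name `combine_rankings`, the penalty for missing genes, and the toy gene lists are all illustrative.

```python
def combine_rankings(rankings):
    """Merge ranked gene lists by mean rank (lower is better).

    A gene missing from a list is assigned that list's worst rank + 1,
    so methods that did not select it count against it.
    """
    genes = sorted({g for r in rankings for g in r})
    mean_rank = {}
    for gene in genes:
        ranks = [r.index(gene) + 1 if gene in r else len(r) + 1 for r in rankings]
        mean_rank[gene] = sum(ranks) / len(ranks)
    # Sort by mean rank, then by name so ties are broken deterministically
    return sorted(genes, key=lambda g: (mean_rank[g], g))

# Toy rankings standing in for mRMR and random forest output
mrmr_rank = ["TP53", "BRCA1", "MYC", "EGFR"]
rf_rank = ["BRCA1", "TP53", "KRAS", "MYC"]
combined = combine_rankings([mrmr_rank, rf_rank])
print(combined)
```

Mean rank needs no score normalization across methods, which is why it is a frequent default when the underlying scores are on different scales.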
Classification
Trains a classifier (SVM, k-NN, or random forest) with cross-validation and reports accuracy, F1, and confusion matrix summaries.
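The evaluation described above can be approximated with scikit-learn. This is a minimal sketch, not the pipeline's code: the synthetic data, the linear SVM, the five stratified folds, and macro-averaged F1 are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a samples x selected-genes matrix
X, y = make_classification(
    n_samples=60, n_features=20, n_informative=5, random_state=0
)

# Scaling lives inside the pipeline so it is refit on each training fold
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each sample is predicted by a model that never saw it during training
y_pred = cross_val_predict(model, X, y, cv=cv)

acc = accuracy_score(y, y_pred)
f1 = f1_score(y, y_pred, average="macro")
cm = confusion_matrix(y, y_pred)
print(f"accuracy={acc:.3f} macro-F1={f1:.3f}")
print(cm)
```

Swapping `SVC` for `KNeighborsClassifier` or `RandomForestClassifier` leaves the rest of the sketch unchanged. Keeping preprocessing inside the pipeline avoids leaking information from test folds into training.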
Results
Analyze → Biomarker Results shows:
- Ranked selected genes with scores
- Classifier performance metrics across CV folds
- Links from Runs and Interpret
KnowSeq alignment
Compared with KnowSeq, the current implementation covers feature selection and ML classification. Not yet implemented:
- Coverage-style multi-class DEG extraction across all pairwise comparisons
- Consistency selection across resampling
- Disease evidence retrieval from external databases
Tips
- Prefer pseudobulk or sample-level aggregation when biological replicates matter; cell-level labels inflate performance because cells from the same sample can land in both training and test folds.
- Check class balance in Explore before running; severe imbalance skews CV metrics, accuracy in particular.
- Use enrichment on the selected gene list in a follow-up DE/enrichment run for pathway context.
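The pseudobulk tip can be illustrated with a small NumPy sketch that sums cell-level counts into one row per sample. The function name `pseudobulk` and the toy arrays are hypothetical, and summing raw counts is one assumed aggregation choice; averaging normalized values is another.

```python
import numpy as np

def pseudobulk(counts, sample_ids):
    """Sum cell-level counts into one row per sample.

    counts: (n_cells, n_genes) array; sample_ids: length n_cells.
    Returns (samples, matrix) with matrix rows ordered as in `samples`.
    """
    counts = np.asarray(counts)
    sample_ids = np.asarray(sample_ids)
    samples, inverse = np.unique(sample_ids, return_inverse=True)
    out = np.zeros((len(samples), counts.shape[1]), dtype=counts.dtype)
    # Unbuffered add so repeated sample indices accumulate correctly
    np.add.at(out, inverse, counts)
    return samples, out

# Four cells from two samples, two genes
counts = np.array([[1, 0], [2, 1], [0, 3], [4, 4]])
sample_ids = ["s1", "s1", "s2", "s2"]
samples, bulk = pseudobulk(counts, sample_ids)
print(samples)
print(bulk)
```

After aggregation, each sample contributes a single observation, so class labels should be taken from sample-level metadata rather than from individual cells.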