Biomarker discovery
Select predictive gene features from expression data, train cross-validated classifiers, and rank biomarker panels for translational studies.
Research question
Which genes best discriminate between sample classes, and how reliably does a classifier predict group membership in cross-validation?
Who this is for
- Translational teams building expression-based stratification signatures
- Pharma biomarker groups running feature selection before companion diagnostic development
- Research labs comparing responder vs. non-responder expression profiles
Data requirements
| Data | Required | Purpose |
|---|---|---|
| Expression matrix in AnnData | Yes | Feature selection and classification input |
| Class labels in obs | Yes | Default column: condition; configurable per run |
| At least two classes | Yes | Supervised feature selection and CV |
| Sufficient samples per class | Recommended | Reliable cross-validation metrics |
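Prefer pseudobulk or sample-level aggregation when biological replicates matter. Cells from the same sample are not independent, so cell-level labels can put near-duplicate observations into both training and test folds and inflate performance.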
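The requirements in the table above can be checked before a run is launched. The following is a minimal sketch, not the platform's own validation. It assumes the labels have already been extracted from obs as a plain list (for example `list(adata.obs["condition"])`). The function name `check_class_labels` and the `min_per_class` threshold of 5 are illustrative assumptions, not documented defaults.

```python
from collections import Counter


def check_class_labels(labels, min_per_class=5):
    """Return (counts, problems) for a sequence of class labels.

    min_per_class is an illustrative threshold, not a platform default.
    """
    # Drop None entries so that missing labels are not counted as a class.
    present = [lab for lab in labels if lab is not None]
    counts = Counter(present)
    problems = []
    if len(present) < len(labels):
        problems.append(f"{len(labels) - len(present)} samples have no label")
    if len(counts) < 2:
        problems.append("fewer than two classes")
    for cls, n in sorted(counts.items()):
        if n < min_per_class:
            problems.append(f"class {cls!r} has only {n} samples")
    return dict(counts), problems


# Toy labels standing in for the obs label column.
labels = ["responder"] * 8 + ["non_responder"] * 3
counts, problems = check_class_labels(labels)
print(counts)
print(problems)
```

An empty `problems` list means the two hard requirements are met and every class reaches the chosen minimum. What counts as "sufficient" depends on the number of CV folds and the effect size, so the threshold should be set per study.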
Workflow
Upload labeled data → Explore class balance → Biomarker pipeline → Review ranked genes → Enrichment follow-up
Step 1 — Prepare labeled dataset
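Upload an AnnData object with class labels in obs. Confirm sample counts and class balance in Explore before running: under severe imbalance, accuracy can look high even when the minority class is rarely predicted correctly, so read it alongside F1 and the confusion matrix.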
For single-cell data, aggregate to sample or pseudobulk level when replicates define the unit of inference.
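As an illustration of sample-level aggregation, the sketch below sums cell-level counts per sample with NumPy. It is not the platform's aggregation routine. It assumes a dense cells x genes count matrix, a per-cell sample identifier, and a per-cell class label. The function name `pseudobulk` is hypothetical, and summing counts is only one common aggregation choice.

```python
import numpy as np


def pseudobulk(counts, sample_ids, labels):
    """Sum cell-level counts per sample and keep one class label per sample.

    counts: (n_cells, n_genes) array; sample_ids and labels: length n_cells.
    """
    counts = np.asarray(counts)
    sample_ids = np.asarray(sample_ids)
    labels = np.asarray(labels)
    # np.unique returns the sorted sample names and each cell's sample index.
    samples, inverse = np.unique(sample_ids, return_inverse=True)
    summed = np.zeros((len(samples), counts.shape[1]), dtype=counts.dtype)
    # Unbuffered add: every cell's row is accumulated into its sample's row.
    np.add.at(summed, inverse, counts)
    sample_labels = []
    for s in samples:
        cell_labels = np.unique(labels[sample_ids == s])
        if len(cell_labels) != 1:
            raise ValueError(f"sample {s} has mixed class labels")
        sample_labels.append(cell_labels[0])
    return samples, summed, np.array(sample_labels)


# Toy data: four cells, two genes, two samples.
counts = np.array([[1, 0], [2, 1], [0, 3], [4, 4]])
sample_ids = ["s1", "s1", "s2", "s2"]
labels = ["resp", "resp", "non", "non"]
samples, bulk, bulk_labels = pseudobulk(counts, sample_ids, labels)
print(samples, bulk, bulk_labels)
```

After aggregation each row is one biological replicate, so cross-validation splits fall between samples rather than between cells of the same sample.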
Step 2 — Run biomarker pipeline
Under Analyze → Biomarker discovery, launch the pipeline. Stages include:
| Stage | Description |
|---|---|
| WGCNA (optional) | Co-expression networks on bulk-style matrices; skipped automatically on sparse single-cell data |
| Feature selection | mRMR, random forest importance, or combined rankings |
| Classification | SVM, k-NN, or random forest with cross-validation |
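The feature selection and classification stages can be approximated outside the platform with scikit-learn. The sketch below is an assumption-laden stand-in, not the pipeline's implementation: it uses synthetic data, covers only random forest importance (mRMR is not part of scikit-learn), and every parameter value (20 genes, 200 trees, linear SVM, 5 folds) is illustrative. The selector sits inside the pipeline so that it is refit on each training fold, a common precaution against selecting genes with knowledge of the held-out samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a samples x genes matrix with two classes.
X, y = make_classification(
    n_samples=60, n_features=200, n_informative=10, random_state=0
)

# Keep the 20 features with the highest random forest importance.
# threshold=-np.inf makes max_features the only selection criterion.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=0),
    max_features=20,
    threshold=-np.inf,
)
pipeline = make_pipeline(selector, StandardScaler(), SVC(kernel="linear"))

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
print("per-fold accuracy:", scores)

# Refit on all samples to obtain one ranked list of feature indices.
pipeline.fit(X, y)
importances = pipeline.named_steps["selectfrommodel"].estimator_.feature_importances_
ranked = np.argsort(importances)[::-1][:20]
print("top-ranked feature indices:", ranked)
```

The ranking from the final refit is what one would report as the panel, while the cross-validated scores estimate how well a panel chosen this way generalizes.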
Step 3 — Review results
Analyze → Biomarker Results shows:
- Ranked selected genes with scores
- Classifier accuracy, F1, and confusion matrix summaries across CV folds
- Links to Runs and Interpret
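To make the summary metrics concrete, the sketch below pools held-out predictions from several folds into a confusion matrix, mean per-fold accuracy, and macro-averaged F1. It is written in plain Python to show the definitions. The function name `summarize_folds` and the choice of macro averaging are assumptions, and the platform may aggregate across folds differently.

```python
from collections import Counter


def summarize_folds(folds):
    """Pool (y_true, y_pred) pairs from CV folds into summary metrics."""
    classes = sorted(
        {lab for y_true, y_pred in folds for lab in list(y_true) + list(y_pred)}
    )
    # confusion[(true, pred)] counts held-out samples pooled over all folds.
    confusion = Counter()
    fold_accuracy = []
    for y_true, y_pred in folds:
        pairs = list(zip(y_true, y_pred))
        confusion.update(pairs)
        fold_accuracy.append(sum(t == p for t, p in pairs) / len(pairs))
    f1 = {}
    for c in classes:
        tp = confusion[(c, c)]
        fp = sum(confusion[(t, c)] for t in classes if t != c)
        fn = sum(confusion[(c, p)] for p in classes if p != c)
        denom = 2 * tp + fp + fn
        f1[c] = 2 * tp / denom if denom else 0.0
    matrix = [[confusion[(t, p)] for p in classes] for t in classes]
    return {
        "classes": classes,
        "confusion_matrix": matrix,  # rows: true class, columns: predicted
        "mean_accuracy": sum(fold_accuracy) / len(fold_accuracy),
        "macro_f1": sum(f1.values()) / len(f1),
    }


# Two toy folds with classes R (responder) and N (non-responder).
folds = [
    (["R", "R", "N", "N"], ["R", "N", "N", "N"]),
    (["R", "R", "N", "N"], ["R", "R", "N", "R"]),
]
summary = summarize_folds(folds)
print(summary)
```

Reading the confusion matrix next to accuracy shows which class the errors come from, which matters most when classes are imbalanced.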
Step 4 — Pathway context
Run enrichment on the selected gene list in a follow-up step for GO/KEGG pathway context around the biomarker panel.
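GO/KEGG enrichment of a gene list is commonly computed as an over-representation test based on the hypergeometric distribution. The sketch below shows that calculation for a single gene set. It is an illustration of the general technique, not a description of the platform's enrichment step. The function name and the toy gene identifiers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random."""
    max_overlap = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy gene sets with placeholder identifiers.
background = {f"gene{i}" for i in range(20)}
pathway = {"gene0", "gene1", "gene2", "gene3", "gene4"}
selected = {"gene0", "gene1", "gene2", "gene10"}

p_value = hypergeom_enrichment_p(
    n_background=len(background),
    n_pathway=len(pathway & background),
    n_selected=len(selected & background),
    n_overlap=len(selected & pathway),
)
print(p_value)
```

The background should be the set of genes that could have been selected (for example, all genes in the expression matrix), and p-values computed across many gene sets need multiple-testing correction before interpretation.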
Step 5 — Snapshot and figures
Save a snapshot to freeze the biomarker parameter set. Add performance charts and gene ranking tables to the figure canvas for reports.
Expected outputs
- Ranked gene panel with feature selection scores
- Cross-validated classifier performance metrics
- Confusion matrix summary across folds
- Enrichment context for selected genes (once the follow-up enrichment run has completed)
- Reproducible snapshot with pipeline parameters
Typical analyses
| Analysis | Classes | Question |
|---|---|---|
| IO response | Responder vs. non-responder | Which genes stratify checkpoint inhibitor response? |
| Disease subtype | Subtype A vs. B vs. C | What expression signature defines each subtype? |
| Manufacturing QC | Pass vs. fail batch | Can expression predict cell product quality? |
| CDx development | Treatment vs. control | What minimal gene panel supports stratification? |
KnowSeq alignment
The current implementation covers feature selection and ML classification. Coverage-style multi-class DEG extraction, consistency selection across resampling, and disease evidence retrieval are planned extensions.