Biomarker discovery

Select predictive gene features from expression data, train cross-validated classifiers, and rank biomarker panels for translational studies.

Research question

Which genes best discriminate between sample classes, and how reliably does a classifier predict group membership in cross-validation?

Who this is for

Translational teams building expression-based stratification signatures
Pharma biomarker groups running feature selection before companion diagnostic development
Research labs comparing responder vs. non-responder expression profiles

Data requirements

Data	Required	Purpose
Expression matrix in AnnData	Yes	Feature selection and classification input
Class labels in `obs`	Yes	Default column `condition`, configurable per run
At least two classes	Yes	Supervised feature selection and CV
Sufficient samples per class	Recommended	Reliable cross-validation metrics

Prefer pseudobulk or sample-level aggregation when biological replicates matter. Cell-level labels can inflate performance, because cells from the same sample can land in both the training and test folds.

Workflow

Upload labeled data → Explore class balance → Biomarker pipeline → Review ranked genes → Enrichment follow-up

Step 1 — Prepare labeled dataset

Upload an AnnData object with class labels in `obs`. Confirm sample counts and class balance in Explore before running: under severe imbalance, accuracy can look high while the minority class is poorly predicted.
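As a rough illustration of the balance check, the sketch below counts labels and flags small or skewed classes. The function name `class_balance` and both thresholds are hypothetical choices for this example, not platform defaults. With AnnData, the labels would typically come from `adata.obs["condition"]`.

```python
from collections import Counter

def class_balance(labels, min_per_class=5, max_ratio=3.0):
    """Count samples per class and flag conditions that weaken CV metrics.

    min_per_class and max_ratio are illustrative thresholds (assumptions).
    """
    counts = Counter(labels)
    smallest = min(counts.values())
    largest = max(counts.values())
    warnings = []
    if len(counts) < 2:
        warnings.append("fewer than two classes")
    if smallest < min_per_class:
        warnings.append(f"smallest class has {smallest} samples (< {min_per_class})")
    if largest / smallest > max_ratio:
        warnings.append(f"imbalance ratio {largest / smallest:.1f} exceeds {max_ratio}")
    return dict(counts), warnings

# Toy labels: 6 responders vs. 24 non-responders
labels = ["responder"] * 6 + ["non_responder"] * 24
counts, warnings = class_balance(labels)
print(counts)
print(warnings)
```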

For single-cell data, aggregate to sample or pseudobulk level when replicates define the unit of inference.
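One way to aggregate outside the platform is to sum cell-level counts per sample, as in the minimal NumPy sketch below. The function name `pseudobulk` and the choice of summing raw counts are assumptions for illustration; other aggregations (for example, means of normalized values) are also used in practice.

```python
import numpy as np

def pseudobulk(counts, sample_ids, labels):
    """Sum cell-level counts per sample and return one label per sample."""
    sample_ids = np.asarray(sample_ids)
    labels = np.asarray(labels)
    samples, inverse = np.unique(sample_ids, return_inverse=True)
    bulk = np.zeros((len(samples), counts.shape[1]), dtype=counts.dtype)
    # Accumulate each cell's row into the row of the sample it belongs to.
    np.add.at(bulk, inverse, counts)
    sample_labels = []
    for s in samples:
        uniq = np.unique(labels[sample_ids == s])
        if len(uniq) != 1:
            raise ValueError(f"sample {s!r} has mixed labels: {uniq.tolist()}")
        sample_labels.append(uniq[0])
    return bulk, samples, np.array(sample_labels)

# Toy data: 6 cells x 3 genes drawn from 3 samples
counts = np.array([[1, 0, 2], [0, 1, 1], [3, 0, 0],
                   [1, 1, 1], [0, 2, 0], [2, 2, 2]])
sample_ids = ["s1", "s1", "s2", "s2", "s3", "s3"]
labels = ["case", "case", "control", "control", "case", "case"]

bulk, samples, sample_labels = pseudobulk(counts, sample_ids, labels)
print(bulk)
print(sample_labels)
```

After aggregation, each row is one biological replicate, so cross-validation folds split samples rather than cells.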

Step 2 — Run biomarker pipeline

Under Analyze → Biomarker discovery, launch the pipeline. Stages include:

Stage	Description
WGCNA (optional)	Co-expression networks on bulk-style matrices; skipped automatically on sparse single-cell data
Feature selection	mRMR, random forest importance, or combined rankings
Classification	SVM, k-NN, or random forest with cross-validation
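To give a feel for what an mRMR-style ranking does, the sketch below greedily picks genes that correlate with the class label while penalizing correlation with genes already chosen. It is a simplified stand-in: it uses absolute Pearson correlation rather than mutual information, assumes a binary label coded 0/1, and is not the pipeline's implementation. The function name `mrmr_rank` and the toy data are hypothetical.

```python
import numpy as np

def mrmr_rank(X, y, k):
    """Greedy mRMR-style ranking with absolute Pearson correlation.

    Relevance:  |corr(gene, label)|
    Redundancy: mean |corr(gene, already selected genes)|
    Each step picks the gene maximizing relevance minus redundancy.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    k = min(k, X.shape[1])
    Xc = X - X.mean(axis=0)
    Xc /= np.linalg.norm(Xc, axis=0) + 1e-12   # unit-norm centered columns
    yc = y - y.mean()
    yc /= np.linalg.norm(yc) + 1e-12
    relevance = np.abs(Xc.T @ yc)
    selected = [int(np.argmax(relevance))]
    redundancy_sum = np.zeros(X.shape[1])
    while len(selected) < k:
        # Add correlation with the most recently selected gene.
        redundancy_sum += np.abs(Xc.T @ Xc[:, selected[-1]])
        score = relevance - redundancy_sum / len(selected)
        score[selected] = -np.inf              # never pick a gene twice
        selected.append(int(np.argmax(score)))
    return selected

# Toy data: gene 0 is informative, gene 1 is a near copy of gene 0,
# gene 2 is weakly informative, the rest are noise.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 50))
X[:, 0] += 4.0 * y
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=60)
X[:, 2] += 1.5 * y

selected = mrmr_rank(X, y, k=5)
print(selected)
```

The redundancy penalty is why the near-duplicate gene is not chosen directly after its twin, even though both are equally relevant on their own.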

Step 3 — Review results

Analyze → Biomarker Results shows:

Ranked selected genes with scores
Classifier accuracy, F1, and confusion matrix summaries across CV folds
Links to Runs and Interpret
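For readers who want to reproduce this kind of summary outside the platform, the following scikit-learn sketch computes per-fold accuracy, macro F1, and a confusion matrix summed across folds. It is an illustration under assumptions: synthetic data, random forest importance for selection, a linear SVM, and 5 stratified folds, none of which are confirmed pipeline defaults. Feature selection sits inside the `Pipeline` so it is refit on each training fold and held-out samples never influence which genes are chosen.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic expression matrix: 60 samples x 200 genes, 5 informative genes
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 200))
X[:, :5] += 2.5 * y[:, None]

pipe = Pipeline([
    # Keep the 10 genes with the highest random forest importance.
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=200, random_state=0),
        threshold=-np.inf, max_features=10)),
    ("clf", SVC(kernel="linear")),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_acc, fold_f1 = [], []
total_cm = np.zeros((2, 2), dtype=int)   # rows: true class, columns: predicted

for train_idx, test_idx in cv.split(X, y):
    pipe.fit(X[train_idx], y[train_idx])
    pred = pipe.predict(X[test_idx])
    fold_acc.append(accuracy_score(y[test_idx], pred))
    fold_f1.append(f1_score(y[test_idx], pred, average="macro"))
    total_cm += confusion_matrix(y[test_idx], pred, labels=[0, 1])

print(f"accuracy: {np.mean(fold_acc):.3f} +/- {np.std(fold_acc):.3f}")
print(f"macro F1: {np.mean(fold_f1):.3f} +/- {np.std(fold_f1):.3f}")
print(total_cm)
```

Reporting the spread across folds alongside the mean helps show how stable the classifier is when sample counts are small.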

Step 4 — Pathway context

In a follow-up step, run enrichment on the selected gene list to place the biomarker panel in GO/KEGG pathway context.
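Over-representation enrichment is commonly scored with a one-sided hypergeometric test. The standard-library sketch below shows the arithmetic for a single gene set. The gene identifiers and the `hypergeom_enrichment` helper are hypothetical, and a real run over many GO/KEGG terms would also need multiple-testing correction.

```python
from math import comb

def hypergeom_enrichment(selected, pathway, background):
    """Return (overlap, p-value) for P(X >= overlap) under the hypergeometric null."""
    background = set(background)
    selected = set(selected) & background
    pathway = set(pathway) & background
    N, K, n = len(background), len(pathway), len(selected)
    k = len(selected & pathway)
    # Sum the probability of observing k or more pathway genes in the panel.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return k, tail / comb(N, n)

# Toy example: 20 background genes, a 5-gene pathway, a 4-gene panel
background = [f"g{i}" for i in range(20)]
pathway = ["g0", "g1", "g2", "g3", "g4"]
panel = ["g0", "g1", "g2", "g10"]

k, p = hypergeom_enrichment(panel, pathway, background)
print(k, round(p, 4))
```

The background should be the set of genes that could have been selected (for example, all genes passed to feature selection), not the whole genome.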

Step 5 — Snapshot and figures

Save a snapshot to freeze the biomarker parameter set. Add performance charts and gene ranking tables to the figure canvas for reports.

Expected outputs

Ranked gene panel with feature selection scores
Cross-validated classifier performance metrics
Confusion matrix summary across folds
Enrichment context for selected genes (when the follow-up run has completed)
Reproducible snapshot with pipeline parameters

Typical analyses

Analysis	Classes	Question
IO response	Responder vs. non-responder	Which genes stratify checkpoint inhibitor response?
Disease subtype	Subtype A vs. B vs. C	What expression signature defines each subtype?
Manufacturing QC	Pass vs. fail batch	Can expression predict cell product quality?
CDx development	Treatment vs. control	What minimal gene panel supports stratification?

KnowSeq alignment

The current implementation covers feature selection and ML classification. Coverage-style multi-class DEG extraction, consistency selection across resampling, and disease evidence retrieval are planned extensions.
