Gradient Biotech

Biomarker discovery

Select predictive gene features from expression data, train cross-validated classifiers, and rank biomarker panels for translational studies.

Research question

Which genes best discriminate between sample classes, and how reliably does a classifier predict group membership in cross-validation?

Who this is for

  • Translational teams building expression-based stratification signatures
  • Pharma biomarker groups running feature selection before companion diagnostic development
  • Research labs comparing responder vs. non-responder expression profiles

Data requirements

Data | Required | Purpose
Expression matrix in AnnData | Yes | Feature selection and classification input
Class labels in obs | Yes | Default column: condition (configurable per run)
At least two classes | Yes | Supervised feature selection and CV
Sufficient samples per class | Recommended | Reliable cross-validation metrics

Prefer pseudobulk or sample-level aggregation when biological replicates matter. With cell-level labels, cells from the same sample can land in both the training and the test folds, which can inflate performance.
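To illustrate the leakage risk, the sketch below assigns cross-validation folds by sample ID rather than by cell, so every cell from one sample stays in the same fold. It is plain Python with hypothetical names (`assign_group_folds`, `sample_ids`) and made-up IDs. It is not the platform's internal splitter.

```python
from collections import defaultdict


def assign_group_folds(sample_ids, n_folds):
    """Map each cell to a fold so that no sample spans two folds.

    sample_ids: one sample identifier per cell (hypothetical input).
    Returns a list of fold indices, one per cell.
    """
    unique_samples = sorted(set(sample_ids))
    if n_folds < 2 or n_folds > len(unique_samples):
        raise ValueError("n_folds must be between 2 and the number of samples")
    # Round-robin over sorted sample IDs keeps the assignment deterministic.
    fold_of_sample = {s: i % n_folds for i, s in enumerate(unique_samples)}
    return [fold_of_sample[s] for s in sample_ids]


# Six cells drawn from three patients (made-up IDs).
cells = ["P1", "P1", "P2", "P3", "P2", "P3"]
folds = assign_group_folds(cells, n_folds=3)

# Check that every sample maps to exactly one fold.
folds_per_sample = defaultdict(set)
for sample, fold in zip(cells, folds):
    folds_per_sample[sample].add(fold)
assert all(len(f) == 1 for f in folds_per_sample.values())
```

Round-robin assignment ignores class balance within folds. A stratified, group-aware splitter would be the natural next refinement.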

Workflow

Upload labeled data → Explore class balance → Biomarker pipeline → Review ranked genes → Enrichment follow-up

Step 1 — Prepare labeled dataset

Upload an AnnData object with class labels in obs. Confirm sample counts and class balance in Explore before running: under severe imbalance, accuracy can look high even when the minority class is rarely predicted correctly.
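The same check can be approximated outside the app before uploading. The sketch below counts samples per class and flags small classes. The function name and the `min_per_class` threshold are illustrative assumptions, not platform defaults.

```python
from collections import Counter


def summarize_classes(labels, min_per_class=3):
    """Count samples per class and flag classes below a chosen minimum.

    min_per_class is an illustrative threshold, not a platform default.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("supervised selection needs at least two classes")
    too_small = sorted(c for c, n in counts.items() if n < min_per_class)
    # Ratio of largest to smallest class; 1.0 means perfectly balanced.
    imbalance = max(counts.values()) / min(counts.values())
    return dict(counts), too_small, imbalance


# Made-up labels: 12 responders and 2 non-responders.
labels = ["responder"] * 12 + ["non-responder"] * 2
counts, too_small, imbalance = summarize_classes(labels)
print(counts, too_small, imbalance)
```

With an AnnData object in memory, the labels would typically come from the obs column that holds the class (condition by default, per the table above).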

For single-cell data, aggregate to sample or pseudobulk level when replicates define the unit of inference.
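A minimal sketch of pseudobulk aggregation, assuming raw counts are available as one list per cell and that summing counts per sample is the desired aggregation (averaging is another common choice). The names `pseudobulk`, `cell_counts`, and `cell_samples` are hypothetical.

```python
from collections import defaultdict


def pseudobulk(counts, sample_ids):
    """Sum per-cell count vectors into one vector per sample.

    counts: list of per-cell lists, all with one entry per gene.
    sample_ids: sample identifier for each cell.
    """
    if not counts or len(counts) != len(sample_ids):
        raise ValueError("counts and sample_ids must be non-empty and aligned")
    n_genes = len(counts[0])
    totals = defaultdict(lambda: [0] * n_genes)
    for row, sample in zip(counts, sample_ids):
        if len(row) != n_genes:
            raise ValueError("every cell needs the same number of genes")
        acc = totals[sample]
        for j, value in enumerate(row):
            acc[j] += value
    return dict(totals)


# Four cells, three genes, two samples (made-up values).
cell_counts = [[1, 0, 3], [2, 1, 0], [0, 4, 1], [5, 0, 0]]
cell_samples = ["S1", "S1", "S2", "S2"]
bulk = pseudobulk(cell_counts, cell_samples)
print(bulk)
```

After aggregation, each sample contributes one row to the classifier, so the class label must also be defined at sample level.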

Step 2 — Run biomarker pipeline

Under Analyze → Biomarker discovery, launch the pipeline. Stages include:

Stage | Description
WGCNA (optional) | Co-expression networks on bulk-style matrices; skipped automatically on sparse single-cell data
Feature selection | mRMR, random forest importance, or combined rankings
Classification | SVM, k-NN, or random forest with cross-validation
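To make the mRMR idea concrete, here is a simplified greedy sketch: at each step it picks the gene with the highest relevance to the labels minus its mean redundancy with the genes already chosen. It uses absolute Pearson correlation for both terms, which is a simplifying assumption. The pipeline's actual scoring function is not documented here, and the gene names and values are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def mrmr_rank(expression, labels, k):
    """Greedy mRMR (difference form) over a dict of gene -> values."""
    relevance = {g: abs(pearson(v, labels)) for g, v in expression.items()}
    selected = []
    remaining = sorted(expression)  # sorted for deterministic tie-breaking
    while remaining and len(selected) < k:
        scores = {}
        for gene in remaining:
            if selected:
                redundancy = sum(
                    abs(pearson(expression[gene], expression[s]))
                    for s in selected
                ) / len(selected)
            else:
                redundancy = 0.0
            scores[gene] = relevance[gene] - redundancy
        best = max(remaining, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected


labels = [0, 0, 0, 1, 1, 1]
expression = {
    "GENE_A": [0, 0, 1, 1, 2, 2],  # most relevant gene
    "GENE_B": [0, 0, 2, 2, 3, 3],  # nearly duplicates GENE_A
    "GENE_C": [1, 1, 0, 2, 1, 1],  # weaker, but independent of GENE_A
}
panel = mrmr_rank(expression, labels, k=2)
print(panel)
```

In this toy input GENE_B is more relevant than GENE_C on its own, but it is skipped in the second round because it is almost fully correlated with GENE_A.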

Step 3 — Review results

Analyze → Biomarker Results shows:

  • Ranked selected genes with scores
  • Classifier accuracy, F1, and confusion matrix summaries across CV folds
  • Links to Runs and Interpret
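For readers who want to reproduce the summary numbers, the sketch below computes accuracy and F1 per fold and pools a confusion matrix across folds. The fold predictions are made up, and averaging per-fold scores is one common convention, assumed here rather than confirmed for the platform.

```python
from collections import Counter


def fold_metrics(y_true, y_pred, positive):
    """Return (accuracy, F1 for the positive class) on a single CV fold."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return correct / len(pairs), f1


# Two folds of (true labels, predicted labels); R = responder, N = non-responder.
cv_folds = [
    (["R", "R", "N", "N"], ["R", "N", "N", "N"]),
    (["R", "N", "N", "R"], ["R", "N", "R", "R"]),
]

confusion = Counter()  # keys are (true, predicted) pairs
accuracies, f1_scores = [], []
for y_true, y_pred in cv_folds:
    acc, f1 = fold_metrics(y_true, y_pred, positive="R")
    accuracies.append(acc)
    f1_scores.append(f1)
    confusion.update(zip(y_true, y_pred))

mean_accuracy = sum(accuracies) / len(accuracies)
mean_f1 = sum(f1_scores) / len(f1_scores)
print(mean_accuracy, mean_f1, dict(confusion))
```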

Step 4 — Pathway context

Run enrichment on the selected gene list in a follow-up step for GO/KEGG pathway context around the biomarker panel.
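Over-representation analysis is one common form of enrichment. The sketch below computes a one-sided hypergeometric p-value for a single gene set. Whether the platform uses this exact test is an assumption, and the gene IDs and the pathway are placeholders rather than real GO or KEGG content.

```python
from math import comb


def enrichment_pvalue(selected, pathway, background):
    """One-sided hypergeometric p-value for over-representation.

    Returns (overlap size, P[overlap >= observed]) when drawing
    len(selected) genes at random from the background.
    """
    background = set(background)
    selected = set(selected) & background
    pathway = set(pathway) & background
    n_bg, n_path, n_sel = len(background), len(pathway), len(selected)
    overlap = len(selected & pathway)
    tail = sum(
        comb(n_path, i) * comb(n_bg - n_path, n_sel - i)
        for i in range(overlap, min(n_path, n_sel) + 1)
    )
    return overlap, tail / comb(n_bg, n_sel)


background = [f"G{i}" for i in range(1, 21)]  # 20 measured genes
pathway = ["G1", "G2", "G3", "G4", "G5"]      # hypothetical gene set
selected = ["G1", "G2", "G3", "G10"]          # hypothetical biomarker panel
overlap, p_value = enrichment_pvalue(selected, pathway, background)
print(overlap, round(p_value, 4))
```

When many gene sets are tested, the resulting p-values need multiple-testing correction before interpretation.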

Step 5 — Snapshot and figures

Save a snapshot to freeze the biomarker parameter set. Add performance charts and gene ranking tables to the figure canvas for reports.
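One way to think about a frozen parameter set is as a deterministic serialization plus a content hash, so two runs can be compared by fingerprint. This is a conceptual sketch only: the snapshot format used by the app is not described here, and every parameter name below is hypothetical.

```python
import hashlib
import json


def snapshot_params(params):
    """Serialize parameters deterministically and derive a content hash."""
    # sort_keys makes the output independent of dict insertion order.
    payload = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return payload, digest


# Hypothetical parameter names, same values in a different order.
run_a = {"label_column": "condition", "selector": "mrmr",
         "classifier": "svm", "cv_folds": 5}
run_b = {"cv_folds": 5, "classifier": "svm",
         "selector": "mrmr", "label_column": "condition"}

payload_a, digest_a = snapshot_params(run_a)
payload_b, digest_b = snapshot_params(run_b)
assert digest_a == digest_b  # same parameters, same fingerprint
```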

Expected outputs

  • Ranked gene panel with feature selection scores
  • Cross-validated classifier performance metrics
  • Confusion matrix summary across folds
  • Enrichment context for selected genes (when the follow-up run has completed)
  • Reproducible snapshot with pipeline parameters

Typical analyses

Analysis | Classes | Question
IO response | Responder vs. non-responder | Which genes stratify checkpoint inhibitor response?
Disease subtype | Subtype A vs. B vs. C | What expression signature defines each subtype?
Manufacturing QC | Pass vs. fail batch | Can expression predict cell product quality?
CDx development | Treatment vs. control | What minimal gene panel supports stratification?

KnowSeq alignment

The current implementation covers feature selection and ML classification. Coverage-style multi-class DEG extraction, consistency selection across resampling, and disease evidence retrieval are planned extensions.

Related guides