Key concepts
Computational Biology in Gradient BioSystems is organized around molecular measurements, statistical comparisons, and reproducible analysis runs. This page explains both the biology concepts behind the workflows and the app concepts used to manage them.
Gene expression data
Most workflows start from a gene expression matrix. Rows usually represent cells, spots, or samples. Columns represent genes or other measured features. Each value is a count or normalized expression estimate for one feature in one observation.
Common observation types:
- Cell: one captured cell in a single-cell ribonucleic acid sequencing (single-cell RNA-seq) experiment.
- Sample: one biological sample in a bulk ribonucleic acid sequencing (bulk RNA-seq) experiment.
- Spot: one spatial capture location on a tissue slide.
Common feature types:
- Gene: a transcribed genomic locus measured by RNA sequencing.
- Transcript: an isoform-level measurement, when available.
- Protein or antibody tag: a protein-level feature from cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq) or another multimodal assay, when present in the dataset.
Raw expression counts are not directly comparable across cells or samples because sequencing depth, capture efficiency, and technical noise vary. Analysis workflows therefore apply quality control, normalization, feature selection, and statistical modeling before interpretation.
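To illustrate why raw counts need scaling, here is a minimal sketch of depth normalization followed by a log transform. The target sum of 10,000 and the `log1p` transform are common conventions assumed for illustration; they are not confirmed defaults of the app, and `normalize_counts` is a hypothetical helper.

```python
import math

def normalize_counts(counts, target_sum=10_000):
    """Scale one observation's raw counts to a common total, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]  # empty observation: nothing to scale
    return [math.log1p(c / total * target_sum) for c in counts]

# Two hypothetical cells with the same composition but a 10x difference in depth.
shallow_cell = [1, 3, 0, 6]     # 10 counts in total
deep_cell = [10, 30, 0, 60]     # 100 counts in total

normalized_shallow = normalize_counts(shallow_cell)
normalized_deep = normalize_counts(deep_cell)
```

After scaling, the two cells have matching profiles even though their raw counts differ tenfold, which is the property that makes downstream comparisons fairer.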
Single-cell RNA sequencing
Single-cell ribonucleic acid sequencing (single-cell RNA-seq) measures gene expression in many individual cells. The goal is usually to find cell populations, describe marker genes, compare conditions, or infer biological programs.
Core single-cell concepts:
| Concept | Meaning |
|---|---|
| Cell barcode | Identifier for one captured cell or droplet. |
| Unique molecular identifier (UMI) count | Number of distinct tagged molecules observed for a gene in a cell, used to estimate expression abundance. |
| Quality control (QC) metrics | Measurements such as total counts, detected genes, and mitochondrial percentage used to flag low-quality cells. |
| Normalization | Scaling expression so cells can be compared more fairly. |
| Highly variable genes | Genes with informative variation across cells; often used for clustering. |
| Dimensionality reduction | Methods such as principal component analysis (PCA) and uniform manifold approximation and projection (UMAP) that summarize high-dimensional expression patterns. |
| Neighbor graph | A graph connecting transcriptionally similar cells. |
| Cluster | A group of cells with similar expression profiles. |
| Marker gene | A gene enriched in one cluster, condition, or cell type. |
In the app, quality control, normalization, clustering, differential expression, enrichment, and figure generation are separate runs so that each step has recorded inputs and parameters.
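The QC metrics in the table above can be sketched for a single cell as follows. This is an illustration under assumptions: mitochondrial genes are recognized by the `MT-` symbol prefix (the human gene symbol convention, matched case-insensitively so mouse-style `mt-` symbols also match), and the flagging thresholds are hypothetical values, not the app's defaults.

```python
def qc_metrics(gene_names, counts, mito_prefix="MT-"):
    """Compute total counts, detected genes, and mitochondrial percentage for one cell."""
    total = sum(counts)
    detected = sum(1 for c in counts if c > 0)
    mito = sum(
        c for gene, c in zip(gene_names, counts)
        if gene.upper().startswith(mito_prefix)
    )
    pct_mito = 100.0 * mito / total if total else 0.0
    return {"total_counts": total, "detected_genes": detected, "pct_mito": pct_mito}

genes = ["CD3D", "MS4A1", "MT-CO1", "MT-ND1", "ACTB"]
cell_counts = [4, 0, 3, 3, 10]
metrics = qc_metrics(genes, cell_counts)

# Hypothetical thresholds; suitable cutoffs depend on tissue and protocol.
is_low_quality = metrics["pct_mito"] > 20.0 or metrics["detected_genes"] < 3
```

Here the cell has 20 total counts, 4 detected genes, and 30% mitochondrial counts, so it would be flagged under these example thresholds.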
Clustering and cell states
Clustering groups cells with similar expression profiles. Clusters often correspond to cell types, activation states, differentiation stages, or technical artifacts. A cluster is not automatically a biological cell type. It becomes interpretable when marker genes, metadata, known biology, and experimental design agree.
For example, a cluster enriched for CD3D, CD3E, and TRAC may represent T cells, while a cluster enriched for MS4A1 and CD79A may represent B cells. Marker genes should be checked against the tissue, species, disease context, and sample metadata before assigning labels.
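A first-pass label suggestion from marker overlap could look like the sketch below. The reference sets reuse the T cell and B cell markers named above; the scoring rule (fraction of reference markers found among a cluster's markers) and the function name are assumptions for illustration, and the output is a suggestion to review, not an assigned label.

```python
MARKER_SETS = {
    "T cell": {"CD3D", "CD3E", "TRAC"},
    "B cell": {"MS4A1", "CD79A"},
}

def suggest_labels(cluster_markers, marker_sets=MARKER_SETS):
    """Return (label, fraction of reference markers found), best match first."""
    found = set(cluster_markers)
    scores = [
        (label, len(found & reference) / len(reference))
        for label, reference in marker_sets.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Hypothetical marker list for one cluster.
cluster_markers = ["CD3E", "TRAC", "IL7R", "CD3D"]
suggestions = suggest_labels(cluster_markers)
```

For this cluster the T cell set is fully matched and the B cell set is not matched at all, but the suggestion still needs to be checked against tissue, species, and sample metadata.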
Differential expression
Differential expression compares gene expression between groups. A contrast might compare treated versus control samples, one cluster versus all other cells, or disease versus healthy tissue.
Important terms:
- Contrast: the comparison being tested, such as `treated vs control`.
- Log fold change: the estimated expression difference on a log scale.
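- P value: the probability of observing an effect at least this extreme for one gene if the null hypothesis of no difference were true.
- Adjusted P value / false discovery rate (FDR): a significance estimate corrected for testing many genes at once, which controls the expected proportion of false positives among the genes called significant.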
- Effect size: the magnitude of the expression change, which should be considered alongside statistical significance.
Statistical significance does not guarantee biological importance. Strong interpretation should consider replicate structure, batch effects, sample balance, effect size, and whether the gene is plausible for the tissue and condition.
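Two of the terms above can be made concrete with a short sketch: a log2 fold change between group means and a Benjamini-Hochberg adjustment of P values. The pseudocount of 1 and the choice of Benjamini-Hochberg are assumptions for illustration; the app's differential expression method may estimate both quantities differently.

```python
import math

def log2_fold_change(mean_treated, mean_control, pseudocount=1.0):
    """Log2 ratio of group means; the pseudocount avoids division by zero."""
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

lfc = log2_fold_change(mean_treated=7.0, mean_control=1.0)  # (7 + 1) / (1 + 1) = 4
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
```

With these inputs the fold change is 2 on the log2 scale, and the adjusted values are approximately 0.02, 0.04, 0.04, and 0.02. The adjustment never makes a P value smaller, which is why a gene that looks significant alone may not survive correction.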
Pathway enrichment
Pathway enrichment maps a gene list to known biological processes, pathways, or gene sets. It helps answer questions such as "Which immune programs are overrepresented in the upregulated genes?" or "Do these markers point to cell cycle, hypoxia, interferon response, or metabolism?"
Typical inputs are ranked or filtered gene lists from differential expression or marker analysis. Typical outputs include enriched pathways, overlap counts, enrichment scores, and adjusted P values.
Enrichment is a summarization method, not proof that a pathway is active. Results depend on the gene universe, database, species mapping, threshold choices, and whether the input gene list reflects the biology being asked about.
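The dependence on the gene universe is easy to see in an over-representation test. The sketch below uses a one-sided hypergeometric test, which is one common way to score overlap counts; it is an assumption that this resembles what the app runs, and the function name and example numbers are hypothetical.

```python
from math import comb

def overrepresentation_p_value(universe_size, pathway_size, list_size, overlap):
    """P(overlap or more pathway genes in the list) under random sampling."""
    max_overlap = min(pathway_size, list_size)
    total = comb(universe_size, list_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total

# Same pathway, gene list, and overlap; only the gene universe differs.
p_small_universe = overrepresentation_p_value(2_000, 100, 50, 8)
p_large_universe = overrepresentation_p_value(20_000, 100, 50, 8)
```

The same overlap of 8 genes is far more surprising against a universe of 20,000 genes than against 2,000, so the two P values differ even though the gene list and pathway are identical.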
Spatial transcriptomics
Spatial transcriptomics preserves tissue coordinates while measuring gene expression. Instead of only asking which cells or spots are similar, spatial analysis asks where expression programs occur in the tissue.
Core spatial concepts:
- Spot or capture location: a measured location on the tissue slide.
- Tissue coordinate: x/y position used for spatial plots.
- Spatial domain: a region of tissue with similar expression or morphology-associated signal.
- Gene overlay: expression of one gene plotted over tissue coordinates.
- Spatial marker: a gene enriched in a spatial domain or anatomical region.
Spatial domains should be interpreted with histology, tissue landmarks, and known anatomy when available. Expression alone may identify structure, but tissue context is needed to name it confidently.
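As a rough illustration of how spots, coordinates, and domains fit together, the sketch below averages one gene's expression per spatial domain. The spot records, domain names, and field names are hypothetical and do not reflect the app's storage format.

```python
from collections import defaultdict

# Hypothetical spots: tissue coordinates, an assigned domain, and one gene's expression.
spots = [
    {"x": 0, "y": 0, "domain": "cortex", "expression": 4.0},
    {"x": 1, "y": 0, "domain": "cortex", "expression": 6.0},
    {"x": 0, "y": 1, "domain": "medulla", "expression": 1.0},
    {"x": 1, "y": 1, "domain": "medulla", "expression": 0.0},
]

def mean_expression_by_domain(spots):
    """Average the expression value over the spots in each domain."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for spot in spots:
        totals[spot["domain"]] += spot["expression"]
        counts[spot["domain"]] += 1
    return {domain: totals[domain] / counts[domain] for domain in totals}

domain_means = mean_expression_by_domain(spots)
```

A gene with a much higher mean in one domain is a candidate spatial marker; a gene overlay would plot the same `expression` values at each spot's `x` and `y` position.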
Biomarker discovery
Biomarker discovery searches for genes or features that distinguish groups, predict labels, or summarize a biological state. In this product area it is intended as exploratory feature discovery and model evaluation, not a clinical diagnostic claim.
Important concepts:
- Candidate feature: a gene or measurement considered as a possible marker.
- Feature selection: choosing a smaller set of informative features from many measured genes.
- Classifier: a model trained to distinguish labels such as disease versus control.
- Cross-validation: repeated train/test splitting used to estimate model generalization.
- Ranked panel: an ordered list of candidate markers and their scores.
Good biomarker analysis needs independent validation, careful handling of batch effects, and enough biological replicates to avoid learning study-specific artifacts.
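Cross-validation can be sketched with a deliberately simple classifier: a threshold on one candidate gene, placed midway between the class means of the training fold. Everything here is an assumption for illustration, including the contiguous (unshuffled) folds, the threshold rule, and the example values; a real analysis would use stratified or grouped splits and a proper model.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

def fit_threshold(values, labels):
    """Midpoint between class means; assumes both classes occur in the training data."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def cross_validated_accuracy(values, labels, k):
    """Fit the threshold on each training fold and score it on the held-out fold."""
    correct = 0
    for train, test in k_fold_indices(len(values), k):
        threshold = fit_threshold([values[i] for i in train], [labels[i] for i in train])
        # Predict class 1 when expression exceeds the threshold.
        correct += sum((values[i] > threshold) == (labels[i] == 1) for i in test)
    return correct / len(values)

# Hypothetical expression of one candidate gene; 1 = disease, 0 = control.
expression = [8.1, 1.2, 7.5, 0.9, 9.0, 1.5, 6.8, 2.0]
labels = [1, 0, 1, 0, 1, 0, 1, 0]
accuracy = cross_validated_accuracy(expression, labels, k=4)
```

The key design point is that the threshold is refit inside each fold, so the held-out samples never influence the model that scores them. Feature selection should follow the same rule to avoid optimistic estimates.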
Bulk RNA sequencing
Bulk ribonucleic acid sequencing (bulk RNA-seq) measures average expression across many cells in each sample. It is useful for comparing conditions across biological replicates but does not resolve individual cell populations.
Bulk analysis depends heavily on:
- Replicates: independent biological samples within each group.
- Design matrix: the statistical model describing condition, batch, donor, or other covariates.
- Contrast: the specific comparison tested by the model.
- Normalization: sample-level scaling for library size and composition.
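The design matrix and contrast described above can be sketched with indicator (dummy) coding. The sample table, column naming scheme, and `design_matrix` helper are hypothetical; modeling packages build the equivalent matrix from a formula.

```python
# Hypothetical sample metadata for a two-condition, two-batch experiment.
samples = [
    {"sample": "S1", "condition": "control", "batch": "A"},
    {"sample": "S2", "condition": "control", "batch": "B"},
    {"sample": "S3", "condition": "treated", "batch": "A"},
    {"sample": "S4", "condition": "treated", "batch": "B"},
]

def design_matrix(samples, reference_levels):
    """Build an intercept plus one indicator column per non-reference factor level."""
    indicator_columns = [
        (factor, level)
        for factor, reference in reference_levels.items()
        for level in sorted({s[factor] for s in samples})
        if level != reference
    ]
    columns = ["intercept"] + [f"{factor}_{level}" for factor, level in indicator_columns]
    rows = [
        [1] + [1 if s[factor] == level else 0 for factor, level in indicator_columns]
        for s in samples
    ]
    return columns, rows

columns, rows = design_matrix(samples, {"condition": "control", "batch": "A"})

# Contrast selecting the treated-versus-control coefficient, adjusted for batch.
contrast = [1 if name == "condition_treated" else 0 for name in columns]
```

Including `batch_B` as a column lets the model estimate the treatment effect while accounting for batch, which is only possible when each batch contains samples from both conditions.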
The application programming interface (API) supports bulk RNA-seq analysis. The guided user interface (UI) is still being expanded, so bulk workflows may expose fewer interactive controls than single-cell and spatial workflows.
Metadata and experimental design
Metadata maps experimental facts to observations. Examples include donor, condition, batch, tissue, timepoint, treatment, sample ID, and cluster label. In single-cell data, metadata usually lives at the cell level. In bulk data, it usually lives at the sample level.
Experimental design defines how metadata should be used statistically. This matters because the same expression matrix can support different questions depending on the grouping variable, replicate structure, and covariates.
Examples:
- Compare treatment within one cell type.
- Compare disease versus control while accounting for batch.
- Find markers for one cluster against all other clusters.
- Test spatial domains for region-specific expression.
Study
A study is the top-level container for one research project. It holds datasets, metadata, design tables, contrasts, pipeline runs, snapshots, and figures. In the application programming interface and database the same object is called an experiment; the user interface uses "study" for clarity.
Dataset and AnnData
A dataset is an uploaded or converted data file attached to a study. The internal format is AnnData (.h5ad), a common format for annotated expression matrices.
AnnData stores:
- `X`: the expression matrix.
- `obs`: observation metadata, such as cells, spots, or samples.
- `var`: feature metadata, usually genes.
- `obsm`: embeddings or coordinates, such as uniform manifold approximation and projection coordinates or spatial positions.
- `uns`: unstructured results and analysis metadata.
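The slots are aligned by position: `obs` has one entry per row of `X`, `var` has one entry per column, and each `obsm` array has one entry per observation. The sketch below mimics that layout with a plain Python class to show the alignment rule. It is a stand-in for illustration only, not the `anndata` package, and the `X_umap` key follows the naming convention used by Scanpy.

```python
from dataclasses import dataclass, field

@dataclass
class AnnotatedMatrix:
    """Plain-Python stand-in for the AnnData slot layout (not the anndata package)."""
    X: list                                   # n_obs rows, each with n_vars values
    obs: list                                 # one metadata dict per observation
    var: list                                 # one metadata dict per feature
    obsm: dict = field(default_factory=dict)  # per-observation arrays, keyed by name
    uns: dict = field(default_factory=dict)   # free-form analysis metadata

    def validate(self):
        """Check that every slot lines up with the shape of X; return (n_obs, n_vars)."""
        n_obs, n_vars = len(self.obs), len(self.var)
        if len(self.X) != n_obs or any(len(row) != n_vars for row in self.X):
            raise ValueError("X must have shape (n_obs, n_vars)")
        for key, values in self.obsm.items():
            if len(values) != n_obs:
                raise ValueError(f"obsm[{key!r}] must have one entry per observation")
        return n_obs, n_vars

# Hypothetical dataset with two cells and three genes.
dataset = AnnotatedMatrix(
    X=[[0, 3, 1], [2, 0, 5]],
    obs=[{"cell": "AAAC", "condition": "control"}, {"cell": "AAAG", "condition": "treated"}],
    var=[{"gene": "CD3D"}, {"gene": "MS4A1"}, {"gene": "ACTB"}],
    obsm={"X_umap": [[0.1, 1.2], [2.3, -0.4]]},
    uns={"normalization": "none"},
)
shape = dataset.validate()
```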
Raw uploads may be comma-separated values (CSV) count matrices, 10x Genomics outputs, or pre-built .h5ad files. The Convert action on the Data page turns supported formats into processed AnnData stored on disk.
Each dataset records observation count, feature count, modality, ingestion status, and any conversion errors.
Pipeline run
A run is one execution of an analysis step, such as quality control, clustering, differential expression, enrichment, spatial analysis, or biomarker discovery.
Each run stores:
- Input `dataset_id`
- Full JavaScript Object Notation (JSON) parameter record
- Status (`running`, `completed`, `failed`)
- Result payload or error message
- Who started the run and when
Runs chain through checkpoints. For example, normalization reads the QC output, clustering reads normalization, and downstream interpretation reads completed analysis results.
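A run record with the fields listed above might be modeled as in the sketch below. Only `dataset_id` and the three status values come from this page; the class, the other field names, and the helper functions are hypothetical and do not describe the app's actual schema.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class RunRecord:
    step: str                      # e.g. "clustering" (hypothetical field)
    dataset_id: str
    parameters_json: str           # full parameter record, serialized as JSON
    started_by: str
    started_at: str
    status: str = "running"        # "running", "completed", or "failed"
    result: Optional[dict] = None
    error: Optional[str] = None

    def finish(self, result=None, error=None):
        """Mark the run as failed if an error is given, otherwise as completed."""
        if error is not None:
            self.status, self.error = "failed", error
        else:
            self.status, self.result = "completed", result

def start_run(step, dataset_id, parameters, started_by):
    """Create a running record with a sorted-key JSON copy of the parameters."""
    return RunRecord(
        step=step,
        dataset_id=dataset_id,
        parameters_json=json.dumps(parameters, sort_keys=True),
        started_by=started_by,
        started_at=datetime.now(timezone.utc).isoformat(),
    )

run = start_run("clustering", "ds-001", {"resolution": 0.8, "n_neighbors": 15}, "analyst")
run.finish(result={"n_clusters": 7})
```

Serializing the parameters at start time means the record reflects what was actually requested, even if defaults change later.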
Snapshot
A snapshot freezes a study state so you can restore or compare analyses later. Snapshots store the set of completed runs, their parameters, and fingerprints of metadata, design, and contrasts. The app warns you when the study has changed since the snapshot was taken.
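One way to implement fingerprints is to hash a canonical JSON form of each table and compare hashes later. The sketch below assumes SHA-256 over sorted-key JSON and a hypothetical in-memory `study` dictionary; the app's real fingerprinting scheme is not documented here.

```python
import hashlib
import json

def fingerprint(value):
    """Hash a canonical JSON encoding so equal content always gives the same digest."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def take_snapshot(study):
    """Record completed runs with parameters, plus fingerprints of the editable tables."""
    return {
        "completed_runs": [
            {"id": run["id"], "parameters": dict(run["parameters"])}
            for run in study["runs"] if run["status"] == "completed"
        ],
        "fingerprints": {
            key: fingerprint(study[key]) for key in ("metadata", "design", "contrasts")
        },
    }

def changed_since(snapshot, study):
    """List the tables whose content no longer matches the snapshot."""
    return [
        key for key, digest in snapshot["fingerprints"].items()
        if fingerprint(study[key]) != digest
    ]

# Hypothetical study state.
study = {
    "metadata": [{"sample": "S1", "condition": "control"}],
    "design": {"formula": "~ condition + batch"},
    "contrasts": [{"name": "treated vs control"}],
    "runs": [
        {"id": 1, "status": "completed", "parameters": {"min_genes": 200}},
        {"id": 2, "status": "failed", "parameters": {"resolution": 0.8}},
    ],
}
snapshot = take_snapshot(study)
study["contrasts"].append({"name": "disease vs healthy"})
changes = changed_since(snapshot, study)
```

After the contrast is added, `changes` contains only `"contrasts"`, which is the kind of signal a warning about a changed study could be built on.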
Stale outputs
Results become stale when upstream data or parameters change. Examples include metadata edited after differential expression, quality control re-run after clustering, or contrasts changed after enrichment. The Runs page lists stale outputs with links to re-run affected steps.
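Staleness from re-running an upstream step can be sketched as a walk over step dependencies. The dependency map below is a hypothetical linear chain and the integer timestamps are illustrative; the app may track staleness differently, for example through fingerprints of inputs.

```python
# Hypothetical dependencies: each step lists the steps it reads from.
# Parents are listed before their children, which the function below relies on.
UPSTREAM = {
    "qc": [],
    "normalization": ["qc"],
    "clustering": ["normalization"],
    "differential_expression": ["clustering"],
    "enrichment": ["differential_expression"],
}

def stale_steps(finished_at, upstream=UPSTREAM):
    """Return completed steps that predate, or depend on, a more recent upstream run."""
    stale = []
    for step, parents in upstream.items():
        if step not in finished_at:
            continue  # never run, so there is no output to mark stale
        for parent in parents:
            if parent in stale or finished_at.get(parent, 0) > finished_at[step]:
                stale.append(step)
                break
    return stale

# QC was re-run (time 5) after clustering and differential expression had finished.
finished_at = {"qc": 5, "normalization": 2, "clustering": 3, "differential_expression": 4}
needs_rerun = stale_steps(finished_at)
```

Staleness propagates downstream: once normalization is stale, clustering and differential expression are stale too, even though their direct inputs were not re-run.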
Provenance
Every figure and interpretation should trace back to a specific run with recorded parameters. The Interpret -> Methods Provenance panel shows which runs produced the current results. Artificial intelligence (AI)-generated summaries, when enabled, attach to the same run records.
Interpretation scope
The app helps organize analysis and generate reproducible outputs, but the biological interpretation still depends on the study design, sample quality, assay limitations, and domain expertise. Treat computational outputs as evidence to evaluate, not automatic conclusions.