Skip to content

Python API

For the pytest flags, the benchmem marker, the benchmark_memory fixture, the blob schema, and the benchmem CLI, see the Reference; for narrative usage start with Getting started.

pytest_benchmem is light to import — it re-exports only the engine and the readers/loader. pytest_benchmem.plotting pulls plotly and pytest_benchmem.sweep shells out to uv, so import those submodules directly.

Engine — pytest_benchmem

measure_peak

measure_peak(action: Action, repeats: int = 1) -> int

Run action() under memray.Tracker and return peak bytes.

The bare one-liner for a REPL or notebook; :func:measure_memory returns the full result (allocation count, spread).

measure_memory

measure_memory(
    action: Action, repeats: int = 1
) -> MemoryResult

Run action() under memray.Tracker repeats times → :class:MemoryResult.

Each repeat gets a fresh tracker; the headline peak is the minimum across repeats (see :class:MemoryResult), and every repeat's :class:Measurement is retained for distribution stats.

MemoryResult dataclass

A memory measurement across repeats passes, derived from the per-repeat samples.

The per-repeat :attr:samples are the single source of truth — that's all the blob stores (the series); everything else is derived from them on read.

Peak memory is noisier than expected (GC timing, lazy imports, page cache), so the headline :attr:peak_bytes is the minimum peak — the cleanest "floor this can hit" — and :attr:allocations / :attr:total_bytes come from that same representative (min-peak) run, a coherent snapshot. :attr:peak_bytes_max is the worst peak, so the spread is visible.

repeats property

repeats: int

How many passes were measured.

representative property

representative: Measurement

The min-peak run — the one the headline peak/allocations/total_bytes come from.

peak_bytes property

peak_bytes: int

The headline peak — the minimum high-water across repeats (the cleanest floor).

peak_bytes_max property

peak_bytes_max: int

The worst peak across repeats (equals :attr:peak_bytes with one repeat).

allocations property

allocations: int

Allocation count from the representative (min-peak) run.

total_bytes property

total_bytes: int

Total bytes allocated by the representative (min-peak) run.

series

series(field: str) -> list[int]

The per-repeat values of one :data:SERIES_FIELDS field.

as_dict

as_dict() -> dict[str, Any]

The JSON blob stored under pytest-benchmark extra_info["benchmem"].

Just the three per-repeat series, flat — no denormalized scalars and no repeats (it's len of any series). Everything else derives on read.

from_blob classmethod

from_blob(blob: Mapping[str, Any]) -> MemoryResult

Rebuild from a blob's per-repeat series (one column per :data:SERIES_FIELDS).

Measurement dataclass

One repeat's raw memray stats numbers — peak high-water, allocation count, and total bytes allocated (cumulative churn, incl. temporaries GC later frees).

Readers & loader — pytest_benchmem

from_pytest_benchmark reads timing (seconds, from stats); memory_from_pytest_benchmark reads memory (bytes, from extra_info.benchmem). load_samples is the unified reader, and load_long_df stacks runs into the tidy frame every plot pivots.

from_pytest_benchmark

from_pytest_benchmark(
    path: str | Path, *, metric: str = "min"
) -> tuple[str, list[Sample], str]

Read timing out of a pytest-benchmark file → (label, samples, "s").

metric picks the stat (min / median / …). Dims come from each benchmark's parametrize params and extra_info, plus the structural node.* dims (see :func:_node_dims).

memory_from_pytest_benchmark

memory_from_pytest_benchmark(
    path: str | Path,
    *,
    field: str = "peak_bytes",
    reduce: Callable[[list[float]], float] | None = None,
) -> tuple[str, list[Sample], str]

Read memory out of a pytest-benchmark file → (label, samples, unit).

The benchmark_memory fixture stores each run's memory blob under extra_info["benchmem"] (a flat per-repeat series per field), keyed by the same benchmark id pytest-benchmark uses. field picks which series (peak_bytes → unit B, allocations → count). Without reduce the headline scalar is derived (peak = min, allocations/total_bytes = the min-peak run); pass reduce to compute a distribution stat over the series instead. Benchmarks lacking the blob (timing-only tests) are skipped. Dims come from parametrize params and extra_info, plus the structural node.* dims (see :func:_node_dims).

load_samples

load_samples(
    path: str | Path,
    *,
    metric: Metric = "time",
    stat: str | None = None,
) -> tuple[str, list[Sample], str]

Read one pytest-benchmark file for the chosen metric(label, samples, unit).

stat selects a distribution stat over the metric's per-repeat series (min / max / mean / median / stddev); None reads the headline scalar. For time it picks the pytest-benchmark stat (defaulting to min).

load_long_df

load_long_df(
    runs: str | Path | Sequence[str | Path],
    *,
    metric: Metric = "time",
    stat: str | None = None,
    labels: Sequence[str] | None = None,
) -> tuple[pd.DataFrame, str]

Stack pytest-benchmark files (one path or a sequence) into one long frame → (df, unit).

One row per (run, id) for the chosen metric. Columns: snapshot (the series/version label), id, value, then one column per dim key seen (missing dims are NaN). Every plot view pivots this frame.

labels overrides the snapshot label per run (one per path, same order), decoupling the display name from the filename; defaults to each file's stem.

discover_runs

discover_runs(
    root: str | Path = ".benchmarks",
) -> list[Path]

Return pytest-benchmark JSON files under root (for CLI suggestions).

human_bytes

human_bytes(n: float) -> str

Auto-scale a byte count to a short IEC string: 932 B, 4.1 MiB, 2.3 GiB.

Sample

Bases: NamedTuple

One measured result: an opaque id, a value, and analysis dims.

Plotting — pytest_benchmem.plotting

Every plot_* returns (figure, n_ids). snapshots is a list of run JSON paths; labels names the series per run (defaults to the file stems) — the API behind plot's -l/--label.

plot_scaling

plot_scaling(
    snapshots: Snapshots,
    *,
    metric: Metric = "time",
    x: str | None = None,
    color: str | None = None,
    facet: str | None = None,
    log: bool | Literal["auto"] = "auto",
    labels: Sequence[str] | None = None,
) -> tuple[Figure, int]

Cost vs a numeric dim, coloured/faceted by other dims.

x/color/facet default to inference from the dims (the lone numeric dim → x); pass them to override. log="auto" log-scales when x is numeric and strictly positive. labels names the snapshot in the title (defaults to the file stem).

plot_scatter

plot_scatter(
    snapshots: Snapshots,
    *,
    metric: Metric = "time",
    facet: str | None = None,
    clip: float | None = None,
    labels: Sequence[str] | None = None,
) -> tuple[Figure, int]

Baseline cost (log-x) vs candidate/baseline ratio (log-y).

Top-right = slow and slower (the regressed corner). The first snapshot is the baseline; with 3+, the rest animate. Colour encodes absolute Δ; clip clamps it (default p95). facet splits by any dim. labels names the snapshots (defaults to the file stems).

plot_compare

plot_compare(
    snapshots: Snapshots,
    *,
    metric: Metric = "time",
    sort: SortMode = "absolute",
    facet: str | None = None,
    clip: float | None = None,
    labels: Sequence[str] | None = None,
) -> tuple[Figure, int]

Bar chart of per-id delta, sorted by the chosen Δ (biggest regressions on top).

sort picks the bar dimension: absolute plots b - a in the native unit, relative plots percent change. facet splits into subplots by any dim. clip clamps the colour scale (default symmetric p95). labels names the two series (defaults to the file stems).

plot_sweep

plot_sweep(
    snapshots: Snapshots,
    *,
    metric: Metric = "time",
    clip: float | None = None,
    labels: Sequence[str] | None = None,
) -> tuple[Figure, int]

Heatmap of per-id fold-change (log2 ratio) vs the first snapshot.

labels names the version columns (defaults to the file stems).

Sweeps — pytest_benchmem.sweep

See Cross-version sweeps for the narrative walkthrough and the Venv the run callback receives.

sweep

sweep(
    versions: Sequence[str],
    run: Callable[[Venv], None],
    **provision_kwargs: object,
) -> list[str]

Provision a venv per version and call run(venv) in each.

run does whatever the consumer needs (invoke pytest / a memory command with venv.python and cwd=venv.cwd). Returns the list of versions that failed to provision.

provision

provision(
    versions: Sequence[str],
    *,
    install_spec: Callable[[str], str] = lambda v: v,
    pins: Sequence[str] = (),
    copy_dir: str | Path | None = None,
    import_check: str | None = None,
    as_of: str | None = None,
    tmp_prefix: str = "pytest-benchmem-",
) -> Iterator[Venv]

Yield one fresh uv venv per version.

install_spec(v) → the pip spec for the package under test (default: v verbatim, so "pkg==1.2" works). pins are extra specs installed alongside. copy_dir is copied into each venv's cwd for import isolation. import_check is a module name asserted to resolve to the venv (preflight). as_of (YYYY-MM-DD) freezes the whole resolution via --exclude-newer.

Venv dataclass

One provisioned per-version venv (or a failure).

failure classmethod

failure(version: str, stage: str) -> Venv

A provisioning failure at stage — no python/env/cwd, just the reason.