Python API¶

For the pytest flags, the benchmem marker, the benchmark_memory fixture, the blob schema, and the benchmem CLI, see the Reference; for narrative usage start with Getting started.

pytest_benchmem is light to import — it re-exports only the engine and the readers/loader. pytest_benchmem.plotting pulls plotly and pytest_benchmem.sweep shells out to uv, so import those submodules directly.

Engine — `pytest_benchmem`¶

measure_peak ¶

measure_peak(action: Action, repeats: int = 1) -> int

Run action() under memray.Tracker and return peak bytes.

The bare one-liner for a REPL or notebook; :func:measure_memory returns the full result (allocation count, spread).

measure_memory ¶

measure_memory(
    action: Action, repeats: int = 1
) -> MemoryResult

Run action() under memray.Tracker repeats times → :class:MemoryResult.

Each repeat gets a fresh tracker; the headline peak is the minimum across repeats (see :class:MemoryResult), and every repeat's :class:Measurement is retained for distribution stats.

MemoryResult `dataclass` ¶

A memory measurement across repeats passes, derived from the per-repeat samples.

The per-repeat :attr:samples are the single source of truth — that's all the blob stores (the series); everything else is derived from them on read.

Peak memory is noisier than expected (GC timing, lazy imports, page cache), so the headline :attr:peak_bytes is the minimum peak — the cleanest "floor this can hit" — and :attr:allocations / :attr:total_bytes come from that same representative (min-peak) run, a coherent snapshot. :attr:peak_bytes_max is the worst peak, so the spread is visible.

repeats `property` ¶

repeats: int

How many passes were measured.

representative `property` ¶

representative: Measurement

The min-peak run — the one the headline peak/allocations/total_bytes come from.

peak_bytes `property` ¶

peak_bytes: int

The headline peak — the minimum high-water across repeats (the cleanest floor).

peak_bytes_max `property` ¶

peak_bytes_max: int

The worst peak across repeats (equals :attr:peak_bytes with one repeat).

allocations `property` ¶

allocations: int

Allocation count from the representative (min-peak) run.

total_bytes `property` ¶

total_bytes: int

Total bytes allocated by the representative (min-peak) run.

series ¶

series(field: str) -> list[int]

The per-repeat values of one :data:SERIES_FIELDS field.

as_dict ¶

as_dict() -> dict[str, Any]

The JSON blob stored under pytest-benchmark extra_info["benchmem"].

Just the three per-repeat series, flat — no denormalized scalars and no repeats (it's len of any series). Everything else derives on read.

from_blob `classmethod` ¶

from_blob(blob: Mapping[str, Any]) -> MemoryResult

Rebuild from a blob's per-repeat series (one column per :data:SERIES_FIELDS).

Measurement `dataclass` ¶

One repeat's raw memray stats numbers — peak high-water, allocation count, and total bytes allocated (cumulative churn, incl. temporaries GC later frees).

Readers & loader — `pytest_benchmem`¶

from_pytest_benchmark reads timing (seconds, from stats); memory_from_pytest_benchmark reads memory (bytes, from extra_info.benchmem). load_samples is the unified reader, and load_long_df stacks runs into the tidy frame every plot pivots.

from_pytest_benchmark ¶

from_pytest_benchmark(
    path: str | Path, *, metric: str = "min"
) -> tuple[str, list[Sample], str]

Read timing out of a pytest-benchmark file → (label, samples, "s").

metric picks the stat (min / median / …). Dims come from each benchmark's parametrize params and extra_info, plus the structural node.* dims (see :func:_node_dims).

memory_from_pytest_benchmark ¶

memory_from_pytest_benchmark(
    path: str | Path,
    *,
    field: str = "peak_bytes",
    reduce: Callable[[list[float]], float] | None = None,
) -> tuple[str, list[Sample], str]

Read memory out of a pytest-benchmark file → (label, samples, unit).

The benchmark_memory fixture stores each run's memory blob under extra_info["benchmem"] (a flat per-repeat series per field), keyed by the same benchmark id pytest-benchmark uses. field picks which series (peak_bytes → unit B, allocations → count). Without reduce the headline scalar is derived (peak = min, allocations/total_bytes = the min-peak run); pass reduce to compute a distribution stat over the series instead. Benchmarks lacking the blob (timing-only tests) are skipped. Dims come from parametrize params and extra_info, plus the structural node.* dims (see :func:_node_dims).

load_samples ¶

load_samples(
    path: str | Path,
    *,
    metric: Metric = "time",
    stat: str | None = None,
) -> tuple[str, list[Sample], str]

Read one pytest-benchmark file for the chosen metric → (label, samples, unit).

stat selects a distribution stat over the metric's per-repeat series (min / max / mean / median / stddev); None reads the headline scalar. For time it picks the pytest-benchmark stat (defaulting to min).

load_long_df ¶

load_long_df(
    runs: str | Path | Sequence[str | Path],
    *,
    metric: Metric = "time",
    stat: str | None = None,
    labels: Sequence[str] | None = None,
) -> tuple[pd.DataFrame, str]

Stack pytest-benchmark files (one path or a sequence) into one long frame → (df, unit).

One row per (run, id) for the chosen metric. Columns: snapshot (the series/version label), id, value, then one column per dim key seen (missing dims are NaN). Every plot view pivots this frame.

labels overrides the snapshot label per run (one per path, same order), decoupling the display name from the filename; defaults to each file's stem.

discover_runs ¶

discover_runs(
    root: str | Path = ".benchmarks",
) -> list[Path]

Return pytest-benchmark JSON files under root (for CLI suggestions).

human_bytes ¶

human_bytes(n: float) -> str

Auto-scale a byte count to a short IEC string: 932 B, 4.1 MiB, 2.3 GiB.

Sample ¶

Bases: NamedTuple

One measured result: an opaque id, a value, and analysis dims.

Plotting — `pytest_benchmem.plotting`¶

Every plot_* returns (figure, n_ids). snapshots is a list of run JSON paths; labels names the series per run (defaults to the file stems) — the API behind plot's -l/--label.

plot_scaling ¶

plot_scaling(
    snapshots: Snapshots,
    *,
    metric: Metric = "time",
    x: str | None = None,
    color: str | None = None,
    facet: str | None = None,
    log: bool | Literal["auto"] = "auto",
    labels: Sequence[str] | None = None,
) -> tuple[Figure, int]

Cost vs a numeric dim, coloured/faceted by other dims.

x/color/facet default to inference from the dims (the lone numeric dim → x); pass them to override. log="auto" log-scales when x is numeric and strictly positive. labels names the snapshot in the title (defaults to the file stem).

plot_scatter ¶

plot_scatter(
    snapshots: Snapshots,
    *,
    metric: Metric = "time",
    facet: str | None = None,
    clip: float | None = None,
    labels: Sequence[str] | None = None,
) -> tuple[Figure, int]

Baseline cost (log-x) vs candidate/baseline ratio (log-y).

Top-right = slow and slower (the regressed corner). The first snapshot is the baseline; with 3+, the rest animate. Colour encodes absolute Δ; clip clamps it (default p95). facet splits by any dim. labels names the snapshots (defaults to the file stems).

plot_compare ¶

plot_compare(
    snapshots: Snapshots,
    *,
    metric: Metric = "time",
    sort: SortMode = "absolute",
    facet: str | None = None,
    clip: float | None = None,
    labels: Sequence[str] | None = None,
) -> tuple[Figure, int]

Bar chart of per-id delta, sorted by the chosen Δ (biggest regressions on top).

sort picks the bar dimension: absolute plots b - a in the native unit, relative plots percent change. facet splits into subplots by any dim. clip clamps the colour scale (default symmetric p95). labels names the two series (defaults to the file stems).

plot_sweep ¶

plot_sweep(
    snapshots: Snapshots,
    *,
    metric: Metric = "time",
    clip: float | None = None,
    labels: Sequence[str] | None = None,
) -> tuple[Figure, int]

Heatmap of per-id fold-change (log2 ratio) vs the first snapshot.

labels names the version columns (defaults to the file stems).

Sweeps — `pytest_benchmem.sweep`¶

See Cross-version sweeps for the narrative walkthrough and the Venv the run callback receives.

sweep ¶

sweep(
    versions: Sequence[str],
    run: Callable[[Venv], None],
    **provision_kwargs: object,
) -> list[str]

Provision a venv per version and call run(venv) in each.

run does whatever the consumer needs (invoke pytest / a memory command with venv.python and cwd=venv.cwd). Returns the list of versions that failed to provision.

provision ¶

provision(
    versions: Sequence[str],
    *,
    install_spec: Callable[[str], str] = lambda v: v,
    pins: Sequence[str] = (),
    copy_dir: str | Path | None = None,
    import_check: str | None = None,
    as_of: str | None = None,
    tmp_prefix: str = "pytest-benchmem-",
) -> Iterator[Venv]

Yield one fresh uv venv per version.

install_spec(v) → the pip spec for the package under test (default: v verbatim, so "pkg==1.2" works). pins are extra specs installed alongside. copy_dir is copied into each venv's cwd for import isolation. import_check is a module name asserted to resolve to the venv (preflight). as_of (YYYY-MM-DD) freezes the whole resolution via --exclude-newer.

Venv `dataclass` ¶

One provisioned per-version venv (or a failure).

failure `classmethod` ¶

failure(version: str, stage: str) -> Venv

A provisioning failure at stage — no python/env/cwd, just the reason.

Python API¶

Engine — pytest_benchmem¶

measure_peak ¶

measure_memory ¶

MemoryResult dataclass ¶

repeats property ¶

representative property ¶

peak_bytes property ¶

peak_bytes_max property ¶

allocations property ¶

total_bytes property ¶