Python API¶
For the pytest flags, the benchmem marker, the benchmark_memory fixture, the blob
schema, and the benchmem CLI, see the Reference; for narrative
usage start with Getting started.
pytest_benchmem is light to import: it re-exports only the engine and the
readers/loader. pytest_benchmem.plotting pulls in plotly and pytest_benchmem.sweep
shells out to uv, so import those submodules directly when you need them.
Engine — pytest_benchmem¶
measure_peak ¶
Run action() under memray.Tracker and return peak bytes.
The bare one-liner for a REPL or notebook; measure_memory returns the
full result (allocation count, spread).
measure_memory ¶
Run action() under memray.Tracker repeats times and return a MemoryResult.
Each repeat gets a fresh tracker; the headline peak is the minimum across repeats
(see MemoryResult), and every repeat's Measurement is retained for
distribution stats.
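The repeat-and-take-the-minimum loop described above can be sketched without memray. The example below swaps in the standard library's tracemalloc as a stand-in tracker (it is not what the library uses), and peaks_over_repeats is a hypothetical helper name, not part of pytest_benchmem.

```python
import tracemalloc
from typing import Callable


def peaks_over_repeats(action: Callable[[], object], repeats: int = 3) -> list:
    """Run action() `repeats` times, each under a fresh tracemalloc session."""
    peaks = []
    for _ in range(repeats):
        tracemalloc.start()  # fresh tracking session for this repeat
        try:
            action()
            _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
        finally:
            tracemalloc.stop()  # discard traces so repeats stay independent
        peaks.append(peak)
    return peaks


peaks = peaks_over_repeats(lambda: bytearray(1_000_000), repeats=3)
headline_peak = min(peaks)  # the "floor" reported as the headline
worst_peak = max(peaks)     # kept so the spread stays visible
```

tracemalloc only sees allocations made through Python's allocators, so its numbers are not comparable to memray's; the sketch is only meant to show the shape of the loop.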
MemoryResult dataclass ¶
A memory measurement across repeats passes, derived from the per-repeat samples.
The per-repeat samples are the single source of truth: they are all the blob
stores (the series), and everything else is derived from them on read.
Peak memory is noisy (GC timing, lazy imports, page cache), so the
headline peak_bytes is the minimum peak, the cleanest "floor this can hit".
allocations and total_bytes come from that same representative
(min-peak) run, so the three numbers form a coherent snapshot. peak_bytes_max is
the worst peak, which keeps the spread visible.
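As a rough illustration of that derivation, the sketch below picks the min-peak repeat and reads every headline number from it. RepeatStats and derive_headline are hypothetical stand-ins written for this example; only the field names come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepeatStats:
    """Hypothetical stand-in for one repeat's numbers."""
    peak_bytes: int
    allocations: int
    total_bytes: int


def derive_headline(samples: list) -> dict:
    # The representative run is the repeat with the lowest peak.
    representative = min(samples, key=lambda s: s.peak_bytes)
    return {
        "peak_bytes": representative.peak_bytes,
        "peak_bytes_max": max(s.peak_bytes for s in samples),
        # Taken from the same run as the headline peak, not reduced separately.
        "allocations": representative.allocations,
        "total_bytes": representative.total_bytes,
    }


samples = [
    RepeatStats(1200, 10, 5000),
    RepeatStats(1000, 9, 4800),
    RepeatStats(1500, 12, 5200),
]
summary = derive_headline(samples)
```

Reading allocations and total_bytes from the min-peak run, rather than taking a separate minimum of each series, is what keeps the three headline values mutually consistent.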
representative property ¶
The min-peak run — the one the headline peak/allocations/total_bytes come from.
peak_bytes property ¶
The headline peak — the minimum high-water across repeats (the cleanest floor).
peak_bytes_max property ¶
The worst peak across repeats (equals peak_bytes with one repeat).
as_dict ¶
The JSON blob stored under pytest-benchmark extra_info["benchmem"].
Just the three per-repeat series, flat: no denormalized scalars and no
repeats count (it is the length of any series). Everything else derives on read.
from_blob classmethod ¶
Rebuild from a blob's per-repeat series (one column per entry in SERIES_FIELDS).
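The flat-series layout and its inverse can be sketched as a round trip between per-repeat rows and per-field columns. SERIES, rows_to_series and series_to_rows are hypothetical names for this example, and the three field names are an assumption based on the fields mentioned in this reference.

```python
# Assumed series names; the real tuple lives in SERIES_FIELDS.
SERIES = ("peak_bytes", "allocations", "total_bytes")


def rows_to_series(rows: list) -> dict:
    """Flatten per-repeat rows into one list per field (the stored shape)."""
    return {name: [row[name] for row in rows] for name in SERIES}


def series_to_rows(blob: dict) -> list:
    """Rebuild per-repeat rows by zipping the columns back together."""
    columns = [blob[name] for name in SERIES]
    return [dict(zip(SERIES, values)) for values in zip(*columns)]


rows = [
    {"peak_bytes": 1200, "allocations": 10, "total_bytes": 5000},
    {"peak_bytes": 1000, "allocations": 9, "total_bytes": 4800},
]
blob = rows_to_series(rows)
repeats = len(blob["peak_bytes"])  # no stored count; it is the series length
restored = series_to_rows(blob)
```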
Measurement dataclass ¶
One repeat's raw memray stats: peak high-water, allocation count, and total
bytes allocated (cumulative churn, including temporaries that the GC later frees).
Readers & loader — pytest_benchmem¶
from_pytest_benchmark reads timing (seconds, from stats);
memory_from_pytest_benchmark reads memory (bytes, from extra_info.benchmem).
load_samples is the unified reader, and load_long_df stacks runs into the tidy frame
every plot pivots.
from_pytest_benchmark ¶
Read timing out of a pytest-benchmark file → (label, samples, "s").
metric picks the stat (min / median / …). Dims come from each
benchmark's parametrize params and extra_info, plus the structural
node.* dims (see _node_dims).
memory_from_pytest_benchmark ¶
memory_from_pytest_benchmark(
path: str | Path,
*,
field: str = "peak_bytes",
reduce: Callable[[list[float]], float] | None = None,
) -> tuple[str, list[Sample], str]
Read memory out of a pytest-benchmark file → (label, samples, unit).
The benchmark_memory fixture stores each run's memory blob under
extra_info["benchmem"] (a flat per-repeat series per field), keyed by the same
benchmark id pytest-benchmark uses. field picks which series (peak_bytes →
unit B, allocations → count). Without reduce the headline scalar is
derived (peak = min, allocations/total_bytes = the min-peak run); pass reduce to
compute a distribution stat over the series instead. Benchmarks lacking the blob
(timing-only tests) are skipped. Dims come from parametrize params and
extra_info, plus the structural node.* dims (see _node_dims).
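To make the reading rules concrete, here is a minimal sketch that walks an already-parsed pytest-benchmark report, skips benchmarks without the blob, and applies either the headline rule or a reduce callable. read_series is a hypothetical helper and returns a plain dict rather than the (label, samples, unit) tuple; the "benchmarks", "fullname" and "extra_info" keys are the ones pytest-benchmark writes to its JSON.

```python
import statistics
from typing import Callable, Optional


def read_series(
    report: dict,
    field: str = "peak_bytes",
    reduce: Optional[Callable[[list], float]] = None,
) -> dict:
    values = {}
    for bench in report.get("benchmarks", []):
        blob = bench.get("extra_info", {}).get("benchmem")
        if blob is None:
            continue  # timing-only benchmark: no memory blob to read
        series = blob[field]
        if reduce is not None:
            values[bench["fullname"]] = reduce(series)
        else:
            # Headline: the value recorded in the min-peak repeat.
            peaks = blob["peak_bytes"]
            best = min(range(len(peaks)), key=peaks.__getitem__)
            values[bench["fullname"]] = series[best]
    return values


report = {
    "benchmarks": [
        {
            "fullname": "test_a.py::test_load[1000]",
            "extra_info": {
                "benchmem": {
                    "peak_bytes": [1200, 1000, 1500],
                    "allocations": [10, 9, 12],
                    "total_bytes": [5000, 4800, 5200],
                }
            },
        },
        {"fullname": "test_a.py::test_timing_only", "extra_info": {}},
    ]
}

headline = read_series(report)
alloc_headline = read_series(report, field="allocations")
median_peak = read_series(report, reduce=statistics.median)
```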
load_samples ¶
load_samples(
path: str | Path,
*,
metric: Metric = "time",
stat: str | None = None,
) -> tuple[str, list[Sample], str]
Read one pytest-benchmark file for the chosen metric → (label, samples, unit).
stat selects a distribution stat over the metric's per-repeat series (min /
max / mean / median / stddev); None reads the headline scalar.
For time it picks the pytest-benchmark stat (defaulting to min).
load_long_df ¶
load_long_df(
runs: str | Path | Sequence[str | Path],
*,
metric: Metric = "time",
stat: str | None = None,
labels: Sequence[str] | None = None,
) -> tuple[pd.DataFrame, str]
Stack pytest-benchmark files (one path or a sequence) into one long frame → (df, unit).
One row per (run, id) for the chosen metric. Columns: snapshot
(the series/version label), id, value, then one column per dim key seen
(missing dims are NaN). Every plot view pivots this frame.
labels overrides the snapshot label per run (one per path, same order),
decoupling the display name from the filename; defaults to each file's stem.
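The shape of the long frame can be sketched with plain dictionaries: one row per (run, id), with a column for every dim key seen in any run and NaN where a sample lacks that dim. stack_long and the input layout are hypothetical, chosen only for this example.

```python
import math


def stack_long(runs: dict) -> list:
    """runs maps a snapshot label to samples shaped {"id", "value", "dims"}."""
    dim_keys = sorted(
        {key for samples in runs.values() for s in samples for key in s["dims"]}
    )
    rows = []
    for snapshot, samples in runs.items():
        for s in samples:
            row = {"snapshot": snapshot, "id": s["id"], "value": s["value"]}
            for key in dim_keys:
                row[key] = s["dims"].get(key, math.nan)  # missing dim -> NaN
            rows.append(row)
    return rows


runs = {
    "v1": [{"id": "test_load[1000]", "value": 1.5, "dims": {"n": 1000}}],
    "v2": [
        {
            "id": "test_load[1000]",
            "value": 1.2,
            "dims": {"n": 1000, "backend": "fast"},
        }
    ],
}
rows = stack_long(runs)
```

Passing such a list of row dictionaries to pandas.DataFrame would give a frame with the same columns, which is the form a pivot expects.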
discover_runs ¶
Return pytest-benchmark JSON files under root (for CLI suggestions).
human_bytes ¶
Auto-scale a byte count to a short IEC string: 932 B, 4.1 MiB, 2.3 GiB.
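IEC scaling divides by 1024 per step rather than 1000. A minimal sketch of that idea follows; iec_string is a hypothetical re-implementation for illustration, and its exact rounding may differ from human_bytes.

```python
def iec_string(n: float) -> str:
    """Scale a byte count to the largest IEC unit that keeps the value below 1024."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(n)
    for unit in units:
        if abs(value) < 1024 or unit == units[-1]:
            break
        value /= 1024
    if unit == "B":
        return f"{int(value)} B"  # whole bytes, no decimals
    return f"{value:.1f} {unit}"


small = iec_string(932)
medium = iec_string(4_299_161)      # about 4.1 * 1024**2
large = iec_string(2_469_606_195)   # about 2.3 * 1024**3
```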
Sample ¶
Bases: NamedTuple
One measured result: an opaque id, a value, and analysis dims.
Plotting — pytest_benchmem.plotting¶
Every plot_* returns (figure, n_ids). snapshots is a list of run JSON paths;
labels names the series per run (defaults to the file stems) — the API behind plot's
-l/--label.
plot_scaling ¶
plot_scaling(
snapshots: Snapshots,
*,
metric: Metric = "time",
x: str | None = None,
color: str | None = None,
facet: str | None = None,
log: bool | Literal["auto"] = "auto",
labels: Sequence[str] | None = None,
) -> tuple[Figure, int]
Cost vs a numeric dim, coloured/faceted by other dims.
x/color/facet default to inference from the dims (the lone numeric
dim → x); pass them to override. log="auto" log-scales when x is numeric
and strictly positive. labels names the snapshot in the title (defaults to
the file stem).
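The two inference rules stated above (the lone numeric dim becomes x, and log="auto" requires numeric, strictly positive values) might look roughly like this. infer_x and use_log_axis are hypothetical helpers, not the library's internals, and the behaviour with zero or several numeric dims is an assumption.

```python
from numbers import Real
from typing import Optional


def is_numeric(values: list) -> bool:
    # bool is a Real subclass, so exclude it explicitly.
    return bool(values) and all(
        isinstance(v, Real) and not isinstance(v, bool) for v in values
    )


def infer_x(dims: dict) -> Optional[str]:
    """Return the only numeric dim, or None when there is no unique choice."""
    numeric = [name for name, values in dims.items() if is_numeric(values)]
    return numeric[0] if len(numeric) == 1 else None


def use_log_axis(values: list) -> bool:
    """Log-scale only when every x value is numeric and strictly positive."""
    return is_numeric(values) and all(v > 0 for v in values)


dims = {"n": [10, 100, 1000], "backend": ["a", "b", "a"]}
x = infer_x(dims)
log_x = use_log_axis(dims["n"])
```

Requiring strictly positive values matters because a log axis cannot place zero or negative points.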
plot_scatter ¶
plot_scatter(
snapshots: Snapshots,
*,
metric: Metric = "time",
facet: str | None = None,
clip: float | None = None,
labels: Sequence[str] | None = None,
) -> tuple[Figure, int]
Baseline cost (log-x) vs candidate/baseline ratio (log-y).
The top-right corner holds benchmarks that were already slow and got slower
(the regressed corner). The first snapshot is the baseline; with 3+, the rest
animate. Colour encodes the absolute Δ; clip clamps it (default p95). facet
splits by any dim. labels names the snapshots (defaults to the file stems).
plot_compare ¶
plot_compare(
snapshots: Snapshots,
*,
metric: Metric = "time",
sort: SortMode = "absolute",
facet: str | None = None,
clip: float | None = None,
labels: Sequence[str] | None = None,
) -> tuple[Figure, int]
Bar chart of per-id delta, sorted by the chosen Δ (biggest regressions on top).
sort picks the bar dimension: absolute plots b - a in the native
unit, relative plots percent change. facet splits into subplots by any
dim. clip clamps the colour scale (default symmetric p95). labels names
the two series (defaults to the file stems).
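The two sort modes can rank the same benchmarks differently, which a small worked example makes visible. The sketch below computes b - a and the percent change per id and sorts with the largest regression first; deltas is a hypothetical helper and the numbers are made up for illustration.

```python
def deltas(a: dict, b: dict, sort: str = "absolute") -> list:
    """Return (id, absolute, relative_percent) rows, biggest regression first."""
    rows = []
    for bench_id in a.keys() & b.keys():  # only ids present in both runs
        absolute = b[bench_id] - a[bench_id]
        relative = 100.0 * absolute / a[bench_id]  # assumes a non-zero baseline
        rows.append((bench_id, absolute, relative))
    column = 1 if sort == "absolute" else 2
    return sorted(rows, key=lambda row: row[column], reverse=True)


baseline = {"load": 2.0, "parse": 0.5, "dump": 1.0}
candidate = {"load": 3.0, "parse": 1.0, "dump": 0.75}

by_absolute = [row[0] for row in deltas(baseline, candidate, "absolute")]
by_relative = [row[0] for row in deltas(baseline, candidate, "relative")]
```

Here load gains the most in native units (+1.0) while parse doubles (+100%), so each mode puts a different benchmark on top.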
plot_sweep ¶
plot_sweep(
snapshots: Snapshots,
*,
metric: Metric = "time",
clip: float | None = None,
labels: Sequence[str] | None = None,
) -> tuple[Figure, int]
Heatmap of per-id fold-change (log2 ratio) vs the first snapshot.
labels names the version columns (defaults to the file stems).
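A log2 fold-change is symmetric around zero: doubling gives +1 and halving gives -1, which is why it suits a diverging heatmap. A minimal sketch of the per-id computation, with a hypothetical helper name and made-up values:

```python
import math


def log2_fold_changes(baseline: dict, candidate: dict) -> dict:
    """log2(candidate / baseline) for every id present in both runs."""
    return {
        bench_id: math.log2(candidate[bench_id] / baseline[bench_id])
        for bench_id in baseline
        if bench_id in candidate
    }


baseline = {"load": 2.0, "parse": 0.5, "dump": 1.0}
candidate = {"load": 4.0, "parse": 0.25, "dump": 1.0}
folds = log2_fold_changes(baseline, candidate)
```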
Sweeps — pytest_benchmem.sweep¶
See Cross-version sweeps for the narrative walkthrough and the Venv the
run callback receives.
sweep ¶
sweep(
versions: Sequence[str],
run: Callable[[Venv], None],
**provision_kwargs: object,
) -> list[str]
Provision a venv per version and call run(venv) in each.
run does whatever the consumer needs (invoke pytest / a memory command
with venv.python and cwd=venv.cwd). Returns the list of versions
that failed to provision.
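A run callback might build its pytest invocation from the venv it receives, roughly as sketched below. VenvSketch is a hypothetical stand-in: only python and cwd are named in this reference, and the version attribute is an assumption made so the output file can be named per version. --benchmark-json is pytest-benchmark's own flag.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path


@dataclass
class VenvSketch:
    """Hypothetical stand-in for the object passed to run()."""
    python: Path
    cwd: Path
    version: str  # assumed attribute, not confirmed by this reference


def pytest_command(venv, out_dir: Path) -> list:
    """Build the pytest invocation that writes one JSON file per version."""
    target = out_dir / f"{venv.version}.json"
    return [str(venv.python), "-m", "pytest", f"--benchmark-json={target}"]


def run(venv) -> None:
    # Use the venv's interpreter and working directory, as described above.
    subprocess.run(pytest_command(venv, Path("runs")), cwd=venv.cwd, check=False)


example = VenvSketch(Path("/tmp/v/bin/python"), Path("/tmp/v/work"), "1.2")
command = pytest_command(example, Path("runs"))
```

A callback of this shape would be what gets passed as the run argument of sweep.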
provision ¶
provision(
versions: Sequence[str],
*,
install_spec: Callable[[str], str] = lambda v: v,
pins: Sequence[str] = (),
copy_dir: str | Path | None = None,
import_check: str | None = None,
as_of: str | None = None,
tmp_prefix: str = "pytest-benchmem-",
) -> Iterator[Venv]
Yield one fresh uv venv per version.
install_spec(v) → the pip spec for the package under test (default:
v verbatim, so "pkg==1.2" works). pins are extra specs installed
alongside. copy_dir is copied into each venv's cwd for import isolation.
import_check is a module name asserted to resolve to the venv (preflight).
as_of (YYYY-MM-DD) freezes the whole resolution via --exclude-newer.
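To show how these parameters relate, the sketch below assembles the package arguments for one version. install_args is a hypothetical helper and the way it orders the arguments is an assumption; it does not reproduce the command the library actually runs. --exclude-newer is a uv option that accepts a date.

```python
from typing import Callable, Optional, Sequence


def install_args(
    version: str,
    install_spec: Callable[[str], str] = lambda v: v,
    pins: Sequence[str] = (),
    as_of: Optional[str] = None,
) -> list:
    """Assumed layout: optional date freeze, then the spec, then extra pins."""
    args = [install_spec(version), *pins]
    if as_of is not None:
        # Ignore any distribution published after this date.
        args = ["--exclude-newer", as_of, *args]
    return args


verbatim = install_args("pkg==1.2")
templated = install_args(
    "1.2",
    install_spec=lambda v: f"pkg=={v}",
    pins=["numpy<2"],
    as_of="2024-01-01",
)
```

Passing an install_spec callable lets the versions list stay as bare version strings, while the default keeps full specs such as "pkg==1.2" working unchanged.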