chore: benchmark suite — fix harness, one-folder consolidation, measurement integrity + GIL + unified runner by 27Bslash6 · Pull Request #188 · cachekit-io/cachekit-py

27Bslash6 · 2026-06-19T01:06:29Z

Summary

Turns the benchmark harness into a usable, trustworthy, one-folder perf system. Three commits:

Fix the broken harness: make benchmark and make benchmark-full pointed at a non-existent benchmarks.cli package (a prototype leftover), so both failed on invocation; tests/benchmarks/ was orphaned (its files never matched pytest discovery). Wired native pytest-benchmark with a median:10% regression gate.
Consolidate to one folder — all perf checks now live in tests/performance/. The guard-vs-tracker distinction is enforced by mechanism, not folders: pytest-benchmark tests are selected with --benchmark-only and skipped in normal runs via the --benchmark-skip default. Removes the earlier -o python_files/python_classes discovery hack.
Robust systems — measurement integrity + GIL scaling + a unified runner.

What's here

Folder consolidation

tests/benchmarks/benchmark_serializer_integrity.py → tests/performance/test_serializer_microbench.py (rename preserved). tests/benchmarks/ deleted.
File-scoped Redis no-op override lives in the microbench file (the folder has Redis-dependent tests, so it can't go in a folder conftest).
Dropped the orphaned Blake3OverheadAnalysis print-only methods.

Measurement integrity (measurement_env.py + conftest.py)

Stable system fingerprint (hash over deterministic fields only — excludes patch version / current freq).
Environment pre-flight: thermal-under-load, CPU, memory, loadavg (warn-only; thermal gated on load to avoid idle false-positives).
Timer self-calibration gate (test_measurement_calibration.py) — "is my instrument honest?" before trusting any number.
Reuses stats_utils; supersedes the duplicate fingerprint fixture in test_statistical_rigor.py.
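
The fingerprint and timer-calibration ideas above might look roughly like the following sketch. It is not the contents of measurement_env.py: the function names, the chosen fields, the 16-character digest, and the 10 µs threshold are all assumptions made for illustration, and "patch version" is interpreted here as the Python patch version.

```python
import hashlib
import json
import os
import platform
import sys
import time


def stable_fingerprint() -> str:
    """Hash only fields that should stay constant between runs on one machine."""
    fields = {
        "system": platform.system(),
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
        # major.minor only: the patch version is deliberately left out,
        # as is anything volatile such as the current CPU frequency.
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "implementation": platform.python_implementation(),
    }
    canonical = json.dumps(fields, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def timer_overhead_ns(samples: int = 10_000) -> float:
    """Median cost, in nanoseconds, of two back-to-back perf_counter_ns() calls."""
    deltas = []
    for _ in range(samples):
        start = time.perf_counter_ns()
        end = time.perf_counter_ns()
        deltas.append(end - start)
    deltas.sort()
    return float(deltas[len(deltas) // 2])


def timer_is_trustworthy(max_overhead_ns: float = 10_000.0) -> bool:
    # Hypothetical gate: if merely reading the clock costs more than the
    # threshold, microbenchmark numbers from this session are suspect.
    return timer_overhead_ns() <= max_overhead_ns
```

Hashing a canonical JSON dump with sorted keys keeps the fingerprint independent of dictionary ordering, so two runs on the same machine compare equal as long as the listed fields do.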

GIL scaling (gil_benchmark.py)

Thread-scaling of StandardSerializer.serialize under the current interpreter (speedup + efficiency%). On this machine it shows negative scaling (~100% → 11% efficiency) — serialize is GIL-bound today.
The no-GIL arm is interpreter-driven and ecosystem-blocked right now: PyO3 < 3.14 has no free-threaded support, and orjson / numpy / pandas / pyarrow lack free-threaded wheels. No code change needed once a free-threaded cachekit installs. (A separate PR will make orjson optional/lazy, the first of those blockers.)
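
The speedup and efficiency figures above can be computed along the lines of this sketch. It is an assumption-laden stand-in for gil_benchmark.py: `json.dumps` replaces StandardSerializer.serialize, and the function names, payload, and iteration counts are hypothetical. Efficiency is taken as speedup divided by thread count, expressed as a percentage.

```python
import json
import sys
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical payload; the real benchmark serializes cachekit data instead.
PAYLOAD = {"id": 1, "tags": ["a", "b", "c"], "values": list(range(200))}


def _worker(iterations: int) -> int:
    for _ in range(iterations):
        json.dumps(PAYLOAD)
    return iterations


def throughput(n_threads: int, iterations_per_thread: int) -> float:
    """Total operations per second with n_threads running concurrently."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        done = sum(pool.map(_worker, [iterations_per_thread] * n_threads))
    elapsed = time.perf_counter() - start
    return done / elapsed


def scaling_table(thread_counts=(1, 2, 4), iterations_per_thread=2_000):
    """Speedup and efficiency relative to the single-thread throughput."""
    baseline = throughput(1, iterations_per_thread)
    rows = []
    for n in thread_counts:
        ops = baseline if n == 1 else throughput(n, iterations_per_thread)
        speedup = ops / baseline
        rows.append(
            {"threads": n, "speedup": speedup, "efficiency_pct": 100.0 * speedup / n}
        )
    return rows


if __name__ == "__main__":
    # sys._is_gil_enabled() exists on Python 3.13+; older interpreters always have a GIL.
    gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()
    print(f"GIL enabled: {gil_enabled}")
    for row in scaling_table():
        print(f"{row['threads']:>2} threads  speedup {row['speedup']:.2f}x  "
              f"efficiency {row['efficiency_pct']:.0f}%")
```

On a GIL-bound workload, efficiency is expected to fall as threads are added, which is the shape the description reports. The same script would need no changes on a free-threaded interpreter, since only the interpreter differs.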

Unified runner (Makefile)

make perf — env fingerprint + timer calibration → serializer benchmarks → GIL scaling (informational).
make perf-compare — regression gate (median:10%).
make benchmark / benchmark-compare / benchmark-gil remain for the individual pieces.

Design notes

Median over mean for outlier robustness; the 10% threshold was chosen to sit above this suite's measured ~6% run-to-run median noise.
Not a CI gate, by design — baselines live in .benchmarks/ (gitignored, per-machine); wall-clock benchmarks deliberately don't gate CI (consistent with the existing memory-invariant-only CI policy).
Explicitly not a port of the prototype's bloat (D1 store, dashboards, ensemble-ML detectors, 91KB CLI).
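
The median-over-mean choice in the first design note can be shown with a small worked example. This sketches the arithmetic only, under the assumption that the gate compares the percentage change in the median against a threshold; it is not pytest-benchmark's implementation, and the timings are invented.

```python
from statistics import mean, median


def regression_pct(baseline: list[float], current: list[float], stat=median) -> float:
    """Percentage change of `stat` from the baseline run to the current run."""
    base = stat(baseline)
    return 100.0 * (stat(current) - base) / base


def gate_passes(baseline: list[float], current: list[float],
                threshold_pct: float = 10.0) -> bool:
    """True when the median regression stays at or below the threshold."""
    return regression_pct(baseline, current) <= threshold_pct


# Invented timings in milliseconds.
baseline = [1.00, 1.02, 0.98, 1.01, 0.99]
noisy = [1.05, 1.06, 1.04, 1.07, 5.00]  # ordinary noise plus one outlier
slow = [1.15, 1.16, 1.14, 1.17, 1.15]   # a genuine slowdown

print(f"noisy, median: {regression_pct(baseline, noisy):.1f}%")
print(f"noisy, mean:   {regression_pct(baseline, noisy, stat=mean):.1f}%")
print(f"slow,  median: {regression_pct(baseline, slow):.1f}%")
```

A single outlier pushes the mean-based figure far past 10%, while the median-based figure for the same run stays near the noise floor. The genuinely slower run exceeds the threshold on the median as well, so it is still caught.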

Verification

make perf → env+calibration PASS, 16 benchmarks (67 others correctly skipped), GIL table; exits 0.
make perf-compare → gate passes, propagates non-zero on regression.
Normal suite skips the benchmark-fixture tests via --benchmark-skip; the CI -m "performance and slow" memory step stays green with the new autouse conftest.
ruff check + format + basedpyright clean on all new/changed files.
Rebased onto current main (666d09c).

coderabbitai · 2026-06-19T01:06:44Z

Walkthrough

The PR replaces the previous python -m benchmarks.cli benchmark commands with pytest-benchmark-driven make benchmark and make benchmark-compare targets. A new BENCHMARK_PYTEST variable configures pytest to discover Benchmark* classes in tests/benchmarks/benchmark_*.py. A benchmark-scoped conftest fixture suppresses Redis setup. Developer documentation is updated to match the new commands.

Changes

pytest-benchmark Migration

Layer / File(s)	Summary
Makefile benchmark targets and conftest isolation fixture `Makefile`, `tests/benchmarks/conftest.py`	Introduces `BENCHMARK_PYTEST` variable targeting `tests/benchmarks/benchmark_*.py` with `Benchmark*` class restriction and GC disabled. The `benchmark` target runs with `--benchmark-autosave` to store a local baseline; the new `benchmark-compare` target re-runs against that baseline and fails on >10% median regression. Old `benchmark` (quick CLI) and `benchmark-full` targets are removed. A new `autouse` fixture in `tests/benchmarks/conftest.py` overrides the root Redis-isolation fixture with a no-op so benchmarks execute without a Redis backend.
Documentation updates `CONTRIBUTING.md`, `docs/getting-started.md`, `tests/performance/README.md`	Replaces references to the old `benchmarks.cli` and `make test-performance-quick` commands with `make benchmark`, `make benchmark-compare`, and the equivalent `uv run pytest` invocation. Documents that baselines are stored in `.benchmarks/` (gitignored) and that regression comparisons are a local developer tool only.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title accurately describes the main changes: fixing the broken benchmark harness, consolidating to one folder, and improving measurement integrity with GIL analysis and unified runner.
Description check	✅ Passed	The description is comprehensive, covering summary, motivation, what's included, design notes, and verification steps, though it lacks formal checklist completion as per the template.


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/benchmarks/conftest.py`:
- Around line 13-20: The function setup_di_for_redis_isolation() is missing a
return type hint annotation. Add an explicit return type hint of Generator[None,
None, None] to the function signature to satisfy the repository's typing
requirements for Python APIs. Since this is a pytest fixture that yields without
providing any value, this type annotation accurately represents the fixture's
behavior.
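
For reference, both annotations that come up for this fixture describe the same shape: a generator function that yields once without a value. The sketch below uses hypothetical function names and omits the `@pytest.fixture(autouse=True)` decorator that a real conftest override would carry, so that it runs without pytest installed.

```python
from collections.abc import Generator, Iterator


def noop_isolation_generator() -> Generator[None, None, None]:
    # Fully spelled out: yields None, accepts nothing via send(), returns None.
    yield


def noop_isolation_iterator() -> Iterator[None]:
    # Shorter form: only the yielded type is stated.
    yield


# Each generator produces exactly one None and then finishes, which is the
# setup/teardown shape a yield-style fixture relies on.
print(list(noop_isolation_generator()), list(noop_isolation_iterator()))
```

Type checkers accept either annotation for a generator that is only iterated, so the choice is mostly a matter of matching the repository's existing precedent.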

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 26581e89-2e4f-41d6-a4fa-b6731c939512

📥 Commits

Reviewing files that changed from the base of the PR and between c453288 and eac2dd9.

📒 Files selected for processing (5)

CONTRIBUTING.md
Makefile
docs/getting-started.md
tests/benchmarks/conftest.py
tests/performance/README.md

Add `-> Iterator[None]` to the setup_di_for_redis_isolation override, matching the existing precedent for the same fixture in tests/unit/test_wrapper_lock_bare_key.py. Addresses a PR #188 review note.

…chmark targets

`make benchmark`/`make benchmark-full` pointed at a `benchmarks.cli` package that does not exist in this repo (a leftover from the pyredis-cache-pro prototype), so both failed on invocation. tests/benchmarks/ was also orphaned: its files (benchmark_*.py / Benchmark* classes) never matched default pytest discovery, so the only pytest-benchmark suite in the repo ran nowhere.

- Makefile: `make benchmark` runs tests/benchmarks/ with --benchmark-autosave; new `make benchmark-compare` fails on >10% median regression vs the last saved baseline. Median over mean for outlier-robustness; 10% sits above the ~6% run-to-run median noise measured on this suite. Native --benchmark-save/--benchmark-compare-fail; no custom stats code.
- tests/benchmarks/conftest.py: no-op override of the root autouse Redis-isolation fixture (serializer benchmarks need no backend), mirroring tests/unit/conftest.py.
- Baselines live in .benchmarks/ (already gitignored, per-machine), so regression comparison is a local developer tool; wall-clock benchmarks deliberately do not gate CI.
- Fix doc drift still pointing at the removed benchmarks.cli: CONTRIBUTING.md, docs/getting-started.md, tests/performance/README.md.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Makefile`:
- Around line 380-387: The benchmark-compare target currently exceeds the
configured five-line checkmake limit with seven lines of logging and error
handling. Extract the error message logging and conditional output logic (the
echo statements for failure and success, and the exit 1) into a separate helper
target or shell script function, then simplify benchmark-compare to call this
helper after running the benchmark comparison command. This will reduce
benchmark-compare below the five-line threshold while preserving all the current
logging and error handling behavior.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2ee47dee-d797-4aac-acd6-d2a999d02e54

📥 Commits

Reviewing files that changed from the base of the PR and between 54e63f1 and 68a1df1.

📒 Files selected for processing (5)

CONTRIBUTING.md
Makefile
docs/getting-started.md
tests/benchmarks/conftest.py
tests/performance/README.md

coderabbitai · 2026-06-19T12:05:21Z

+benchmark-compare: setup-logs ## Run benchmarks, fail on >10% median regression vs last saved baseline
+	@echo "$(BLUE)Comparing against last saved baseline (run 'make benchmark' first)...$(RESET)"
+	@if ! $(BENCHMARK_PYTEST) --benchmark-compare --benchmark-compare-fail=median:10% 2>&1 | tee $(LOG_BENCHMARK_DIR)/bench_compare_$(TIMESTAMP).log; then \
+		echo "$(YELLOW)❌ benchmark-compare failed: median regression >10% vs baseline, or no baseline yet (run 'make benchmark' first).$(RESET)"; \
+		echo "$(YELLOW)   Threshold is tunable — set it above your machine's run-to-run median noise (~6% here).$(RESET)"; \
+		exit 1; \
+	fi
+	@echo "$(GREEN)✓ No median regression beyond 10% vs baseline$(RESET)"


🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Keep benchmark-compare under the Make lint limit.

This recipe is seven lines, so checkmake will flag it against the configured five-line max. Folding the logging/error handling into a helper target or script would keep the lint green.
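
One way to fold the logging and error handling into a script, as suggested above, is sketched below in Python. It is not the helper the repository adopted: the function names and the test path argument are hypothetical, and the actual contents of `BENCHMARK_PYTEST` are not reproduced. Only the `--benchmark-compare` and `--benchmark-compare-fail` flags are taken from the recipe.

```python
import subprocess
import sys


def build_compare_command(test_path: str, threshold_pct: int = 10) -> list[str]:
    """Assemble the pytest invocation for the median regression gate."""
    return [
        sys.executable, "-m", "pytest", test_path,
        "--benchmark-compare",
        f"--benchmark-compare-fail=median:{threshold_pct}%",
    ]


def run_compare(test_path: str, threshold_pct: int = 10) -> int:
    """Run the comparison, print a summary, and return pytest's exit code."""
    result = subprocess.run(build_compare_command(test_path, threshold_pct), check=False)
    if result.returncode != 0:
        print(
            f"benchmark-compare failed: median regression >{threshold_pct}% vs baseline, "
            "or no baseline yet (run 'make benchmark' first).",
            file=sys.stderr,
        )
    else:
        print(f"No median regression beyond {threshold_pct}% vs baseline")
    return result.returncode
```

With the messages living in the script, the Make recipe could shrink to a single line that invokes it and propagates the exit code, which would keep the target body under the lint limit.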

🧰 Tools

🪛 checkmake (0.3.2)

[warning] 380-380: Target body for "benchmark-compare" exceeds allowed length of 5 lines (7).

(maxbodylength)

Source: Linters/SAST tools

…fied runner

Builds on the benchmark harness (#188) with the robust-systems layer the pyredis-cache-pro prototype had but the rewrite lacked, and consolidates the two perf folders into one.

Folder consolidation: move tests/benchmarks/benchmark_serializer_integrity.py into tests/performance/test_serializer_microbench.py (Test* classes); delete tests/benchmarks/. Guard-vs-tracker is now separated by mechanism, not folder: pytest-benchmark tests are selected with --benchmark-only and skipped in normal runs via the --benchmark-skip default (pyproject addopts), dropping the prior -o python_files/python_classes discovery hack. A file-scoped Redis no-op override lives in the microbench file (tests/performance/ has Redis-dependent tests). Drops the orphaned Blake3OverheadAnalysis print-only methods.

Measurement integrity (measurement_env.py + conftest.py): stable system fingerprint (hash over deterministic fields only), environment pre-flight check (thermal-under-load / CPU / memory / loadavg), and a timer self-calibration gate (test_measurement_calibration.py), surfaced as a warn-only session banner. Reuses stats_utils; supersedes the duplicate fingerprint fixture in test_statistical_rigor.py.

GIL scaling (gil_benchmark.py): thread-scaling of StandardSerializer.serialize under the current interpreter (speedup + efficiency%). The no-GIL arm is interpreter-driven and currently ecosystem-blocked (PyO3<3.14; orjson/numpy/pandas/pyarrow lack free-threaded wheels); no code change needed once a free-threaded cachekit installs.

Unified runner (Makefile): `make perf` runs the battery (env+calibration → serializer benchmarks → GIL scaling); `make perf-compare` gates on >10% median regression.

Docs updated (CONTRIBUTING.md, tests/performance/README.md).

coderabbitai Bot requested changes Jun 19, 2026

Comment thread tests/benchmarks/conftest.py Outdated

coderabbitai Bot approved these changes Jun 19, 2026

27Bslash6 added 2 commits June 19, 2026 21:59

chore: annotate benchmark conftest fixture return type

68a1df1

Add `-> Iterator[None]` to the setup_di_for_redis_isolation override, matching the existing precedent for the same fixture in tests/unit/test_wrapper_lock_bare_key.py. Addresses a PR #188 review note.

27Bslash6 force-pushed the feat/benchmark-harness branch from 54e63f1 to 68a1df1 June 19, 2026 11:59

coderabbitai Bot requested changes Jun 19, 2026

27Bslash6 changed the title ~~chore: wire native pytest-benchmark regression gating; fix broken benchmark targets~~ chore: benchmark suite — fix harness, one-folder consolidation, measurement integrity + GIL + unified runner Jun 19, 2026

27Bslash6 mentioned this pull request Jun 19, 2026

chore: adopt checkmake + extract procedural Makefile targets to scripts/ #193

Merged

27Bslash6 wants to merge 3 commits into main from feat/benchmark-harness
