
chore: benchmark suite — fix harness, one-folder consolidation, measurement integrity + GIL + unified runner #188

Open
27Bslash6 wants to merge 3 commits into main from feat/benchmark-harness

Conversation

27Bslash6 commented Jun 19, 2026

Contributor

Summary

Turns the benchmark harness into a usable, trustworthy, one-folder perf system. Three commits:

  1. Fix the broken harness: make benchmark/benchmark-full pointed at a non-existent benchmarks.cli (prototype leftover), and tests/benchmarks/ was orphaned (it never matched pytest discovery). Wired in native pytest-benchmark with a median:10% regression gate.
  2. Consolidate to one folder — all perf checks now live in tests/performance/. The guard-vs-tracker distinction is enforced by mechanism, not folders: pytest-benchmark tests are selected with --benchmark-only and skipped in normal runs via the --benchmark-skip default. Removes the earlier -o python_files/python_classes discovery hack.
  3. Robust systems — measurement integrity + GIL scaling + a unified runner.

What's here

Folder consolidation

  • tests/benchmarks/benchmark_serializer_integrity.py → tests/performance/test_serializer_microbench.py (rename preserved). tests/benchmarks/ deleted.
  • File-scoped Redis no-op override lives in the microbench file (the folder has Redis-dependent tests, so it can't go in a folder conftest).
  • Dropped the orphaned Blake3OverheadAnalysis print-only methods.

Measurement integrity (measurement_env.py + conftest.py)

  • Stable system fingerprint (hash over deterministic fields only — excludes patch version / current freq).
  • Environment pre-flight: thermal-under-load, CPU, memory, loadavg (warn-only; thermal gated on load to avoid idle false-positives).
  • Timer self-calibration gate (test_measurement_calibration.py) — "is my instrument honest?" before trusting any number.
  • Reuses stats_utils; supersedes the duplicate fingerprint fixture in test_statistical_rigor.py.
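
As an illustration of the "hash over deterministic fields only" idea, here is a minimal sketch. It is not the code in measurement_env.py: the function names and the chosen fields are assumptions made for this example. It drops the Python patch version and never reads the current CPU frequency, so routine updates or frequency scaling should not change the identifier.

```python
import hashlib
import json
import os
import platform


def collect_stable_fields() -> dict[str, str]:
    """Gather only fields expected to stay constant between runs on one machine.

    The field set is hypothetical; the real harness may use different ones.
    """
    major, minor, _patch = platform.python_version_tuple()
    return {
        "machine": platform.machine(),
        "system": platform.system(),
        "python_impl": platform.python_implementation(),
        # Patch version is deliberately dropped so a patch upgrade keeps the same id.
        "python_major_minor": f"{major}.{minor}",
        "cpu_count": str(os.cpu_count() or 0),
    }


def fingerprint(fields: dict[str, str]) -> str:
    """Hash the fields in a canonical (sorted-key) JSON encoding."""
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


if __name__ == "__main__":
    print(fingerprint(collect_stable_fields()))
```

Sorting the keys before hashing means the fingerprint depends only on the field values, not on the order in which they were collected.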

GIL scaling (gil_benchmark.py)

  • Thread-scaling of StandardSerializer.serialize under the current interpreter (speedup + efficiency%). On this machine it shows negative scaling (~100% → 11% efficiency) — serialize is GIL-bound today.
  • The no-GIL arm is interpreter-driven and ecosystem-blocked right now: PyO3 < 3.14 has no free-threaded support, and orjson / numpy / pandas / pyarrow lack free-threaded wheels. No code change needed once a free-threaded cachekit installs. (A separate PR will make orjson optional/lazy, the first of those blockers.)
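
The speedup and efficiency figures can be read as follows: speedup is the N-thread throughput divided by the single-thread throughput, and efficiency is that speedup divided by N. The sketch below is a rough stand-in for gil_benchmark.py, not its actual code. It assumes `json.dumps` as the workload because `StandardSerializer` is not available here, and all function names are hypothetical.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def scaling_metrics(baseline_ops: float, ops: float, threads: int) -> tuple[float, float]:
    """Return (speedup, efficiency %) relative to single-thread throughput."""
    speedup = ops / baseline_ops
    return speedup, 100.0 * speedup / threads


def throughput(work: Callable[[], object], threads: int, calls_per_thread: int) -> float:
    """Run `work` on `threads` threads and return total calls per second."""

    def loop() -> None:
        for _ in range(calls_per_thread):
            work()

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(loop) for _ in range(threads)]
        for future in futures:
            future.result()  # re-raise any worker exception
    elapsed = time.perf_counter() - start
    return threads * calls_per_thread / elapsed


def scaling_table(
    work: Callable[[], object],
    thread_counts: tuple[int, ...] = (1, 2, 4, 8),
    calls_per_thread: int = 2000,
) -> list[tuple[int, float, float, float]]:
    """Return rows of (threads, ops/s, speedup, efficiency %)."""
    baseline = throughput(work, 1, calls_per_thread)
    rows = []
    for n in thread_counts:
        ops = baseline if n == 1 else throughput(work, n, calls_per_thread)
        speedup, efficiency = scaling_metrics(baseline, ops, n)
        rows.append((n, ops, speedup, efficiency))
    return rows


if __name__ == "__main__":
    payload = {"values": list(range(200))}  # stand-in workload, an assumption
    for n, ops, speedup, eff in scaling_table(lambda: json.dumps(payload)):
        print(f"{n:>2} threads  {ops:>12.0f} ops/s  {speedup:5.2f}x  {eff:6.1f}%")
```

On a GIL-bound workload the speedup column should stay near or below 1.0 as threads are added, so efficiency falls roughly in proportion to the thread count. As a purely illustrative figure, a 1.1x speedup on 10 threads works out to 11% efficiency.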

Unified runner (Makefile)

  • make perf — env fingerprint + timer calibration → serializer benchmarks → GIL scaling (informational).
  • make perf-compare — regression gate (median:10%).
  • make benchmark / benchmark-compare / benchmark-gil remain for the individual pieces.

Design notes

  • median over mean, 10% threshold measured to sit above this suite's ~6% run-to-run median noise.
  • Not a CI gate, by design — baselines live in .benchmarks/ (gitignored, per-machine); wall-clock benchmarks deliberately don't gate CI (consistent with the existing memory-invariant-only CI policy).
  • Explicitly not a port of the prototype's bloat (D1 store, dashboards, ensemble-ML detectors, 91KB CLI).
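
To make the "median over mean" choice concrete, the sketch below compares the two statistics on made-up timings. It is only an illustration: the actual gate is pytest-benchmark's native `--benchmark-compare-fail=median:10%`, and the helper names and sample values here are invented.

```python
from statistics import mean, median


def pct_change(baseline: float, current: float) -> float:
    """Percentage change of `current` relative to `baseline` (positive = slower)."""
    return 100.0 * (current - baseline) / baseline


def is_regression(
    baseline_s: list[float], current_s: list[float], threshold_pct: float = 10.0
) -> bool:
    """Flag a regression when the median slows down by more than the threshold."""
    return pct_change(median(baseline_s), median(current_s)) > threshold_pct


if __name__ == "__main__":
    baseline = [1.00, 1.01, 0.99, 1.02, 1.00]
    one_outlier = [1.00, 1.01, 0.99, 1.02, 5.00]  # a single slow round
    truly_slower = [1.15, 1.16, 1.14, 1.17, 1.15]

    # The median barely moves on one outlier, while the mean jumps far past 10%.
    print(is_regression(baseline, one_outlier))
    print(pct_change(mean(baseline), mean(one_outlier)) > 10.0)
    print(is_regression(baseline, truly_slower))
```

A single slow round leaves the median almost unchanged, so the gate stays quiet, while a consistent 15% slowdown trips it. A shift of about 6%, the noise level quoted above, stays under the 10% threshold.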

Verification

  • make perf → env+calibration PASS, 16 benchmarks (67 others correctly skipped), GIL table; exits 0.
  • make perf-compare → gate passes, propagates non-zero on regression.
  • Normal suite skips the benchmark-fixture tests via --benchmark-skip; the CI -m "performance and slow" memory step stays green with the new autouse conftest.
  • ruff check + format + basedpyright clean on all new/changed files.
  • Rebased onto current main (666d09c).

coderabbitai Bot commented Jun 19, 2026


Walkthrough

The PR replaces the previous python -m benchmarks.cli benchmark commands with pytest-benchmark-driven make benchmark and make benchmark-compare targets. A new BENCHMARK_PYTEST variable configures pytest to discover Benchmark* classes in tests/benchmarks/benchmark_*.py. A benchmark-scoped conftest fixture suppresses Redis setup. Developer documentation is updated to match the new commands.

Changes

pytest-benchmark Migration

| Layer / File(s) | Summary |
| --- | --- |
| Makefile benchmark targets and conftest isolation fixture (Makefile, tests/benchmarks/conftest.py) | Introduces BENCHMARK_PYTEST variable targeting tests/benchmarks/benchmark_*.py with Benchmark* class restriction and GC disabled. The benchmark target runs with --benchmark-autosave to store a local baseline; the new benchmark-compare target re-runs against that baseline and fails on >10% median regression. Old benchmark (quick CLI) and benchmark-full targets are removed. A new autouse fixture in tests/benchmarks/conftest.py overrides the root Redis-isolation fixture with a no-op so benchmarks execute without a Redis backend. |
| Documentation updates (CONTRIBUTING.md, docs/getting-started.md, tests/performance/README.md) | Replaces references to the old benchmarks.cli and make test-performance-quick commands with make benchmark, make benchmark-compare, and the equivalent uv run pytest invocation. Documents that baselines are stored in .benchmarks/ (gitignored) and that regression comparisons are a local developer tool only. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title accurately describes the main changes: fixing the broken benchmark harness, consolidating to one folder, and improving measurement integrity with GIL analysis and unified runner. |
| Description check | ✅ Passed | The description is comprehensive, covering summary, motivation, what's included, design notes, and verification steps, though it lacks formal checklist completion as per the template. |


coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/benchmarks/conftest.py`:
- Around line 13-20: The function setup_di_for_redis_isolation() is missing a
return type hint annotation. Add an explicit return type hint of Generator[None,
None, None] to the function signature to satisfy the repository's typing
requirements for Python APIs. Since this is a pytest fixture that yields without
providing any value, this type annotation accurately represents the fixture's
behavior.
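
For reference, the requested annotation can be sketched as below. This assumes the fixture body is a bare `yield`, as the comment describes. The `@pytest.fixture(autouse=True)` decorator that would sit on the real fixture is left out so the sketch runs without pytest installed, and the `_short` variant name is hypothetical.

```python
from collections.abc import Generator, Iterator


# In the real conftest this would carry @pytest.fixture(autouse=True).
def setup_di_for_redis_isolation() -> Generator[None, None, None]:
    """No-op override: yield once without touching Redis."""
    yield


# Equivalent shorter spelling for a generator that only yields and never
# receives sent values or returns a result.
def setup_di_for_redis_isolation_short() -> Iterator[None]:
    """Same behaviour, annotated with the shorter Iterator form."""
    yield


if __name__ == "__main__":
    print(list(setup_di_for_redis_isolation()))
```

Both annotations are accepted for a generator function of this shape; `Generator[None, None, None]` spells out the send and return types, while `Iterator[None]` only states what is yielded.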

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 26581e89-2e4f-41d6-a4fa-b6731c939512

📥 Commits

Reviewing files that changed from the base of the PR and between c453288 and eac2dd9.

📒 Files selected for processing (5)
  • CONTRIBUTING.md
  • Makefile
  • docs/getting-started.md
  • tests/benchmarks/conftest.py
  • tests/performance/README.md

Comment thread tests/benchmarks/conftest.py Outdated
27Bslash6 added a commit that referenced this pull request Jun 19, 2026
Add `-> Iterator[None]` to the setup_di_for_redis_isolation override,
matching the existing precedent for the same fixture in
tests/unit/test_wrapper_lock_bare_key.py. Addresses a PR #188 review note.
…chmark targets

`make benchmark`/`make benchmark-full` pointed at a `benchmarks.cli` package that
does not exist in this repo (a leftover from the pyredis-cache-pro prototype), so
both failed on invocation. tests/benchmarks/ was also orphaned: its files
(benchmark_*.py / Benchmark* classes) never matched default pytest discovery, so
the only pytest-benchmark suite in the repo ran nowhere.

- Makefile: `make benchmark` runs tests/benchmarks/ with --benchmark-autosave;
  new `make benchmark-compare` fails on >10% median regression vs the last saved
  baseline. Median over mean for outlier-robustness; 10% sits above the ~6%
  run-to-run median noise measured on this suite. Native
  --benchmark-save/--benchmark-compare-fail — no custom stats code.
- tests/benchmarks/conftest.py: no-op override of the root autouse Redis-isolation
  fixture (serializer benchmarks need no backend), mirroring tests/unit/conftest.py.
- Baselines live in .benchmarks/ (already gitignored, per-machine), so regression
  comparison is a local developer tool; wall-clock benchmarks deliberately do not
  gate CI.
- Fix doc drift still pointing at the removed benchmarks.cli: CONTRIBUTING.md,
  docs/getting-started.md, tests/performance/README.md.
Add `-> Iterator[None]` to the setup_di_for_redis_isolation override,
matching the existing precedent for the same fixture in
tests/unit/test_wrapper_lock_bare_key.py. Addresses a PR #188 review note.
27Bslash6 force-pushed the feat/benchmark-harness branch from 54e63f1 to 68a1df1 on June 19, 2026 11:59

coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Makefile`:
- Around line 380-387: The benchmark-compare target currently exceeds the
configured five-line checkmake limit with seven lines of logging and error
handling. Extract the error message logging and conditional output logic (the
echo statements for failure and success, and the exit 1) into a separate helper
target or shell script function, then simplify benchmark-compare to call this
helper after running the benchmark comparison command. This will reduce
benchmark-compare below the five-line threshold while preserving all the current
logging and error handling behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2ee47dee-d797-4aac-acd6-d2a999d02e54

📥 Commits

Reviewing files that changed from the base of the PR and between 54e63f1 and 68a1df1.

📒 Files selected for processing (5)
  • CONTRIBUTING.md
  • Makefile
  • docs/getting-started.md
  • tests/benchmarks/conftest.py
  • tests/performance/README.md

Comment thread Makefile
Comment on lines +380 to +387
benchmark-compare: setup-logs ## Run benchmarks, fail on >10% median regression vs last saved baseline
	@echo "$(BLUE)Comparing against last saved baseline (run 'make benchmark' first)...$(RESET)"
	@if ! $(BENCHMARK_PYTEST) --benchmark-compare --benchmark-compare-fail=median:10% 2>&1 | tee $(LOG_BENCHMARK_DIR)/bench_compare_$(TIMESTAMP).log; then \
		echo "$(YELLOW)❌ benchmark-compare failed: median regression >10% vs baseline, or no baseline yet (run 'make benchmark' first).$(RESET)"; \
		echo "$(YELLOW) Threshold is tunable — set it above your machine's run-to-run median noise (~6% here).$(RESET)"; \
		exit 1; \
	fi
	@echo "$(GREEN)✓ No median regression beyond 10% vs baseline$(RESET)"


🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Keep benchmark-compare under the Make lint limit.

This recipe is seven lines, so checkmake will flag it against the configured five-line max. Folding the logging/error handling into a helper target or script would keep the lint green.

🧰 Tools
🪛 checkmake (0.3.2)

[warning] 380-380: Target body for "benchmark-compare" exceeds allowed length of 5 lines (7).

(maxbodylength)

Source: Linters/SAST tools

…fied runner

Builds on the benchmark harness (#188) with the robust-systems layer the
pyredis-cache-pro prototype had but the rewrite lacked, and consolidates the two
perf folders into one.

Folder consolidation: move tests/benchmarks/benchmark_serializer_integrity.py into
tests/performance/test_serializer_microbench.py (Test* classes); delete
tests/benchmarks/. Guard-vs-tracker is now separated by mechanism, not folder —
pytest-benchmark tests are selected with --benchmark-only and skipped in normal runs
via the --benchmark-skip default (pyproject addopts), dropping the prior
-o python_files/python_classes discovery hack. A file-scoped Redis no-op override
lives in the microbench file (tests/performance/ has Redis-dependent tests). Drops
the orphaned Blake3OverheadAnalysis print-only methods.

Measurement integrity (measurement_env.py + conftest.py): stable system fingerprint
(hash over deterministic fields only), environment pre-flight check (thermal-under-
load / CPU / memory / loadavg), and a timer self-calibration gate
(test_measurement_calibration.py), surfaced as a warn-only session banner. Reuses
stats_utils; supersedes the duplicate fingerprint fixture in test_statistical_rigor.py.

GIL scaling (gil_benchmark.py): thread-scaling of StandardSerializer.serialize under
the current interpreter (speedup + efficiency%). The no-GIL arm is interpreter-driven
and currently ecosystem-blocked (PyO3<3.14; orjson/numpy/pandas/pyarrow lack free-
threaded wheels) — no code change needed once a free-threaded cachekit installs.

Unified runner (Makefile): `make perf` runs the battery (env+calibration → serializer
benchmarks → GIL scaling); `make perf-compare` gates on >10% median regression. Docs
updated (CONTRIBUTING.md, tests/performance/README.md).
27Bslash6 changed the title from "chore: wire native pytest-benchmark regression gating; fix broken benchmark targets" to "chore: benchmark suite — fix harness, one-folder consolidation, measurement integrity + GIL + unified runner" Jun 19, 2026