chore: benchmark suite — fix harness, one-folder consolidation, measurement integrity + GIL + unified runner#188
chore: benchmark suite — fix harness, one-folder consolidation, measurement integrity + GIL + unified runner#18827Bslash6 wants to merge 3 commits into
Conversation
WalkthroughThe PR replaces the previous Changespytest-benchmark Migration
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/benchmarks/conftest.py`:
- Around line 13-20: The function setup_di_for_redis_isolation() is missing a
return type hint annotation. Add an explicit return type hint of Generator[None,
None, None] to the function signature to satisfy the repository's typing
requirements for Python APIs. Since this is a pytest fixture that yields without
providing any value, this type annotation accurately represents the fixture's
behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 26581e89-2e4f-41d6-a4fa-b6731c939512
📒 Files selected for processing (5)
CONTRIBUTING.mdMakefiledocs/getting-started.mdtests/benchmarks/conftest.pytests/performance/README.md
Add `-> Iterator[None]` to the setup_di_for_redis_isolation override, matching the existing precedent for the same fixture in tests/unit/test_wrapper_lock_bare_key.py. Addresses a PR #188 review note.
…chmark targets `make benchmark`/`make benchmark-full` pointed at a `benchmarks.cli` package that does not exist in this repo (a leftover from the pyredis-cache-pro prototype), so both failed on invocation. tests/benchmarks/ was also orphaned: its files (benchmark_*.py / Benchmark* classes) never matched default pytest discovery, so the only pytest-benchmark suite in the repo ran nowhere. - Makefile: `make benchmark` runs tests/benchmarks/ with --benchmark-autosave; new `make benchmark-compare` fails on >10% median regression vs the last saved baseline. Median over mean for outlier-robustness; 10% sits above the ~6% run-to-run median noise measured on this suite. Native --benchmark-save/--benchmark-compare-fail — no custom stats code. - tests/benchmarks/conftest.py: no-op override of the root autouse Redis-isolation fixture (serializer benchmarks need no backend), mirroring tests/unit/conftest.py. - Baselines live in .benchmarks/ (already gitignored, per-machine), so regression comparison is a local developer tool; wall-clock benchmarks deliberately do not gate CI. - Fix doc drift still pointing at the removed benchmarks.cli: CONTRIBUTING.md, docs/getting-started.md, tests/performance/README.md.
Add `-> Iterator[None]` to the setup_di_for_redis_isolation override, matching the existing precedent for the same fixture in tests/unit/test_wrapper_lock_bare_key.py. Addresses a PR #188 review note.
54e63f1 to
68a1df1
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@Makefile`:
- Around line 380-387: The benchmark-compare target currently exceeds the
configured five-line checkmake limit with seven lines of logging and error
handling. Extract the error message logging and conditional output logic (the
echo statements for failure and success, and the exit 1) into a separate helper
target or shell script function, then simplify benchmark-compare to call this
helper after running the benchmark comparison command. This will reduce
benchmark-compare below the five-line threshold while preserving all the current
logging and error handling behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2ee47dee-d797-4aac-acd6-d2a999d02e54
📒 Files selected for processing (5)
CONTRIBUTING.mdMakefiledocs/getting-started.mdtests/benchmarks/conftest.pytests/performance/README.md
| benchmark-compare: setup-logs ## Run benchmarks, fail on >10% median regression vs last saved baseline | ||
| @echo "$(BLUE)Comparing against last saved baseline (run 'make benchmark' first)...$(RESET)" | ||
| @if ! $(BENCHMARK_PYTEST) --benchmark-compare --benchmark-compare-fail=median:10% 2>&1 | tee $(LOG_BENCHMARK_DIR)/bench_compare_$(TIMESTAMP).log; then \ | ||
| echo "$(YELLOW)❌ benchmark-compare failed: median regression >10% vs baseline, or no baseline yet (run 'make benchmark' first).$(RESET)"; \ | ||
| echo "$(YELLOW) Threshold is tunable — set it above your machine's run-to-run median noise (~6% here).$(RESET)"; \ | ||
| exit 1; \ | ||
| fi | ||
| @echo "$(GREEN)✓ No median regression beyond 10% vs baseline$(RESET)" |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Keep benchmark-compare under the Make lint limit.
This recipe is seven lines, so checkmake will flag it against the configured five-line max. Folding the logging/error handling into a helper target or script would keep the lint green.
🧰 Tools
🪛 checkmake (0.3.2)
[warning] 380-380: Target body for "benchmark-compare" exceeds allowed length of 5 lines (7).
(maxbodylength)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@Makefile` around lines 380 - 387, The benchmark-compare target currently
exceeds the configured five-line checkmake limit with seven lines of logging and
error handling. Extract the error message logging and conditional output logic
(the echo statements for failure and success, and the exit 1) into a separate
helper target or shell script function, then simplify benchmark-compare to call
this helper after running the benchmark comparison command. This will reduce
benchmark-compare below the five-line threshold while preserving all the current
logging and error handling behavior.
Source: Linters/SAST tools
…fied runner Builds on the benchmark harness (#188) with the robust-systems layer the pyredis-cache-pro prototype had but the rewrite lacked, and consolidates the two perf folders into one. Folder consolidation: move tests/benchmarks/benchmark_serializer_integrity.py into tests/performance/test_serializer_microbench.py (Test* classes); delete tests/benchmarks/. Guard-vs-tracker is now separated by mechanism, not folder — pytest-benchmark tests are selected with --benchmark-only and skipped in normal runs via the --benchmark-skip default (pyproject addopts), dropping the prior -o python_files/python_classes discovery hack. A file-scoped Redis no-op override lives in the microbench file (tests/performance/ has Redis-dependent tests). Drops the orphaned Blake3OverheadAnalysis print-only methods. Measurement integrity (measurement_env.py + conftest.py): stable system fingerprint (hash over deterministic fields only), environment pre-flight check (thermal-under- load / CPU / memory / loadavg), and a timer self-calibration gate (test_measurement_calibration.py), surfaced as a warn-only session banner. Reuses stats_utils; supersedes the duplicate fingerprint fixture in test_statistical_rigor.py. GIL scaling (gil_benchmark.py): thread-scaling of StandardSerializer.serialize under the current interpreter (speedup + efficiency%). The no-GIL arm is interpreter-driven and currently ecosystem-blocked (PyO3<3.14; orjson/numpy/pandas/pyarrow lack free- threaded wheels) — no code change needed once a free-threaded cachekit installs. Unified runner (Makefile): `make perf` runs the battery (env+calibration → serializer benchmarks → GIL scaling); `make perf-compare` gates on >10% median regression. Docs updated (CONTRIBUTING.md, tests/performance/README.md).
Summary
Turns the benchmark harness into a usable, trustworthy, one-folder perf system. Three commits:
make benchmark/benchmark-fullpointed at a non-existentbenchmarks.cli(prototype leftover);tests/benchmarks/was orphaned (never matched pytest discovery). Wired native pytest-benchmark with amedian:10%regression gate.tests/performance/. The guard-vs-tracker distinction is enforced by mechanism, not folders: pytest-benchmark tests are selected with--benchmark-onlyand skipped in normal runs via the--benchmark-skipdefault. Removes the earlier-o python_files/python_classesdiscovery hack.What's here
Folder consolidation
tests/benchmarks/benchmark_serializer_integrity.py→tests/performance/test_serializer_microbench.py(rename preserved).tests/benchmarks/deleted.Blake3OverheadAnalysisprint-only methods.Measurement integrity (
measurement_env.py+conftest.py)test_measurement_calibration.py) — "is my instrument honest?" before trusting any number.stats_utils; supersedes the duplicate fingerprint fixture intest_statistical_rigor.py.GIL scaling (
gil_benchmark.py)StandardSerializer.serializeunder the current interpreter (speedup + efficiency%). On this machine it shows negative scaling (~100% → 11% efficiency) — serialize is GIL-bound today.Unified runner (
Makefile)make perf— env fingerprint + timer calibration → serializer benchmarks → GIL scaling (informational).make perf-compare— regression gate (median:10%).make benchmark/benchmark-compare/benchmark-gilremain for the individual pieces.Design notes
.benchmarks/(gitignored, per-machine); wall-clock benchmarks deliberately don't gate CI (consistent with the existing memory-invariant-only CI policy).Verification
make perf→ env+calibration PASS, 16 benchmarks (67 others correctly skipped), GIL table; exits 0.make perf-compare→ gate passes, propagates non-zero on regression.--benchmark-skip; the CI-m "performance and slow"memory step stays green with the new autouse conftest.main(666d09c).