PLT-458: Open-loop scheduler (replace closed-loop dequeue)#48
PLT-458: Open-loop scheduler (replace closed-loop dequeue)#48bdchatham wants to merge 7 commits into
Conversation
Make transaction arrival open-loop to fix coordinated omission: tx i is issued at t₀ + i/λ independent of in-flight completion, so a slow SUT no longer slows the generator and hides backlog in latency. - sender/scheduler.go: openLoopScheduler owns t₀ and the monotonic sequence index i, derives λ from the shared rate.Limiter as a clock source (sampled per tick to honor a ramping λ; telescopes to t₀ + i/λ at fixed λ), and stamps IntendedSendTime at the true scheduled instant. - Overflow is bounded-in-flight + drop-and-count: a non-blocking semaphore TryAcquire admits the tx or drops-and-counts it; the arrival clock is never blocked on capacity (REL8/REL9 load shedding). - One rate authority preserved: the ramper still drives λ via limiter.SetLimit; the worker's busy-spin Allow() gate is replaced by a blocking Wait (closed-loop only) and disabled under open-loop. - Behind config flag arrival-model (default closed_loop, the regression baseline) + max-in-flight; arrival_model and run_txs_dropped_total are recorded at run end. - types.LoadTx gains SequenceIndex (scheduler-owned, single-write per the documented concurrency contract) for PLT-463 schedule-lag attribution; dropped txs carry zero InclusionTime and are kept out of inclusion-rate denominators. Tests: schedule-accuracy (tracks t₀ + i/λ within tolerance), clock not throttled by a slow sender (overrun dropped not blocked), ramped-λ gap shrink, and stamp-before-handoff under -race. go build + go test -race green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… bound B1: reject open_loop without a finite positive arrival rate. With TPS=0 and no ramp, λ=rate.Inf, the inter-arrival gap collapses to 0, IntendedSendTime never advances past t₀, and the scheduler spins and drops everything. Add config.Settings.Validate (TPS>0 or --ramp-up required for open_loop) and call it after ResolveSettings; fail fast with a clear error. minScheduleRate stays as divide-by-zero/+Inf defense-in-depth only. B2: tie the in-flight permit to real send completion, not enqueue. Worker.Send (via ShardedSender) returns at enqueue, so the prior defer release() bounded enqueue backlog, not unacked sends — and dropped reflected buffer geometry. Thread a LoadTx.OnComplete hook the worker invokes after sendTransaction; the scheduler stamps it to release the permit so maxInFlight bounds true in-flight and dropped measures genuine load-shed. Enqueue-failure path completes inline. schedule_lag (PLT-463) remains the primary CO-detection gate. Also: reject unrecognized --arrival-model at config load; drop the SequenceIndex self-disambiguation claim (gate on run-level arrival_model); state the dropped-tx inclusion-denominator invariant plainly; fix relase typo. Tests: replace the synchronous fakeSender with an async enqueue-and-complete sender so the slow-sender drop test exercises production semantics; add the permit-held-until-completion guard, a conservation invariant, and config validation rejection tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PR SummaryMedium Risk Overview Configuration and wiring. New Scheduler and pipeline. Observability and types. Reviewed by Cursor Bugbot for commit 33d7237. Bugbot is set up for automated code reviews on this repo. Configure here. |
Omit the redundant time.Duration type from the minGap declaration in scheduler_test (inferred from time.Hour). Caught by golangci-lint (CI gate). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-loop-scheduler # Conflicts: # main.go # sender/sharded_sender.go # sender/worker.go
| } | ||
| d.mu.Lock() | ||
| d.totalSent++ | ||
| d.mu.Unlock() |
There was a problem hiding this comment.
Failed sends break issued accounting
Medium Severity
Open-loop onSent increments totalSent only when the worker reports a nil send error. Transactions that pass admission and enqueue but fail in sendTransaction release the in-flight permit and are logged, yet they are neither counted as sent nor as dropped, so issued == sent + dropped and related summaries can be wrong under RPC failures.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 3cdaf95. Configure here.
Every existing scheduler test drove a fake TxSender that fired tx.OnComplete itself, so the suite stayed green even if the real Worker forgot to invoke it in runTxSender — leaking the open-loop in-flight semaphore (permits never released → the maxInFlight bound becomes meaningless). Add an httptest JSON-RPC harness (answers eth_sendRawTransaction, the only RPC the ethclient send path issues) behind the real Worker + open-loop scheduler: - Conservation on the real path: issued == completed + dropped, with completed driven by the real worker's OnComplete; handled sends matched against the real RPC server's count. - Permit released by the worker: maxInFlight=1 with a server that blocks one send holds exactly one in flight and drops the rest; releasing it resumes flow — which is only possible if the worker fires OnComplete. Both tests fail when the OnComplete invoke is removed (verified). Also clarify the scheduler doc: enqueue is async but the RPC send is synchronous, so the permit is held for the full round-trip. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The conservation test sampled `succeeded == handled` against an in-flight window: the httptest server bumps `handled` on receiving eth_sendRawTransaction, but the worker bumps `succeeded` only after SendTransaction returns and OnComplete fires. CI's slower scheduling caught that window (handled=200, succeeded=199). The dominant cause was teardown ordering, not pure sampling: the scheduler ran as the scope's main task, so the instant it exhausted the generator and returned, service.Run canceled the worker's context — aborting the last send whose 200 OK the server had already counted. That send completed with context-canceled, so completed++ but not succeeded++. Fix: run the scheduler and worker as background tasks behind a main gate that blocks until the test tears down, so the scope stays alive until quiescence. Anchor the assertion on the fixed total and require exhaustion, conservation, and equality together in one predicate, evaluated only once they all hold (a stable fixpoint, since the counters are monotonic and no new work is issued after exhaustion). Correctness depends on convergence, not the deadline. Verified: go test -race -count=50 and -count=20 -cpu=1,2,4 green; GOMAXPROCS=1 and =2 green. Falsification holds — commenting out the OnComplete invoke in runTxSender fails both tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 0167803. Configure here.
| Debug: cfg.Settings.Debug, | ||
| Collector: collector, | ||
| Limiter: limiter, | ||
| RateLimited: rateLimited, |
There was a problem hiding this comment.
Prewarm skips open-loop rate limit
Medium Severity
With arrival_model set to open loop, workers are built with RateLimited false for the whole run. Prewarm still uses those workers before the open-loop scheduler starts, so prewarm txs no longer honor the shared rate.Limiter and can flood endpoints at channel speed instead of configured TPS.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 0167803. Configure here.
Move the dense coordinated-omission / arrival-model narrative out of scheduler.go into a new sender/doc.go package doc, and lean the inline comments to terse pointers. No behavior change. The load-bearing inline notes stay (leaned) at the code they guard: the worker's OnComplete permit-release and the single-writer stamp-before- hand-off contract. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


Implements PLT-458 — the core of the coordinated-omission fix. Issues at
t_i = t₀ + i/λindependent of in-flight completion, so latency no longer hides the backlog.What
sender/scheduler.go(openLoopScheduler): arrival clock att₀ + i/λ(absolute-instant sleep → drift-free), stampsIntendedSendTimeat the true scheduled instant + a per-txSequenceIndex. Workers become pure async senders (thelimiter.Allow()busy-spin is gone).LoadTx.OnCompletehook fired by the worker aftersendTransaction), sodroppedmeasures genuine load-shed, not buffer geometry.rate.Limiter. Behind a config flag (--arrival-model), with the legacy closed-loop path retained as the regression baseline.arrival_modelrecorded.Review (systems + measurement + idiom, two rounds)
The loop caught and fixed two blocking issues before merge:
TPS=0→rate.Inf→gap=0→ a degenerate constant latency anchor. Now rejected at config validation (open-loop requires finite positive λ:TPS>0or a ramp).droppedmeasured buffer geometry and a synchronous test masked it. Fixed to release at real send completion; test now uses an async sender + a directPermitHeldUntilCompletionguard. Verified: concurrency-correct (sync.Once, happens-before intact,-race -count=20clean), conservationissued == sent + droppedholds.schedule_lag(PLT-463) remains the primary CO-detection gate. Forward-note: inclusion-rate denominators must usesent, neverissued.Decision brief:
designs/sei-load-workload-modeler/PLT-458-open-loop-scheduler.md.🤖 Generated with Claude Code