Skip to content

PLT-458: Open-loop scheduler (replace closed-loop dequeue)#48

Open
bdchatham wants to merge 7 commits into
mainfrom
brandon2/plt-458-open-loop-scheduler
Open

PLT-458: Open-loop scheduler (replace closed-loop dequeue)#48
bdchatham wants to merge 7 commits into
mainfrom
brandon2/plt-458-open-loop-scheduler

Conversation

@bdchatham

Copy link
Copy Markdown
Contributor

Implements PLT-458 — the core of the coordinated-omission fix. Issues at t_i = t₀ + i/λ independent of in-flight completion, so latency no longer hides the backlog.

What

  • New sender/scheduler.go (openLoopScheduler): arrival clock at t₀ + i/λ (absolute-instant sleep → drift-free), stamps IntendedSendTime at the true scheduled instant + a per-tx SequenceIndex. Workers become pure async senders (the limiter.Allow() busy-spin is gone).
  • Bounded in-flight + drop-and-count: a semaphore caps true unacked sends; on overflow the overdue tx is dropped and counted — the arrival clock is never throttled by backpressure. The permit is released at real send completion (a LoadTx.OnComplete hook fired by the worker after sendTransaction), so dropped measures genuine load-shed, not buffer geometry.
  • One rate authority: λ comes from the ramper's shared rate.Limiter. Behind a config flag (--arrival-model), with the legacy closed-loop path retained as the regression baseline. arrival_model recorded.

Review (systems + measurement + idiom, two rounds)

The loop caught and fixed two blocking issues before merge:

  • B1: open-loop with the default TPS=0rate.Infgap=0 → a degenerate constant latency anchor. Now rejected at config validation (open-loop requires finite positive λ: TPS>0 or a ramp).
  • B2: the in-flight semaphore originally released at enqueue, so dropped measured buffer geometry and a synchronous test masked it. Fixed to release at real send completion; test now uses an async sender + a direct PermitHeldUntilCompletion guard. Verified: concurrency-correct (sync.Once, happens-before intact, -race -count=20 clean), conservation issued == sent + dropped holds.

schedule_lag (PLT-463) remains the primary CO-detection gate. Forward-note: inclusion-rate denominators must use sent, never issued.

Decision brief: designs/sei-load-workload-modeler/PLT-458-open-loop-scheduler.md.

🤖 Generated with Claude Code

bdchatham and others added 2 commits June 12, 2026 07:58
Make transaction arrival open-loop to fix coordinated omission: tx i is
issued at t₀ + i/λ independent of in-flight completion, so a slow SUT no
longer slows the generator and hides backlog in latency.

- sender/scheduler.go: openLoopScheduler owns t₀ and the monotonic
  sequence index i, derives λ from the shared rate.Limiter as a clock
  source (sampled per tick to honor a ramping λ; telescopes to t₀ + i/λ
  at fixed λ), and stamps IntendedSendTime at the true scheduled instant.
- Overflow is bounded-in-flight + drop-and-count: a non-blocking
  semaphore TryAcquire admits the tx or drops-and-counts it; the arrival
  clock is never blocked on capacity (REL8/REL9 load shedding).
- One rate authority preserved: the ramper still drives λ via
  limiter.SetLimit; the worker's busy-spin Allow() gate is replaced by a
  blocking Wait (closed-loop only) and disabled under open-loop.
- Behind config flag arrival-model (default closed_loop, the regression
  baseline) + max-in-flight; arrival_model and run_txs_dropped_total are
  recorded at run end.
- types.LoadTx gains SequenceIndex (scheduler-owned, single-write per the
  documented concurrency contract) for PLT-463 schedule-lag attribution;
  dropped txs carry zero InclusionTime and are kept out of inclusion-rate
  denominators.

Tests: schedule-accuracy (tracks t₀ + i/λ within tolerance), clock not
throttled by a slow sender (overrun dropped not blocked), ramped-λ gap
shrink, and stamp-before-handoff under -race. go build + go test -race
green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… bound

B1: reject open_loop without a finite positive arrival rate. With TPS=0 and
no ramp, λ=rate.Inf, the inter-arrival gap collapses to 0, IntendedSendTime
never advances past t₀, and the scheduler spins and drops everything. Add
config.Settings.Validate (TPS>0 or --ramp-up required for open_loop) and call
it after ResolveSettings; fail fast with a clear error. minScheduleRate stays
as divide-by-zero/+Inf defense-in-depth only.

B2: tie the in-flight permit to real send completion, not enqueue. Worker.Send
(via ShardedSender) returns at enqueue, so the prior defer release() bounded
enqueue backlog, not unacked sends — and dropped reflected buffer geometry.
Thread a LoadTx.OnComplete hook the worker invokes after sendTransaction; the
scheduler stamps it to release the permit so maxInFlight bounds true in-flight
and dropped measures genuine load-shed. Enqueue-failure path completes inline.
schedule_lag (PLT-463) remains the primary CO-detection gate.

Also: reject unrecognized --arrival-model at config load; drop the
SequenceIndex self-disambiguation claim (gate on run-level arrival_model);
state the dropped-tx inclusion-denominator invariant plainly; fix relase typo.

Tests: replace the synchronous fakeSender with an async enqueue-and-complete
sender so the slow-sender drop test exercises production semantics; add the
permit-held-until-completion guard, a conservation invariant, and config
validation rejection tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cursor

cursor Bot commented Jun 12, 2026

Copy link
Copy Markdown

PR Summary

Medium Risk
Changes core load-generation timing and default remains closed-loop, but open-loop alters latency measurement semantics and in-flight accounting; mistakes in OnComplete or validation could skew benchmarks or silently drop load.

Overview
Introduces an open-loop transaction arrival path to fix coordinated omission: txs are scheduled at t₀ + i/λ from the shared rate limiter, independent of sender backlog, with closed_loop unchanged as the default baseline.

Configuration and wiring. New --arrival-model and --max-in-flight settings (plus Settings.Validate() rejecting open-loop without finite λ via TPS>0 or --ramp-up). main enables open-loop on the live-send dispatcher path, disables worker-side limiter gating when open-loop is active, and passes arrival model + drop count into run-summary metrics.

Scheduler and pipeline. openLoopScheduler sleeps to absolute schedule instants, stamps IntendedSendTime and SequenceIndex, admits sends via non-blocking Semaphore.TryAcquire, and drops and counts when in-flight is saturated. Permits release on LoadTx.OnComplete after the worker’s real RPC send (not at enqueue). Dispatcher branches between closed-loop generate-send lockstep and open-loop; workers gain optional RateLimited and the OnComplete hook.

Observability and types. run_txs_dropped_total (tagged by arrival_model) and extended LoadTx / package docs for schedule-lag semantics. Broad unit tests plus real-worker integration tests guard schedule accuracy, drop behavior, permit lifecycle, and conservation.

Reviewed by Cursor Bugbot for commit 33d7237. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread sender/scheduler.go
Omit the redundant time.Duration type from the minGap declaration in
scheduler_test (inferred from time.Hour). Caught by golangci-lint (CI gate).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-loop-scheduler

# Conflicts:
#	main.go
#	sender/sharded_sender.go
#	sender/worker.go
Comment thread sender/dispatcher.go
}
d.mu.Lock()
d.totalSent++
d.mu.Unlock()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failed sends break issued accounting

Medium Severity

Open-loop onSent increments totalSent only when the worker reports a nil send error. Transactions that pass admission and enqueue but fail in sendTransaction release the in-flight permit and are logged, yet they are neither counted as sent nor as dropped, so issued == sent + dropped and related summaries can be wrong under RPC failures.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 3cdaf95. Configure here.

bdchatham and others added 2 commits June 12, 2026 11:47
Every existing scheduler test drove a fake TxSender that fired tx.OnComplete
itself, so the suite stayed green even if the real Worker forgot to invoke it
in runTxSender — leaking the open-loop in-flight semaphore (permits never
released → the maxInFlight bound becomes meaningless).

Add an httptest JSON-RPC harness (answers eth_sendRawTransaction, the only RPC
the ethclient send path issues) behind the real Worker + open-loop scheduler:

- Conservation on the real path: issued == completed + dropped, with completed
  driven by the real worker's OnComplete; handled sends matched against the
  real RPC server's count.
- Permit released by the worker: maxInFlight=1 with a server that blocks one
  send holds exactly one in flight and drops the rest; releasing it resumes
  flow — which is only possible if the worker fires OnComplete.

Both tests fail when the OnComplete invoke is removed (verified). Also clarify
the scheduler doc: enqueue is async but the RPC send is synchronous, so the
permit is held for the full round-trip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The conservation test sampled `succeeded == handled` against an in-flight
window: the httptest server bumps `handled` on receiving
eth_sendRawTransaction, but the worker bumps `succeeded` only after
SendTransaction returns and OnComplete fires. CI's slower scheduling caught
that window (handled=200, succeeded=199).

The dominant cause was teardown ordering, not pure sampling: the scheduler ran
as the scope's main task, so the instant it exhausted the generator and
returned, service.Run canceled the worker's context — aborting the last send
whose 200 OK the server had already counted. That send completed with
context-canceled, so completed++ but not succeeded++.

Fix: run the scheduler and worker as background tasks behind a main gate that
blocks until the test tears down, so the scope stays alive until quiescence.
Anchor the assertion on the fixed total and require exhaustion, conservation,
and equality together in one predicate, evaluated only once they all hold (a
stable fixpoint, since the counters are monotonic and no new work is issued
after exhaustion). Correctness depends on convergence, not the deadline.

Verified: go test -race -count=50 and -count=20 -cpu=1,2,4 green; GOMAXPROCS=1
and =2 green. Falsification holds — commenting out the OnComplete invoke in
runTxSender fails both tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 0167803. Configure here.

Comment thread sender/sharded_sender.go
Debug: cfg.Settings.Debug,
Collector: collector,
Limiter: limiter,
RateLimited: rateLimited,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prewarm skips open-loop rate limit

Medium Severity

With arrival_model set to open loop, workers are built with RateLimited false for the whole run. Prewarm still uses those workers before the open-loop scheduler starts, so prewarm txs no longer honor the shared rate.Limiter and can flood endpoints at channel speed instead of configured TPS.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 0167803. Configure here.

Move the dense coordinated-omission / arrival-model narrative out of
scheduler.go into a new sender/doc.go package doc, and lean the inline
comments to terse pointers. No behavior change.

The load-bearing inline notes stay (leaned) at the code they guard: the
worker's OnComplete permit-release and the single-writer stamp-before-
hand-off contract.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant