forester: fix V1 nullify presort retry storm and reduce RPC pressure#2382
forester: fix V1 nullify presort retry storm and reduce RPC pressure#2382sergeytimoshin wants to merge 5 commits into
Conversation
The V1 send path's optional get_queue_leaf_indices "presort" was effectively always on (gated only by enable_v1_multi_nullify). On an indexer that does not implement the endpoint it returns 404, which the Photon client treated as a retryable ApiError and backed off ~44s (10 retries) per send. That consumed the entire per-slot send budget, so the chunk aborted at the deadline guard before building/sending any transaction — V1 queues never drained. Changes: - photon_indexer: treat an ApiError whose message contains "method not found" or "status 404" as non-retryable (fail fast instead of ~44s backoff). - forester: add `--enable-v1-presort` (env ENABLE_V1_PRESORT), default off, and gate the get_queue_leaf_indices presort path on it. - forester: replace the fixed 200ms V2 queue poll with adaptive backoff (200ms -> 10s cap, reset to 200ms when work is found) to stop idle V2 trees from re-fetching the queue ~5x/sec for the whole eligible window. - forester: lower default --rpc-pool-size 100 -> 32. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (5)
📝 WalkthroughWalkthroughAdds a CLI/config toggle for optional V1 presort and lowers the RPC pool default to 32; introduces a process-wide semaphore to limit V1 send concurrency, adaptive backoff for V2 polling, conditional compressible-provider spawn, smart-transaction resend rate-limiting, and indexer 404/method-not-found non-retry classification. ChangesForester runtime and send-path updates
Sequence DiagramsequenceDiagram
participant EpochManager
participant SendBatchedTransactions
participant V1_Send_Semaphore
participant RPCNode
EpochManager->>SendBatchedTransactions: request send batch (v1)
SendBatchedTransactions->>V1_Send_Semaphore: try acquire N permits (deadline-bound)
V1_Send_Semaphore-->>SendBatchedTransactions: permits granted / timeout / closed
SendBatchedTransactions->>RPCNode: submit transactions (bounded concurrency)
RPCNode-->>SendBatchedTransactions: send result / error
SendBatchedTransactions-->>EpochManager: aggregated chunk results
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 412b3356de
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| tokio::time::sleep(poll_interval).await; | ||
| poll_interval = (poll_interval * 2).min(POLL_INTERVAL_MAX); |
There was a problem hiding this comment.
Cap idle V2 polling to the light-slot budget
When a V2 tree is idle early in its eligible light slot, this sleep backs off without considering how much of the light slot remains. I checked the protocol defaults (programs/registry/src/protocol_config/state.rs): local/default slot_length is 10 Solana slots (~4s) and testnet is 60 (~24s), so after several empty polls the forester can sleep 6.4s/10s, wake after forester_slot_details.end_solana_slot, and skip work that arrived during the sleep until a later eligibility window. Cap the sleep/backoff by the remaining light-slot time or keep it well below the slot length so late-arriving V2 queue items can still be processed in the current slot.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@forester/src/epoch_manager.rs`:
- Around line 2268-2275: POLL_INTERVAL_MIN and POLL_INTERVAL_MAX are hardcoded;
make them configurable via CLI and environment variables and use those values
wherever the constants are referenced (e.g., POLL_INTERVAL_MIN,
POLL_INTERVAL_MAX, and the local poll_interval variable in epoch_manager.rs and
the other occurrences you noted). Add config fields (e.g.,
v2_poll_interval_min_ms and v2_poll_interval_max_ms) to the application
Config/opts (supporting both CLI flags and env vars), parse them into Durations
(defaulting to Duration::from_millis(200) and Duration::from_secs(10) when
unspecified), and replace the constants/inline literals with values from that
Config (ensure validation: min <= max and reasonable bounds). Ensure all uses
(including the other instances at the referenced locations) read from the same
config struct so deployments can tune RPC pressure without code changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 22decbc8-303a-44d0-aa19-6feb45d77c75
📒 Files selected for processing (4)
forester/src/cli.rsforester/src/config.rsforester/src/epoch_manager.rssdk-libs/client/src/indexer/photon_indexer.rs
| // Adaptive queue polling: start responsive, then back off (capped) while the | ||
| // queue has nothing ready to process, and reset to the minimum as soon as work | ||
| // is found. A fixed 200ms poll made idle V2 trees re-fetch the queue ~5x/sec for | ||
| // the whole eligible window, which is the dominant source of wasted RPC/indexer | ||
| // load (and can exhaust a shared RPC credit budget). | ||
| const POLL_INTERVAL_MIN: Duration = Duration::from_millis(200); | ||
| const POLL_INTERVAL_MAX: Duration = Duration::from_secs(10); | ||
| let mut poll_interval = POLL_INTERVAL_MIN; |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift
Make V2 adaptive polling bounds configurable via CLI/env.
Line 2273 and Line 2274 hardcode POLL_INTERVAL_MIN/POLL_INTERVAL_MAX. These are operational tuning knobs and should be externally configurable so deployments can tune RPC pressure vs responsiveness without code changes.
As per coding guidelines, “Use environment variables for configuration instead of hardcoded values” and “Support both CLI arguments and environment variables for all configuration options.”
Also applies to: 2346-2349, 2362-2363
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@forester/src/epoch_manager.rs` around lines 2268 - 2275, POLL_INTERVAL_MIN
and POLL_INTERVAL_MAX are hardcoded; make them configurable via CLI and
environment variables and use those values wherever the constants are referenced
(e.g., POLL_INTERVAL_MIN, POLL_INTERVAL_MAX, and the local poll_interval
variable in epoch_manager.rs and the other occurrences you noted). Add config
fields (e.g., v2_poll_interval_min_ms and v2_poll_interval_max_ms) to the
application Config/opts (supporting both CLI flags and env vars), parse them
into Durations (defaulting to Duration::from_millis(200) and
Duration::from_secs(10) when unspecified), and replace the constants/inline
literals with values from that Config (ensure validation: min <= max and
reasonable bounds). Ensure all uses (including the other instances at the
referenced locations) read from the same config struct so deployments can tune
RPC pressure without code changes.
Source: Coding guidelines
The api_server unconditionally spawned run_compressible_provider, which with no in-memory trackers falls back to a full paginated getProgramAccounts scan every 30s. Even with compressible tracking disabled (the default), that heavy scan ties up RPC pool connections and triggers "Failed to get RPC connection" pool failures on mainnet. Only spawn the provider when there is a cheap source — in-memory trackers (compression enabled) or upstream forester APIs to aggregate. Otherwise skip it; the dashboard simply reports no compressible counts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
forester/src/processor/v1/send_transaction.rs (1)
216-233:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReserve send permits by transaction count, not work-item count.
permits_to_reserveis derived fromwork_chunk.len(), butbuild_signed_transaction_batch(..., build_config)can collapse multiple work items into a single transaction whenbuild_config.batch_size/legacy_ixs_per_txis greater than 1. That over-reserves the process-wide semaphore, strands unused permits, and under contention can make a chunk returnOk(())with 0 sends even though there was enough capacity for the chunk’s actual transaction count. Base the reservation on the number of transactions this chunk can emit (or at leastdiv_ceil(work_chunk.len(), batch_size)) so the limiter matches real send concurrency.Suggested direction
- let permits_to_reserve = work_chunk.len().min(effective_max_concurrent_sends).max(1); + let tx_batch_size = usize::try_from(build_config.batch_size) + .unwrap_or(usize::MAX) + .max(1); + let permits_to_reserve = work_chunk + .len() + .div_ceil(tx_batch_size) + .min(effective_max_concurrent_sends) + .max(1);Also applies to: 248-257
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@forester/src/processor/v1/send_transaction.rs` around lines 216 - 233, The current code computes permits_to_reserve from work_chunk.len(), which over-reserves when build_signed_transaction_batch (and its build_config.batch_size / legacy_ixs_per_tx) can combine multiple work items into fewer transactions; change the reservation logic used before calling acquire_send_permits (and the identical logic in the other chunk where permits are reserved) to compute the number of transactions instead, e.g. div_ceil(work_chunk.len(), effective_batch_size) where effective_batch_size = build_config.batch_size * legacy_ixs_per_tx (or the exact contraction logic used by build_signed_transaction_batch), then .min(effective_max_concurrent_sends).max(1) to produce permits_to_reserve so the semaphore matches actual transaction concurrency.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@forester/src/processor/v1/send_transaction.rs`:
- Around line 216-233: The current code computes permits_to_reserve from
work_chunk.len(), which over-reserves when build_signed_transaction_batch (and
its build_config.batch_size / legacy_ixs_per_tx) can combine multiple work items
into fewer transactions; change the reservation logic used before calling
acquire_send_permits (and the identical logic in the other chunk where permits
are reserved) to compute the number of transactions instead, e.g.
div_ceil(work_chunk.len(), effective_batch_size) where effective_batch_size =
build_config.batch_size * legacy_ixs_per_tx (or the exact contraction logic used
by build_signed_transaction_batch), then
.min(effective_max_concurrent_sends).max(1) to produce permits_to_reserve so the
semaphore matches actual transaction concurrency.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0c062dbc-9528-4a41-9dde-0f98638a1d58
📒 Files selected for processing (1)
forester/src/processor/v1/send_transaction.rs
Problem
V1 state/address nullification silently stopped sending transactions: queues had pending items, the forester fetched them (
work_items=7), entered the send loop, then completed in microseconds with 0 sent — no proofs fetched, no tx built, no error. Queues never drained, even with healthy RPC, a caught-up indexer, and no OOM.Root cause
The V1 send path runs an optional
get_queue_leaf_indicespresort step (gated only onenable_v1_multi_nullify, so effectively always on). On an indexer that doesn't implement the endpoint it returns 404 "Method not found", which the Photon client's retry wrapper treated as a retryableApiErrorand retried 10× with exponential backoff:timeout_deadlineis anchored atfunction_start(a slot-derived budget of ~20s), so by the time the ~44s presort returned, the send chunk aborted at theInstant::now() >= timeout_deadlineguard before doing any work. Confirmed with instrumentation: chunk entry showedexpired=true, overdue=~22-35s.Changes
photon_indexer: anApiErrorwhose message contains"method not found"/"status 404"is now non-retryable — fail fast instead of ~44s of backoff. Affects any caller hitting a missing endpoint.--enable-v1-presort(envENABLE_V1_PRESORT), default off: theget_queue_leaf_indicespresort path no longer runs unless explicitly enabled (best-effort optimization; requires an indexer that implements it).200msqueue poll with backoff200ms → 10s(reset to200mswhen work is found). Idle V2 trees were re-fetching the queue ~5×/sec for the entire eligible window — the dominant source of wasted RPC/indexer load. Throughput while draining is unaffected (the loop still re-runs immediately when items are found).--rpc-pool-size100 → 32.Verification
Built from this branch and run on devnet + mainnet foresters: V1 nullify resumed (
tx sent … type=StateV1MultiNullify), queues drained7 → 0, transactions finalized on-chain (err=None), and0presort calls with the flag off.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Improvements