forester: fix V1 nullify presort retry storm and reduce RPC pressure#2382

Open: sergeytimoshin wants to merge 5 commits into main from sergey/forester-rpc-efficiency

@sergeytimoshin sergeytimoshin commented Jun 8, 2026

Contributor

Problem

V1 state/address nullification silently stopped sending transactions: queues had pending items, the forester fetched them (work_items=7), entered the send loop, then completed in microseconds with 0 sent — no proofs fetched, no tx built, no error. Queues never drained, even with healthy RPC, a caught-up indexer, and no OOM.

Root cause

The V1 send path runs an optional get_queue_leaf_indices presort step (gated only on enable_v1_multi_nullify, so effectively always on). An indexer that doesn't implement the endpoint answers with 404 "Method not found", which the Photon client's retry wrapper treated as a retryable ApiError and retried 10× with exponential backoff:

400 + 800 + 1600 + 3200 + 6400 + 8000×4 = 44,400 ms ≈ 44s

timeout_deadline is anchored at function_start (a slot-derived budget of ~20s), so by the time the ~44s presort returned, the send chunk aborted at the Instant::now() >= timeout_deadline guard before doing any work. Confirmed with instrumentation: chunk entry showed expired=true, overdue=~22-35s.
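
The arithmetic above can be reproduced with a small sketch. The parameters (400ms initial delay, doubling, 8s cap, one sleep between each of 10 attempts) are inferred from the sum in this description, not read from the Photon client, and the function names are hypothetical.

```rust
use std::time::Duration;

/// Total time spent sleeping across `attempts` tries with capped exponential
/// backoff. One sleep happens between consecutive attempts, so there are
/// `attempts - 1` sleeps in total.
fn total_backoff(attempts: u32, initial: Duration, cap: Duration) -> Duration {
    let mut delay = initial.min(cap);
    let mut total = Duration::ZERO;
    for _ in 1..attempts {
        total += delay;
        // Double the delay for the next sleep, never exceeding the cap.
        delay = (delay * 2).min(cap);
    }
    total
}

/// True when the time burned on retries alone meets or exceeds the send
/// budget, i.e. a deadline guard anchored at function start would already fire.
fn budget_exhausted(backoff: Duration, budget: Duration) -> bool {
    backoff >= budget
}
```

With the assumed values, `total_backoff(10, 400ms, 8s)` is 44.4s. Against a ~20s budget that leaves the chunk roughly 24s overdue, which falls inside the 22-35s range seen in the instrumentation.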

Changes

  • photon_indexer: an ApiError whose message contains "method not found" / "status 404" is now non-retryable — fail fast instead of ~44s of backoff. Affects any caller hitting a missing endpoint.
  • --enable-v1-presort (env ENABLE_V1_PRESORT), default off: the get_queue_leaf_indices presort path no longer runs unless explicitly enabled (best-effort optimization; requires an indexer that implements it).
  • Adaptive V2 poll: replaced the fixed 200ms queue poll with backoff 200ms → 10s (reset to 200ms when work is found). Idle V2 trees were re-fetching the queue ~5×/sec for the entire eligible window — the dominant source of wasted RPC/indexer load. Throughput while draining is unaffected (the loop still re-runs immediately when items are found).
  • Default --rpc-pool-size 100 → 32.
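
A minimal sketch of the fail-fast classification from the first bullet. It assumes errors are plain strings and that matching is case-insensitive; `is_non_retryable_api_error` and `call_with_retry` are hypothetical names rather than the actual photon_indexer code, and the backoff sleep is omitted for brevity.

```rust
/// Hypothetical helper: true when an API error message indicates a missing
/// endpoint, which retrying cannot fix.
fn is_non_retryable_api_error(message: &str) -> bool {
    let msg = message.to_ascii_lowercase();
    msg.contains("method not found") || msg.contains("status 404")
}

/// Runs `op` up to `max_attempts` times and returns the final result together
/// with the number of attempts made. Non-retryable errors return immediately.
fn call_with_retry<T, F>(max_attempts: u32, mut op: F) -> (Result<T, String>, u32)
where
    F: FnMut() -> Result<T, String>,
{
    let mut attempts = 0;
    loop {
        attempts += 1;
        match op() {
            Ok(value) => return (Ok(value), attempts),
            Err(e) if is_non_retryable_api_error(&e) || attempts >= max_attempts => {
                return (Err(e), attempts);
            }
            // A real client would sleep with backoff here before retrying.
            Err(_) => continue,
        }
    }
}
```

Under these assumptions a missing endpoint costs one round trip instead of the full backoff sequence, while other API errors still use every attempt.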

Verification

Built from this branch and run on devnet + mainnet foresters: V1 nullify resumed (tx sent … type=StateV1MultiNullify), queues drained 7 → 0, transactions finalized on-chain (err=None), and 0 presort calls with the flag off.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • CLI flag to enable best-effort pre-sorting of V1 work items.
    • Config option exposing the new pre-sort toggle.
  • Improvements

    • Reduced default RPC pool size from 100 to 32.
    • Global concurrency limiting for V1 transaction sends with bounded batching.
    • Rate-limited transaction resends to reduce retry load.
    • Adaptive queue polling with exponential backoff when idle.
    • Indexer error handling skips retries for certain non-retryable responses.
    • Background compressible provider spawned only when a cheap source exists.

The V1 send path's optional get_queue_leaf_indices "presort" was effectively
always on (gated only by enable_v1_multi_nullify). On an indexer that does not
implement the endpoint it returns 404, which the Photon client treated as a
retryable ApiError and backed off ~44s (10 retries) per send. That consumed the
entire per-slot send budget, so the chunk aborted at the deadline guard before
building/sending any transaction — V1 queues never drained.

Changes:
- photon_indexer: treat an ApiError whose message contains "method not found"
  or "status 404" as non-retryable (fail fast instead of ~44s backoff).
- forester: add `--enable-v1-presort` (env ENABLE_V1_PRESORT), default off, and
  gate the get_queue_leaf_indices presort path on it.
- forester: replace the fixed 200ms V2 queue poll with adaptive backoff
  (200ms -> 10s cap, reset to 200ms when work is found) to stop idle V2 trees
  from re-fetching the queue ~5x/sec for the whole eligible window.
- forester: lower default --rpc-pool-size 100 -> 32.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented Jun 8, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3939782a-e251-43b2-9330-a83444207c74

📥 Commits

Reviewing files that changed from the base of the PR and between 2627db3 and aa948ea.

📒 Files selected for processing (5)
  • forester/tests/e2e_test.rs
  • forester/tests/legacy/priority_fee_test.rs
  • forester/tests/legacy/test_utils.rs
  • forester/tests/priority_fee_test.rs
  • forester/tests/test_utils.rs

📝 Walkthrough

Adds a CLI/config toggle for optional V1 presort and lowers the RPC pool default to 32; introduces a process-wide semaphore to limit V1 send concurrency, adaptive backoff for V2 polling, conditional compressible-provider spawn, smart-transaction resend rate-limiting, and indexer 404/method-not-found non-retry classification.

Changes

Forester runtime and send-path updates

| Layer / File(s) | Summary |
| --- | --- |
| V1 Presort CLI and Configuration Wiring: forester/src/cli.rs, forester/src/config.rs | Reduces RPC pool default from 100→32, adds --enable-v1-presort/ENABLE_V1_PRESORT, and wires ForesterConfig.enable_v1_presort into new_for_start, new_for_status (false), and Clone. |
| Photon indexer: non-retryable 404/method-not-found: sdk-libs/client/src/indexer/photon_indexer.rs | Classifies IndexerError::ApiError messages containing “method not found” or “status 404” as non-retryable and logs accordingly; other API errors continue to be retried. |
| Conditional compressible provider spawn: forester/src/api_server.rs | run_compressible_provider is spawned only if in-memory trackers exist or config.forester_api_urls is non-empty. |
| EpochManager: semaphore, presort toggle, adaptive polling: forester/src/epoch_manager.rs | Adds process-wide v1_send_permits: Arc<Semaphore> initialized from transaction_config.max_concurrent_sends, passes it into V1 batching, sets enable_presort from config, and replaces fixed 200ms idle sleep with exponential-backoff polling (200ms doubling to 10s, reset on work). |
| V1 send: chunking, permit acquisition, bounded concurrency: forester/src/processor/v1/send_transaction.rs | Introduces shared send_permits semaphore, computes effective batch sizes, pre-acquires per-chunk permits with deadline-bound acquisition, limits concurrent chunks, uses buffer_unordered for results, and maps permit timeouts/closed-limiter to explicit send result variants. |
| Smart-transaction resend rate-limiting: forester/src/smart_transaction.rs | Adds MIN_TRANSACTION_RESEND_INTERVAL, computes resend interval from poll interval (×4 with minimum), and only invokes send when scheduled next_send_at is reached; advances schedule after each attempt. |
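
The resend rate-limiting described in the last row can be sketched as follows. This is an illustration only: the 2s value for MIN_TRANSACTION_RESEND_INTERVAL is a placeholder assumption (the summary does not state it), and `ResendSchedule` is a hypothetical type, not the structure used in smart_transaction.rs.

```rust
use std::time::{Duration, Instant};

// Placeholder value for illustration; the real constant is not given above.
const MIN_TRANSACTION_RESEND_INTERVAL: Duration = Duration::from_secs(2);

/// Resend interval: four poll intervals, but never below the minimum.
fn resend_interval(poll_interval: Duration) -> Duration {
    (poll_interval * 4).max(MIN_TRANSACTION_RESEND_INTERVAL)
}

/// Hypothetical schedule that allows a send only once `next_send_at` is reached.
struct ResendSchedule {
    next_send_at: Instant,
    interval: Duration,
}

impl ResendSchedule {
    fn new(now: Instant, poll_interval: Duration) -> Self {
        // The first send is allowed immediately.
        Self { next_send_at: now, interval: resend_interval(poll_interval) }
    }

    /// Returns true when a send may go out at `now`, and advances the schedule.
    fn should_send(&mut self, now: Instant) -> bool {
        if now >= self.next_send_at {
            self.next_send_at = now + self.interval;
            true
        } else {
            false
        }
    }
}
```

The confirmation loop can keep polling at its own cadence; only the polls that land on or after `next_send_at` trigger another send.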

Sequence Diagram

sequenceDiagram
  participant EpochManager
  participant SendBatchedTransactions
  participant V1_Send_Semaphore
  participant RPCNode
  EpochManager->>SendBatchedTransactions: request send batch (v1)
  SendBatchedTransactions->>V1_Send_Semaphore: try acquire N permits (deadline-bound)
  V1_Send_Semaphore-->>SendBatchedTransactions: permits granted / timeout / closed
  SendBatchedTransactions->>RPCNode: submit transactions (bounded concurrency)
  RPCNode-->>SendBatchedTransactions: send result / error
  SendBatchedTransactions-->>EpochManager: aggregated chunk results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

ai-review

Suggested reviewers

  • SwenSchaeferjohann
  • ananas-block

Poem

Semaphore guards the V1 parade,
Polling learns to pause, then trade,
Presort waits behind a flag so small,
404s no longer call for thrall,
RPC trimmed down — a leaner hall.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately captures the main objectives: fixing V1 nullify presort retry storms and reducing RPC pressure through several coordinated changes. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 70.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 412b3356de

Comment on lines +2347 to +2348
tokio::time::sleep(poll_interval).await;
poll_interval = (poll_interval * 2).min(POLL_INTERVAL_MAX);

P2: Cap idle V2 polling to the light-slot budget

When a V2 tree is idle early in its eligible light slot, this sleep backs off without considering how much of the light slot remains. I checked the protocol defaults (programs/registry/src/protocol_config/state.rs): local/default slot_length is 10 Solana slots (~4s) and testnet is 60 (~24s), so after several empty polls the forester can sleep 6.4s/10s, wake after forester_slot_details.end_solana_slot, and skip work that arrived during the sleep until a later eligibility window. Cap the sleep/backoff by the remaining light-slot time or keep it well below the slot length so late-arriving V2 queue items can still be processed in the current slot.
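
One way to read this suggestion, as a sketch: keep the 200ms → 10s doubling, but clamp each sleep to the time left in the light slot. `remaining_in_slot` is a hypothetical input that the caller would derive from forester_slot_details.end_solana_slot; that conversion is not shown here.

```rust
use std::time::Duration;

const POLL_INTERVAL_MIN: Duration = Duration::from_millis(200);
const POLL_INTERVAL_MAX: Duration = Duration::from_secs(10);

/// Sleep to use for this idle poll: the current backoff, but never longer
/// than what is left of the light slot.
fn capped_sleep(poll_interval: Duration, remaining_in_slot: Duration) -> Duration {
    poll_interval.min(remaining_in_slot)
}

/// Next backoff value: reset to the minimum when work was found, otherwise
/// double up to the maximum.
fn next_poll_interval(poll_interval: Duration, found_work: bool) -> Duration {
    if found_work {
        POLL_INTERVAL_MIN
    } else {
        (poll_interval * 2).min(POLL_INTERVAL_MAX)
    }
}
```

Starting from 200ms, idle polls step through 400ms, 800ms, 1.6s, 3.2s, 6.4s and then 10s. With a ~4s light slot the uncapped 6.4s sleep already overshoots, whereas `capped_sleep` wakes no later than the slot end.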

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 22decbc8-303a-44d0-aa19-6feb45d77c75

📥 Commits

Reviewing files that changed from the base of the PR and between 5d58e11 and 412b335.

📒 Files selected for processing (4)
  • forester/src/cli.rs
  • forester/src/config.rs
  • forester/src/epoch_manager.rs
  • sdk-libs/client/src/indexer/photon_indexer.rs

Comment on lines +2268 to +2275
// Adaptive queue polling: start responsive, then back off (capped) while the
// queue has nothing ready to process, and reset to the minimum as soon as work
// is found. A fixed 200ms poll made idle V2 trees re-fetch the queue ~5x/sec for
// the whole eligible window, which is the dominant source of wasted RPC/indexer
// load (and can exhaust a shared RPC credit budget).
const POLL_INTERVAL_MIN: Duration = Duration::from_millis(200);
const POLL_INTERVAL_MAX: Duration = Duration::from_secs(10);
let mut poll_interval = POLL_INTERVAL_MIN;

Contributor

🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

Make V2 adaptive polling bounds configurable via CLI/env.

Lines 2273 and 2274 hardcode POLL_INTERVAL_MIN and POLL_INTERVAL_MAX. These are operational tuning knobs and should be externally configurable so deployments can trade RPC pressure against responsiveness without code changes.

As per coding guidelines, “Use environment variables for configuration instead of hardcoded values” and “Support both CLI arguments and environment variables for all configuration options.”

Also applies to: 2346-2349, 2362-2363

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@forester/src/epoch_manager.rs` around lines 2268 - 2275, POLL_INTERVAL_MIN
and POLL_INTERVAL_MAX are hardcoded; make them configurable via CLI and
environment variables and use those values wherever the constants are referenced
(e.g., POLL_INTERVAL_MIN, POLL_INTERVAL_MAX, and the local poll_interval
variable in epoch_manager.rs and the other occurrences you noted). Add config
fields (e.g., v2_poll_interval_min_ms and v2_poll_interval_max_ms) to the
application Config/opts (supporting both CLI flags and env vars), parse them
into Durations (defaulting to Duration::from_millis(200) and
Duration::from_secs(10) when unspecified), and replace the constants/inline
literals with values from that Config (ensure validation: min <= max and
reasonable bounds). Ensure all uses (including the other instances at the
referenced locations) read from the same config struct so deployments can tune
RPC pressure without code changes.
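
A minimal sketch of the validation step this prompt asks for, assuming the two values arrive as optional millisecond counts (for example from the suggested v2_poll_interval_min_ms and v2_poll_interval_max_ms options). `PollBounds` and `poll_bounds` are hypothetical names, and the CLI/env wiring is left out.

```rust
use std::time::Duration;

/// Hypothetical holder for the validated polling bounds.
#[derive(Debug, PartialEq)]
struct PollBounds {
    min: Duration,
    max: Duration,
}

/// Applies the current defaults (200ms, 10s) when a value is unset and rejects
/// a zero minimum or a minimum above the maximum.
fn poll_bounds(min_ms: Option<u64>, max_ms: Option<u64>) -> Result<PollBounds, String> {
    let min = Duration::from_millis(min_ms.unwrap_or(200));
    let max = Duration::from_millis(max_ms.unwrap_or(10_000));
    if min.is_zero() {
        return Err("poll interval minimum must be greater than zero".to_string());
    }
    if min > max {
        return Err(format!("poll interval minimum {:?} exceeds maximum {:?}", min, max));
    }
    Ok(PollBounds { min, max })
}
```

Rejecting a zero minimum is an extra guard chosen for this sketch: doubling from zero would never back off.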

Source: Coding guidelines

sergeytimoshin and others added 3 commits June 10, 2026 11:00
The api_server unconditionally spawned run_compressible_provider, which with no
in-memory trackers falls back to a full paginated getProgramAccounts scan every
30s. Even with compressible tracking disabled (the default), that heavy scan ties
up RPC pool connections and triggers "Failed to get RPC connection" pool failures
on mainnet.

Only spawn the provider when there is a cheap source — in-memory trackers
(compression enabled) or upstream forester APIs to aggregate. Otherwise skip it;
the dashboard simply reports no compressible counts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Contributor

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
forester/src/processor/v1/send_transaction.rs (1)

216-233: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reserve send permits by transaction count, not work-item count.

permits_to_reserve is derived from work_chunk.len(), but build_signed_transaction_batch(..., build_config) can collapse multiple work items into a single transaction when build_config.batch_size / legacy_ixs_per_tx is greater than 1. That over-reserves the process-wide semaphore, strands unused permits, and under contention can make a chunk return Ok(()) with 0 sends even though there was enough capacity for the chunk’s actual transaction count. Base the reservation on the number of transactions this chunk can emit (or at least div_ceil(work_chunk.len(), batch_size)) so the limiter matches real send concurrency.

Suggested direction
-                let permits_to_reserve = work_chunk.len().min(effective_max_concurrent_sends).max(1);
+                let tx_batch_size = usize::try_from(build_config.batch_size)
+                    .unwrap_or(usize::MAX)
+                    .max(1);
+                let permits_to_reserve = work_chunk
+                    .len()
+                    .div_ceil(tx_batch_size)
+                    .min(effective_max_concurrent_sends)
+                    .max(1);

Also applies to: 248-257
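
To make the difference concrete, here is the suggested formula as a standalone function with a few worked values. `permits_to_reserve` as a free function and its parameter names are illustrative; the real code computes this inline from work_chunk and build_config.

```rust
/// Permits to reserve for a chunk: one per transaction the chunk can emit,
/// bounded by the concurrency limit and never less than one.
fn permits_to_reserve(
    work_items: usize,
    tx_batch_size: usize,
    max_concurrent_sends: usize,
) -> usize {
    // Guard against a zero batch size before dividing.
    let batch = tx_batch_size.max(1);
    work_items
        .div_ceil(batch)
        .min(max_concurrent_sends)
        .max(1)
}
```

For 7 work items with a batch size of 4 and a limit of 50, this reserves 2 permits, where the work-item-based count would reserve 7 and leave 5 of them unused for the duration of the chunk.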

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@forester/src/processor/v1/send_transaction.rs` around lines 216 - 233, The
current code computes permits_to_reserve from work_chunk.len(), which
over-reserves when build_signed_transaction_batch (and its
build_config.batch_size / legacy_ixs_per_tx) can combine multiple work items
into fewer transactions; change the reservation logic used before calling
acquire_send_permits (and the identical logic in the other chunk where permits
are reserved) to compute the number of transactions instead, e.g.
div_ceil(work_chunk.len(), effective_batch_size) where effective_batch_size =
build_config.batch_size * legacy_ixs_per_tx (or the exact contraction logic used
by build_signed_transaction_batch), then
.min(effective_max_concurrent_sends).max(1) to produce permits_to_reserve so the
semaphore matches actual transaction concurrency.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0c062dbc-9528-4a41-9dde-0f98638a1d58

📥 Commits

Reviewing files that changed from the base of the PR and between 50271a5 and 2627db3.

📒 Files selected for processing (1)
  • forester/src/processor/v1/send_transaction.rs
