forester: fix V1 nullify presort retry storm and reduce RPC pressure by sergeytimoshin · Pull Request #2382 · Lightprotocol/light-protocol

sergeytimoshin · 2026-06-08T10:30:38Z

Problem

V1 state/address nullification silently stopped sending transactions: queues had pending items, the forester fetched them (work_items=7), entered the send loop, then completed in microseconds with 0 sent — no proofs fetched, no tx built, no error. Queues never drained, even with healthy RPC, a caught-up indexer, and no OOM.

Root cause

The V1 send path runs an optional get_queue_leaf_indices presort step (gated only on enable_v1_multi_nullify, so effectively always on). On an indexer that doesn't implement the endpoint it returns 404 "Method not found", which the Photon client's retry wrapper treated as a retryable ApiError and retried 10× with exponential backoff:

400 + 800 + 1600 + 3200 + 6400 + 8000×4 = 44,400 ms ≈ 44s

timeout_deadline is anchored at function_start (a slot-derived budget of ~20s), so by the time the ~44s presort returned, the send chunk aborted at the Instant::now() >= timeout_deadline guard before doing any work. Confirmed with instrumentation: chunk entry showed expired=true, overdue=~22-35s.
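
The arithmetic above can be checked with a short sketch. It assumes the retry wrapper starts at 400ms, doubles after each wait, caps at 8s, and sleeps nine times between ten attempts. These parameters are inferred from the sum in this description, not taken from the Photon client source, and `total_backoff` and `chunk_expired` are hypothetical names used only for illustration.

```rust
use std::time::Duration;

/// Total time spent sleeping across `sleeps` backoff waits, where the delay
/// starts at `initial`, doubles after every wait, and is clamped to `cap`.
/// (Assumed shape of the retry wrapper; not the actual Photon client code.)
fn total_backoff(initial: Duration, cap: Duration, sleeps: u32) -> Duration {
    let mut delay = initial.min(cap);
    let mut total = Duration::ZERO;
    for _ in 0..sleeps {
        total += delay;
        delay = (delay * 2).min(cap);
    }
    total
}

/// Mirrors the deadline guard described above: the deadline is anchored at
/// function start, so a chunk is skipped once the elapsed time reaches the budget.
fn chunk_expired(elapsed_since_start: Duration, budget: Duration) -> bool {
    elapsed_since_start >= budget
}
```

With those assumed parameters, `total_backoff(Duration::from_millis(400), Duration::from_secs(8), 9)` gives 44.4s, so a deadline anchored ~20s after function start is already about 24s overdue when the presort returns. That is consistent with the overdue range seen in the instrumentation.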

Changes

- photon_indexer: an ApiError whose message contains "method not found" / "status 404" is now non-retryable: fail fast instead of ~44s of backoff. Affects any caller hitting a missing endpoint.
- --enable-v1-presort (env ENABLE_V1_PRESORT), default off: the get_queue_leaf_indices presort path no longer runs unless explicitly enabled (best-effort optimization; requires an indexer that implements it).
- Adaptive V2 poll: replaced the fixed 200ms queue poll with backoff 200ms → 10s (reset to 200ms when work is found). Idle V2 trees were re-fetching the queue ~5×/sec for the entire eligible window, which was the dominant source of wasted RPC/indexer load. Throughput while draining is unaffected (the loop still re-runs immediately when items are found).
- Default --rpc-pool-size 100 → 32.
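
As a rough illustration of the first change, the classification could look like the sketch below. The enum is a hypothetical stand-in that models only the `ApiError` variant named in this PR plus a catch-all. The case-insensitive match and the treatment of non-API errors as retryable are assumptions, not a copy of the code in photon_indexer.rs.

```rust
/// Hypothetical stand-in for the indexer error type; only `ApiError` is
/// named in the PR, `Other` exists purely for illustration.
#[derive(Debug)]
enum IndexerError {
    ApiError(String),
    Other(String),
}

/// Returns false for API errors that indicate a missing endpoint, so the
/// caller can fail fast instead of backing off. Lowercasing the message
/// before matching is an assumption made for this sketch.
fn is_retryable(err: &IndexerError) -> bool {
    match err {
        IndexerError::ApiError(msg) => {
            let msg = msg.to_lowercase();
            !(msg.contains("method not found") || msg.contains("status 404"))
        }
        // Assumed: everything else keeps the existing retry behavior.
        IndexerError::Other(_) => true,
    }
}
```

A retry wrapper would consult `is_retryable` before sleeping, returning the error immediately when it yields false.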

Verification

Built from this branch and run on devnet + mainnet foresters: V1 nullify resumed (tx sent … type=StateV1MultiNullify), queues drained 7 → 0, transactions finalized on-chain (err=None), and 0 presort calls with the flag off.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- CLI flag to enable best-effort pre-sorting of V1 work items.
- Config option exposing the new pre-sort toggle.
Improvements
- Reduced default RPC pool size from 100 to 32.
- Global concurrency limiting for V1 transaction sends with bounded batching.
- Rate-limited transaction resends to reduce retry load.
- Adaptive queue polling with exponential backoff when idle.
- Indexer error handling skips retries for certain non-retryable responses.
- Background compressible provider spawned only when a cheap source exists.

The V1 send path's optional get_queue_leaf_indices "presort" was effectively always on (gated only by enable_v1_multi_nullify). On an indexer that does not implement the endpoint it returns 404, which the Photon client treated as a retryable ApiError and backed off ~44s (10 retries) per send. That consumed the entire per-slot send budget, so the chunk aborted at the deadline guard before building/sending any transaction; V1 queues never drained.

Changes:
- photon_indexer: treat an ApiError whose message contains "method not found" or "status 404" as non-retryable (fail fast instead of ~44s backoff).
- forester: add `--enable-v1-presort` (env ENABLE_V1_PRESORT), default off, and gate the get_queue_leaf_indices presort path on it.
- forester: replace the fixed 200ms V2 queue poll with adaptive backoff (200ms -> 10s cap, reset to 200ms when work is found) to stop idle V2 trees from re-fetching the queue ~5x/sec for the whole eligible window.
- forester: lower default --rpc-pool-size 100 -> 32.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-08T10:30:51Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3939782a-e251-43b2-9330-a83444207c74

📥 Commits

Reviewing files that changed from the base of the PR and between 2627db3 and aa948ea.

📒 Files selected for processing (5)

forester/tests/e2e_test.rs
forester/tests/legacy/priority_fee_test.rs
forester/tests/legacy/test_utils.rs
forester/tests/priority_fee_test.rs
forester/tests/test_utils.rs

📝 Walkthrough

Adds a CLI/config toggle for optional V1 presort and lowers the RPC pool default to 32; introduces a process-wide semaphore to limit V1 send concurrency, adaptive backoff for V2 polling, conditional compressible-provider spawn, smart-transaction resend rate-limiting, and indexer 404/method-not-found non-retry classification.

Changes

Forester runtime and send-path updates

| Layer / File(s) | Summary |
| --- | --- |
| V1 Presort CLI and Configuration Wiring (`forester/src/cli.rs`, `forester/src/config.rs`) | Reduces RPC pool default from 100→32, adds `--enable-v1-presort`/`ENABLE_V1_PRESORT`, and wires `ForesterConfig.enable_v1_presort` into `new_for_start`, `new_for_status` (false), and Clone. |
| Photon indexer: non-retryable 404/method-not-found (`sdk-libs/client/src/indexer/photon_indexer.rs`) | Classifies `IndexerError::ApiError` messages containing “method not found” or “status 404” as non-retryable and logs accordingly; other API errors continue to be retried. |
| Conditional compressible provider spawn (`forester/src/api_server.rs`) | `run_compressible_provider` is spawned only if in-memory trackers exist or `config.forester_api_urls` is non-empty. |
| EpochManager: semaphore, presort toggle, adaptive polling (`forester/src/epoch_manager.rs`) | Adds process-wide `v1_send_permits: Arc<Semaphore>` initialized from `transaction_config.max_concurrent_sends`, passes it into V1 batching, sets `enable_presort` from config, and replaces fixed 200ms idle sleep with exponential-backoff polling (200ms doubling to 10s, reset on work). |
| V1 send: chunking, permit acquisition, bounded concurrency (`forester/src/processor/v1/send_transaction.rs`) | Introduces shared `send_permits` semaphore, computes effective batch sizes, pre-acquires per-chunk permits with deadline-bound acquisition, limits concurrent chunks, uses `buffer_unordered` for results, and maps permit timeouts/closed-limiter to explicit send result variants. |
| Smart-transaction resend rate-limiting (`forester/src/smart_transaction.rs`) | Adds `MIN_TRANSACTION_RESEND_INTERVAL`, computes resend interval from poll interval (×4 with minimum), and only invokes send when scheduled `next_send_at` is reached; advances schedule after each attempt. |
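
The resend rate-limiting row can be pictured with the following sketch. It follows only what the summary states (interval = poll interval ×4 with a minimum, send only once `next_send_at` is reached, advance after each attempt). The minimum is passed in as a parameter because the value of `MIN_TRANSACTION_RESEND_INTERVAL` is not given here, and `ResendSchedule` is a hypothetical type, not the one in smart_transaction.rs.

```rust
use std::time::{Duration, Instant};

/// Resend interval derived from the confirmation poll interval: four times
/// the poll interval, but never below `min_interval`.
fn resend_interval(poll_interval: Duration, min_interval: Duration) -> Duration {
    (poll_interval * 4).max(min_interval)
}

/// Hypothetical schedule tracking when the next resend is allowed.
struct ResendSchedule {
    interval: Duration,
    next_send_at: Instant,
}

impl ResendSchedule {
    /// The first send is allowed immediately at `now`.
    fn new(now: Instant, interval: Duration) -> Self {
        Self { interval, next_send_at: now }
    }

    /// Returns true when a send should be attempted at `now`, and advances
    /// the schedule so the following attempt waits a full interval.
    fn should_send(&mut self, now: Instant) -> bool {
        if now < self.next_send_at {
            return false;
        }
        self.next_send_at = now + self.interval;
        true
    }
}
```

In a confirmation loop, each poll tick would call `should_send(Instant::now())` and skip the send when it returns false, so the transaction is polled more often than it is resent.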

Sequence Diagram

sequenceDiagram
  participant EpochManager
  participant SendBatchedTransactions
  participant V1_Send_Semaphore
  participant RPCNode
  EpochManager->>SendBatchedTransactions: request send batch (v1)
  SendBatchedTransactions->>V1_Send_Semaphore: try acquire N permits (deadline-bound)
  V1_Send_Semaphore-->>SendBatchedTransactions: permits granted / timeout / closed
  SendBatchedTransactions->>RPCNode: submit transactions (bounded concurrency)
  RPCNode-->>SendBatchedTransactions: send result / error
  SendBatchedTransactions-->>EpochManager: aggregated chunk results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Lightprotocol/light-protocol#2163: Also modifies forester/src/api_server.rs startup/background task behavior.

Suggested labels

ai-review

Suggested reviewers

SwenSchaeferjohann
ananas-block

Poem

Semaphore guards the V1 parade,
Polling learns to pause, then trade,
Presort waits behind a flag so small,
404s no longer call for thrall,
RPC trimmed down — a leaner hall.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately captures the main objectives: fixing V1 nullify presort retry storms and reducing RPC pressure through several coordinated changes.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 70.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 412b3356de

chatgpt-codex-connector · 2026-06-08T10:33:14Z

+                        tokio::time::sleep(poll_interval).await;
+                        poll_interval = (poll_interval * 2).min(POLL_INTERVAL_MAX);


Cap idle V2 polling to the light-slot budget

When a V2 tree is idle early in its eligible light slot, this sleep backs off without considering how much of the light slot remains. I checked the protocol defaults (programs/registry/src/protocol_config/state.rs): local/default slot_length is 10 Solana slots (~4s) and testnet is 60 (~24s), so after several empty polls the forester can sleep 6.4s/10s, wake after forester_slot_details.end_solana_slot, and skip work that arrived during the sleep until a later eligibility window. Cap the sleep/backoff by the remaining light-slot time or keep it well below the slot length so late-arriving V2 queue items can still be processed in the current slot.
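
One way to follow this suggestion is sketched below: keep the doubling backoff from the diff, but clamp each individual sleep to the time left in the current light slot. The `remaining_in_slot` parameter and both function names are hypothetical, and nothing in this thread shows that the PR adopted this approach.

```rust
use std::time::Duration;

const POLL_INTERVAL_MIN: Duration = Duration::from_millis(200);
const POLL_INTERVAL_MAX: Duration = Duration::from_secs(10);

/// How long to sleep before the next poll: the current backoff interval,
/// but never longer than the time left in the light slot (assumed to be
/// computed by the caller from the slot's end).
fn next_sleep(poll_interval: Duration, remaining_in_slot: Duration) -> Duration {
    poll_interval.min(remaining_in_slot)
}

/// Backoff update matching the diff above: reset to the minimum when work
/// was found, otherwise double up to the cap.
fn next_interval(poll_interval: Duration, found_work: bool) -> Duration {
    if found_work {
        POLL_INTERVAL_MIN
    } else {
        (poll_interval * 2).min(POLL_INTERVAL_MAX)
    }
}
```

With a backoff of 6.4s and 4s left in the slot, `next_sleep` returns 4s, so the forester wakes at the slot boundary rather than sleeping past it.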

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@forester/src/epoch_manager.rs`:
- Around line 2268-2275: POLL_INTERVAL_MIN and POLL_INTERVAL_MAX are hardcoded;
make them configurable via CLI and environment variables and use those values
wherever the constants are referenced (e.g., POLL_INTERVAL_MIN,
POLL_INTERVAL_MAX, and the local poll_interval variable in epoch_manager.rs and
the other occurrences you noted). Add config fields (e.g.,
v2_poll_interval_min_ms and v2_poll_interval_max_ms) to the application
Config/opts (supporting both CLI flags and env vars), parse them into Durations
(defaulting to Duration::from_millis(200) and Duration::from_secs(10) when
unspecified), and replace the constants/inline literals with values from that
Config (ensure validation: min <= max and reasonable bounds). Ensure all uses
(including the other instances at the referenced locations) read from the same
config struct so deployments can tune RPC pressure without code changes.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 22decbc8-303a-44d0-aa19-6feb45d77c75

📥 Commits

Reviewing files that changed from the base of the PR and between 5d58e11 and 412b335.

📒 Files selected for processing (4)

forester/src/cli.rs
forester/src/config.rs
forester/src/epoch_manager.rs
sdk-libs/client/src/indexer/photon_indexer.rs

coderabbitai · 2026-06-08T10:35:46Z

+        // Adaptive queue polling: start responsive, then back off (capped) while the
+        // queue has nothing ready to process, and reset to the minimum as soon as work
+        // is found. A fixed 200ms poll made idle V2 trees re-fetch the queue ~5x/sec for
+        // the whole eligible window, which is the dominant source of wasted RPC/indexer
+        // load (and can exhaust a shared RPC credit budget).
+        const POLL_INTERVAL_MIN: Duration = Duration::from_millis(200);
+        const POLL_INTERVAL_MAX: Duration = Duration::from_secs(10);
+        let mut poll_interval = POLL_INTERVAL_MIN;


🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

Make V2 adaptive polling bounds configurable via CLI/env.

Line 2273 and Line 2274 hardcode POLL_INTERVAL_MIN/POLL_INTERVAL_MAX. These are operational tuning knobs and should be externally configurable so deployments can tune RPC pressure vs responsiveness without code changes.

As per coding guidelines, “Use environment variables for configuration instead of hardcoded values” and “Support both CLI arguments and environment variables for all configuration options.”

Also applies to: 2346-2349, 2362-2363

Source: Coding guidelines

The api_server unconditionally spawned run_compressible_provider, which with no in-memory trackers falls back to a full paginated getProgramAccounts scan every 30s. Even with compressible tracking disabled (the default), that heavy scan ties up RPC pool connections and triggers "Failed to get RPC connection" pool failures on mainnet.

Only spawn the provider when there is a cheap source: in-memory trackers (compression enabled) or upstream forester APIs to aggregate. Otherwise skip it; the dashboard simply reports no compressible counts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

forester/src/processor/v1/send_transaction.rs (1)
216-233: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reserve send permits by transaction count, not work-item count.

permits_to_reserve is derived from work_chunk.len(), but build_signed_transaction_batch(..., build_config) can collapse multiple work items into a single transaction when build_config.batch_size / legacy_ixs_per_tx is greater than 1. That over-reserves the process-wide semaphore, strands unused permits, and under contention can make a chunk return Ok(()) with 0 sends even though there was enough capacity for the chunk’s actual transaction count. Base the reservation on the number of transactions this chunk can emit (or at least div_ceil(work_chunk.len(), batch_size)) so the limiter matches real send concurrency.
Suggested direction

-                let permits_to_reserve = work_chunk.len().min(effective_max_concurrent_sends).max(1);
+                let tx_batch_size = usize::try_from(build_config.batch_size)
+                    .unwrap_or(usize::MAX)
+                    .max(1);
+                let permits_to_reserve = work_chunk
+                    .len()
+                    .div_ceil(tx_batch_size)
+                    .min(effective_max_concurrent_sends)
+                    .max(1);

Also applies to: 248-257

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@forester/src/processor/v1/send_transaction.rs` around lines 216 - 233, The
current code computes permits_to_reserve from work_chunk.len(), which
over-reserves when build_signed_transaction_batch (and its
build_config.batch_size / legacy_ixs_per_tx) can combine multiple work items
into fewer transactions; change the reservation logic used before calling
acquire_send_permits (and the identical logic in the other chunk where permits
are reserved) to compute the number of transactions instead, e.g.
div_ceil(work_chunk.len(), effective_batch_size) where effective_batch_size =
build_config.batch_size * legacy_ixs_per_tx (or the exact contraction logic used
by build_signed_transaction_batch), then
.min(effective_max_concurrent_sends).max(1) to produce permits_to_reserve so the
semaphore matches actual transaction concurrency.
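
To make the difference concrete, here is a small standalone sketch comparing the two reservation formulas. The function names and the example numbers are hypothetical; only the expressions themselves come from the suggested diff.

```rust
/// Reservation as in the original line: one permit per work item.
fn permits_by_work_items(work_items: usize, max_concurrent_sends: usize) -> usize {
    work_items.min(max_concurrent_sends).max(1)
}

/// Reservation as in the suggested direction: one permit per transaction,
/// where each transaction carries up to `tx_batch_size` work items.
fn permits_by_transactions(
    work_items: usize,
    tx_batch_size: usize,
    max_concurrent_sends: usize,
) -> usize {
    work_items
        .div_ceil(tx_batch_size.max(1)) // guard against a zero batch size
        .min(max_concurrent_sends)
        .max(1)
}
```

For example, with 7 work items, a batch size of 3, and a concurrency limit of at least 7, the first formula reserves 7 permits while the second reserves 3, which matches the three transactions the chunk would actually send.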

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0c062dbc-9528-4a41-9dde-0f98638a1d58

📥 Commits

Reviewing files that changed from the base of the PR and between 50271a5 and 2627db3.

📒 Files selected for processing (1)

forester/src/processor/v1/send_transaction.rs

chatgpt-codex-connector Bot reviewed Jun 8, 2026

coderabbitai Bot reviewed Jun 8, 2026

sergeytimoshin and others added 3 commits June 10, 2026 11:00

refetch blockhash for v1 chunk

50271a5

Fix atomic V1 send permit reservation

2627db3

coderabbitai Bot reviewed Jun 10, 2026

fix tests

aa948ea

sergeytimoshin wants to merge 5 commits into main from sergey/forester-rpc-efficiency
