fix(svm): resolve slot timestamps via getBlocks instead of per-slot walk by droplet-rl · Pull Request #1447 · across-protocol/sdk

droplet-rl · 2026-05-27T09:41:11Z

Summary

SVMSpokePoolClient.getTimestampForBlock() and getNearestSlotTime() recover from skipped slots by decrementing _slot one at a time and re-calling getBlockTime. When a target slot lands inside a real skipped-slot run on Solana mainnet, the cost compounds — every BundleDataClient.getBundleBlockTimestamps lookup × NODE_QUORUM providers × propose/validate/execute pass.
In a production dataworker run on 2026-05-27, the bundle boundary fell inside slots 422461540–422461547 (a real ~8-slot skip run on mainnet). The SVM executor primary kept calling getBlockTime for 57+ minutes against the same 8 slots before the serverless-orchestration hub's rejectSpokeDelay fired a retry that then raced the primary.
Replace the per-slot walk with findNearestProducedSlot(provider, targetSlot): a single getBlocks(lower, upper) call asks the RPC which slots in a 32-slot window produced a block, then iterates the window backwards in fixed steps if the initial range is empty.
findNearestProducedSlot is exported from arch/svm/SpokeUtils.ts and used by both SVMSpokePoolClient.getTimestampForBlock and getNearestSlotTime.
Also memoise per-block timestamp lookups inside SVMSpokePoolClient: bundle data loading hits the same boundary slots repeatedly across propose/validate/execute passes; storing the Promise (not the resolved value) also dedupes concurrent lookups.

Why this is safe

getBlocks is a standard Solana JSON-RPC method (GetBlocksApi in @solana/kit; in the spec since v1.7 / mid-2021 when getConfirmedBlocks was renamed). Verified live against api.mainnet-beta.solana.com and shape-checked against the kit type surface.
Benchmarked latency on mainnet-beta is flat ~75 ms up to ~1024 slot windows, +5–10 ms per doubling past that. A 32-slot window is one round-trip ~75 ms, vs the current path that issues 8+ getBlockTime RPCs × 2 providers per single resolution.
Iteration is bounded: maxIterations × window (default 32 × 32 = 1024 slots backward). If we still find nothing, the function returns undefined and the caller throws the same Unable to resolve time… error the old code did.
Function signatures of getTimestampForBlock and getNearestSlotTime are unchanged; this is a pure implementation swap.

Test plan

yarn lint-check clean on touched files.
yarn build (tsc) clean.
New unit test test/Solana.findNearestProducedSlot.ts covers: produced slot in initial window, iterate backwards when initial window is empty, return undefined when no produced slot inside the iteration bound, stop descending past slot 0.
CI to run the full suite under yarn test.

Follow-ups (separate PRs)

The relayer-side companion fixes for the SVM duplicate-execution race that this stall triggered are tracked separately in across-protocol/relayer.

🤖 Generated with Claude Code

SVMSpokePoolClient.getTimestampForBlock() and getNearestSlotTime() both walked backwards one slot at a time via getBlockTime() to recover from skipped slots. When a bundle boundary lands inside a skipped-slot run (observed on mainnet at slots 422461540-422461547), every BundleDataClient lookup for that boundary fans out into N getBlockTime() RPCs per call times NODE_QUORUM providers, repeated across every propose/validate/execute pass. In a production dataworker run on 2026-05-27 this kept the SVM executor primary alive for 57+ minutes making nothing but getBlockTime calls against the same 8-slot range, ultimately tripping the serverless hub's rejectSpokeDelay and triggering a retry that raced the primary. Replace the per-slot walk with findNearestProducedSlot(), which issues a single getBlocks() call to ask the RPC which slots in a 32-slot window produced a block, then iterates the window backwards in fixed steps if nothing is produced in the initial range. getBlocks() latency is flat (~75 ms) up to ~1024 slots on every provider we benchmarked, so the recovery cost is bounded regardless of skip-run length. Also memoise per-block timestamp lookups inside SVMSpokePoolClient so repeated boundary-slot queries across a single dataworker run resolve once. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 05a4e05372

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Attach a .catch() on the cached promise so a transient RPC failure in findNearestProducedSlot/getBlocks (or any other rejection from the resolver) clears the cache entry instead of poisoning getTimestampForBlock for the lifetime of the client. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

droplet-rl · 2026-05-27T09:54:09Z

Addressed both review comments in 73ae3a6:

pxrl (SpokeUtils.ts:248, "is it guaranteed that any elements in the response are actually blocks?") — replied inline. By the Solana RPC contract for getBlocks the response is a list of produced-block slots in [lower, upper] (ascending), typed as Slot[] in @solana/kit, so after the blocks.length > 0 gate blocks[blocks.length - 1] is guaranteed to be a produced slot. No code change needed.
codex (SVMSpokePoolClient.ts:205, P2, evict failed timestamp lookups) — fixed. Attached cached.catch(() => this.timestampForBlockCache.delete(slot)) when the promise is first stored, so any rejection (RPC error from findNearestProducedSlot/getBlocks, undefined producedSlot, undefined block time) evicts the entry. Dropped the now-redundant explicit delete() from _resolveTimestampForBlock's undefined-timestamp branch — the catch handles it.

yarn build:types and yarn hardhat test test/Solana.findNearestProducedSlot.ts both pass. Commit pushed via checkpoint.

- Trim verbose docstrings on findNearestProducedSlot and its defaults. - Loop uses ++i and `break` (over `return undefined`) so the function has a single trailing exit. - Invert the timestamp check in _resolveTimestampForBlock for a positive early return. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

droplet-rl · 2026-05-27T10:00:24Z

Addressed all five inline comments from this review in c612662:

if (lower === 0n) { break; } — applied.
for (let i = 0; i < maxIterations; ++i) — applied.
"Comments very verbose" — collapsed the block comments on findNearestProducedSlot and its defaults to one line each, and tightened the new .catch() comment in SVMSpokePoolClient.
Test path scoping — replied; there's no test/svm/ today, existing SVM tests use the Solana.* filename prefix at test/ root (Solana.setup.ts, Solana.EventData.test.ts, etc.) and the new file follows that convention. Offered to move them all as a follow-up if the team wants to standardise on subdirectories.
Invert isDefined(timestamp) check — applied (positive early return, throw falls through). The explicit timestampForBlockCache.delete(slot) from the suggestion was based on the pre-73ae3a6 code; the eviction is now handled by cached.catch(...) attached in getTimestampForBlock, so I left the redundant delete out and called that out in the reply.

yarn build:types and the findNearestProducedSlot tests still pass. Pushed via checkpoint.

The previous commit unconditionally issued getBlocks() + getBlockTime() for every slot lookup, regressing the common case (target slot has a block) from 1 RPC to 2. Restore the 1-RPC happy path in both SVMSpokePoolClient.getTimestampForBlock() and getNearestSlotTime() by probing getTimestampForSlot() for the target slot first, only falling through to findNearestProducedSlot() when the target itself was skipped. Also evict the per-block timestamp cache on rejection so transient RPC errors don't poison the cache for the rest of the run. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

droplet-rl · 2026-05-27T10:09:20Z

⏳ Stopped 1 background subagent(s) at the 30m drain budget. Their results were not gathered:
• yarn install for sdk

✅ 3 background subagent(s) finished after the main reply:
• Run typecheck in sdk dir
• Run findNearestProducedSlot tests
• Background command "Reinstall" completed (exit code 0)

If you need the stopped results, ask again — the agent can re-run those subagents.

pxrl · 2026-05-27T10:09:57Z

@codex review

droplet-rl · 2026-05-27T10:10:34Z

No action needed from me — that comment is a @codex review invocation aimed at the chatgpt-codex-connector bot, which will post its own review pass. I'll wait for any actionable feedback from codex (or a follow-up from you) before making further changes.

chatgpt-codex-connector · 2026-05-27T10:13:47Z

Codex Review: Didn't find any major issues. Bravo.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

droplet-rl · 2026-05-27T10:14:23Z

Codex came back clean ("Didn't find any major issues") on the current HEAD (204db04). No further code changes warranted from this trigger. PR is ready for human review.

droplet-rl · 2026-05-27T10:41:02Z

Approval received from pxrl on review 4371702294 — no inline comments, empty body. PR is approved at HEAD 204db04. No code changes needed; ready for merge whenever you'd like.

droplet-rl · 2026-05-27T10:42:24Z

Question, not a change request — replied inline on the PR. Short version: JS is single-threaded so the get/set pair runs atomically inside one microtask; concurrent callers dedupe onto the same in-flight Promise. The only nuance is a small window between rejection settling and the attached .catch() evicting the entry, during which another caller can grab the rejected Promise — that's intentional (caller still observes the correct error, next post-eviction call triggers a fresh resolver). No additional code change.