evm-rpc: retry transient upstream-unavailable (-32503) errors instead of crashing the dumper by elina-chertova · Pull Request #501 · subsquid/squid-sdk

elina-chertova · 2026-06-19T21:45:32Z

Cause (proven)

The dump-hyperliquid-testnet-0 dumper (namespace evm-archive, image subsquid/evm-dump:cf85ec9c) was crash-looping: 96 restarts in ~9h. Crash logs (kubectl logs --previous) show the process exiting on an unhandled RPC error from the uniblock aggregator endpoint (api.uniblock.dev/uni/v1/json-rpc?chainId=998):

RpcError: Errors from the following providers prevented the request from being fulfilled: dRPC, Alchemy.
  code: -32503
  data: { DRPC: { error: { code: 10, message: "User balance exceeded" } },
          Alchemy: { error: { code: -32001, message: "Unable to complete request at this time." } } }
  rpcMethod: eth_getBlockByNumber
  at validateError (evm/evm-rpc/lib/rpc.js)
  at EvmRpcClient.receiveResult (util/rpc-client/lib/client.js)

The error is intermittent: between crashes the dumper progresses normally (ingesting at ~13 blocks/sec at the time of writing), so the aggregator mostly succeeds and only occasionally returns -32503, when all of its upstream providers momentarily fail at once.

EvmRpcClient.isConnectionError (evm/evm-rpc/src/rpc-client.ts) did not classify -32503 as retryable: it is neither a rate-limit code nor one of -32000, -32603, or "internal error". The error therefore escaped the retry machinery, propagated out of getBlocks → eth_getBlockByNumber, and crashed the process. This answers the maintainer's earlier question, "why is it causing process crash if it's an intermittent error": the defect is that the error is fatal, not that the upstream response is bad.

Fix (tested)

Recognise the aggregator's transient -32503 "service unavailable" error (...prevented the request from being fulfilled) as a connection error, alongside rate-limit errors. The EVM dumper already retries connection errors indefinitely with backoff (evm/evm-dump/src/dumper.ts sets retryAttempts: Number.MAX_SAFE_INTEGER), so it now rides over the blip instead of crash-looping, with the same tolerance already applied to rate limits. The same predicate also gates isBatchRetryableError, so both the batch and single-call paths are covered.
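For illustration, the classification described above could look roughly like the sketch below. It is not the SDK's code: RpcErrorLike, isRateLimitError, isUpstreamUnavailableError and isConnectionErrorSketch are hypothetical names, the 429 rate-limit code is an assumed stand-in for the existing check, and requiring both the code and the message fragment is an assumption of the sketch (the actual change may match on either one).

```typescript
// Hypothetical error shape; the real RpcError class in squid-sdk may differ.
interface RpcErrorLike {
    code?: number
    message?: string
}

// Code and message fragment as they appear in the crash log.
const UPSTREAM_UNAVAILABLE_CODE = -32503
const UPSTREAM_UNAVAILABLE_FRAGMENT = 'prevented the request from being fulfilled'

// Assumed stand-in for the existing rate-limit classification.
function isRateLimitError(err: RpcErrorLike): boolean {
    return err.code === 429 || /rate limit/i.test(err.message ?? '')
}

// Transient "all upstream providers failed" error from an aggregating provider.
function isUpstreamUnavailableError(err: RpcErrorLike): boolean {
    return err.code === UPSTREAM_UNAVAILABLE_CODE
        && (err.message ?? '').includes(UPSTREAM_UNAVAILABLE_FRAGMENT)
}

// Errors classified here are retried with backoff instead of rejected permanently.
function isConnectionErrorSketch(err: RpcErrorLike): boolean {
    return isRateLimitError(err) || isUpstreamUnavailableError(err)
}
```

With this shape, the error from the crash log is classified as retryable, while an unrelated error such as -32602 "invalid params" still rejects immediately.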

Mechanism verified by tracing the live crash: validateError → new RpcError({code:-32503}) → client.receiveResult reject → isConnectionError === false → permanent reject → process exit. With this change isConnectionError === true → re-enqueue + backoff.
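The two outcomes in that trace (permanent reject versus re-enqueue with backoff) can be sketched as a generic retry loop. This is a minimal illustration under stated assumptions, not the client's actual queueing logic: callWithRetry and exponentialBackoff are hypothetical helpers, and the exponential schedule is assumed rather than taken from the SDK.

```typescript
// Hypothetical helper: retries `call` while `isRetryable` accepts the error.
async function callWithRetry<T>(
    call: () => Promise<T>,
    isRetryable: (err: unknown) => boolean,
    retryAttempts: number,
    backoffMs: (attempt: number) => number
): Promise<T> {
    for (let attempt = 0; ; attempt++) {
        try {
            return await call()
        } catch (err) {
            // Non-retryable error or exhausted attempts: reject permanently.
            if (!isRetryable(err) || attempt >= retryAttempts) throw err
            // Retryable error: wait, then try the same call again.
            await new Promise<void>(resolve => setTimeout(resolve, backoffMs(attempt)))
        }
    }
}

// Assumed schedule: delay doubles per attempt and is capped at maxMs.
function exponentialBackoff(baseMs: number, maxMs: number): (attempt: number) => number {
    return attempt => Math.min(maxMs, baseMs * 2 ** attempt)
}
```

With retryAttempts set to Number.MAX_SAFE_INTEGER, as the dumper does, a retryable error never reaches the permanent-reject branch in practice, which is why a persistent -32503 shows up as a stall rather than a crash.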

Falsification

If -32503 is persistent (e.g. dRPC stays out of balance and Alchemy stays down), the dumper will retry and stall instead of crashing: progress halts and a writer-stall/no-progress alert fires. That is the intended degradation, but it means this code change does not, on its own, restore data flow when all upstream providers are durably down.
If after deploy the dumper still exits with a non-zero code on -32503 (rather than retrying), the classification is not taking effect.

Operator follow-up (out of scope for this PR)

The trigger is a provider-side degradation: dRPC reports "User balance exceeded" (billing depleted) and Alchemy is intermittently failing behind uniblock for chainId 998. Per policy, a provider top-up/swap is an operator mitigation, not an autonomous PR: top up dRPC or repoint the hyperliquid-testnet upstream to a healthy provider. This PR makes the dumper survive the blip regardless.

… of crashing

Aggregating RPC providers (e.g. uniblock) return a -32503 'service unavailable' error with the message 'Errors from the following providers prevented the request from being fulfilled' when all of their upstream providers momentarily fail. This is a transient availability error (the HTTP 503 analog), but EvmRpcClient.isConnectionError did not recognise it, so a single intermittent occurrence propagated out of eth_getBlockByNumber and crashed the dumper process, producing a restart/crash-loop. Treat it as a connection error so the existing retry+backoff machinery (the EVM dumper retries connection errors with retryAttempts=MAX_SAFE_INTEGER) rides over the blip, the same way rate-limit errors are already tolerated.

tmcgroul · 2026-06-29T16:05:33Z

change files are missing

elina-chertova mentioned this pull request Jun 23, 2026

evm-rpc: retry "response too large" (-32020) instead of crash-looping the dumper #505

Open

tmcgroul mentioned this pull request Jun 29, 2026

rpc-client: retry transient 'no available provider' errors instead of crashing #504

Closed

elina-chertova wants to merge 1 commit into master from alert-fix/67d4hv-evm-rpc-upstream-unavailable
