
evm-rpc: retry transient upstream-unavailable (-32503) errors instead of crashing the dumper #501

Open
elina-chertova wants to merge 1 commit into master from alert-fix/67d4hv-evm-rpc-upstream-unavailable


@elina-chertova

Contributor

Cause (proven)

The dump-hyperliquid-testnet-0 dumper (namespace evm-archive, image subsquid/evm-dump:cf85ec9c) was in a restart/crash-loop — 96 restarts in ~9h. Crash logs (kubectl logs --previous) show the process exiting on an unhandled RPC error from the uniblock aggregator endpoint (api.uniblock.dev/uni/v1/json-rpc?chainId=998):

RpcError: Errors from the following providers prevented the request from being fulfilled: dRPC, Alchemy.
  code: -32503
  data: { DRPC: { error: { code: 10, message: "User balance exceeded" } },
          Alchemy: { error: { code: -32001, message: "Unable to complete request at this time." } } }
  rpcMethod: eth_getBlockByNumber
  at validateError (evm/evm-rpc/lib/rpc.js)
  at EvmRpcClient.receiveResult (util/rpc-client/lib/client.js)

The error is intermittent — between crashes the dumper progresses normally (currently ingesting at ~13 blocks/sec), so the aggregator mostly succeeds and only occasionally returns -32503 when all of its upstream providers momentarily fail at once.
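The `data` field of the logged error carries one entry per failed upstream provider. A minimal sketch for flattening it into log lines, assuming the payload has exactly the shape shown in the crash log above; the `AggregatorError` and `ProviderFailure` types and the `describeProviderFailures` helper are hypothetical names for illustration, not part of the SDK:

```typescript
// Hypothetical types, inferred only from the crash log above.
interface ProviderFailure {
    error: { code: number; message: string }
}

interface AggregatorError {
    code: number
    message: string
    data?: Record<string, ProviderFailure>
}

// Turn the per-provider failures into one readable line each.
function describeProviderFailures(err: AggregatorError): string[] {
    const data = err.data ?? {}
    const lines: string[] = []
    for (const provider of Object.keys(data)) {
        const failure = data[provider]
        if (failure === undefined) continue
        lines.push(`${provider}: ${failure.error.code} ${failure.error.message}`)
    }
    return lines
}

// The values below are copied from the crash log.
const sampleError: AggregatorError = {
    code: -32503,
    message:
        'Errors from the following providers prevented the request from being fulfilled: dRPC, Alchemy.',
    data: {
        DRPC: { error: { code: 10, message: 'User balance exceeded' } },
        Alchemy: { error: { code: -32001, message: 'Unable to complete request at this time.' } },
    },
}

const providerLines = describeProviderFailures(sampleError)
```

Logging these lines next to the top-level code would make it easier to tell a billing problem (dRPC) from a transient failure (Alchemy) when reading the dumper's output.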

EvmRpcClient.isConnectionError (evm/evm-rpc/src/rpc-client.ts) did not classify -32503 as retryable: it isn't a rate-limit code, and it isn't -32000/-32603/"internal error". So the error escaped the retry machinery, propagated out of getBlocks → eth_getBlockByNumber, and crashed the process. This is the bug from the maintainer's earlier note, "why is it causing process crash if it's an intermittent error": the defect is that the error is fatal, not that the upstream returned a bad response.

Fix (tested)

Recognise the aggregator's transient -32503 "service unavailable" (...prevented the request from being fulfilled) as a connection error, alongside rate-limit errors. The EVM dumper already retries connection errors indefinitely with backoff (evm/evm-dump/src/dumper.ts sets retryAttempts: Number.MAX_SAFE_INTEGER), so the dumper now rides over the blip instead of crash-looping — the same tolerance already applied to rate limits. The same predicate also gates isBatchRetryableError, so batch and single-call paths are both covered.
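The classification described above could look roughly like the following. This is a sketch under stated assumptions, not the actual diff: `RpcErrorLike`, `isUpstreamUnavailable` and `isConnectionErrorSketch` are hypothetical names, and matching on either the code or the message text is an assumption about how the check is written.

```typescript
// Hypothetical stand-in for the client's error type; the real RpcError lives in the SDK.
interface RpcErrorLike {
    code?: number
    message?: string
}

// Code and message fragment taken from the crash log.
const UPSTREAM_UNAVAILABLE_CODE = -32503
const UPSTREAM_UNAVAILABLE_TEXT = 'prevented the request from being fulfilled'

// True when the aggregator reports that all of its upstream providers failed.
// Assumption: either the code or the message text is enough to match.
function isUpstreamUnavailable(err: RpcErrorLike): boolean {
    if (err.code === UPSTREAM_UNAVAILABLE_CODE) return true
    const message = (err.message ?? '').toLowerCase()
    return message.indexOf(UPSTREAM_UNAVAILABLE_TEXT) >= 0
}

// Compose with whatever the client already treats as a connection error
// (rate limits and so on); `existingCheck` stands in for that existing logic.
function isConnectionErrorSketch(
    err: RpcErrorLike,
    existingCheck: (e: RpcErrorLike) => boolean
): boolean {
    return existingCheck(err) || isUpstreamUnavailable(err)
}
```

Keeping the new check as a separate predicate that is OR-ed with the existing one leaves the rate-limit handling untouched, and because the same predicate gates the batch path, one change covers both call styles.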

Mechanism verified by tracing the live crash: validateError → new RpcError({code: -32503}) → client.receiveResult reject → isConnectionError === false → permanent reject → process exit. With this change isConnectionError === true → re-enqueue + backoff.
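The branch traced above, permanent reject versus re-enqueue with backoff, can be modelled as a small pure function. This is an illustrative sketch only: `decideRetry`, `RetryPolicy` and `retrySchedule` are hypothetical names, and the real client's scheduling may differ. Only `retryAttempts` and its `Number.MAX_SAFE_INTEGER` value come from the PR description.

```typescript
type RetryDecision =
    | { action: 'retry'; delayMs: number }
    | { action: 'reject' }

interface RetryPolicy {
    retryAttempts: number    // how many retries are allowed
    retrySchedule: number[]  // assumed backoff delays in ms; the last entry is reused
}

// `attempt` is the number of retries already made for this request.
function decideRetry(
    isConnectionError: boolean,
    attempt: number,
    policy: RetryPolicy
): RetryDecision {
    // Errors that are not connection errors are rejected permanently.
    if (!isConnectionError || attempt >= policy.retryAttempts) {
        return { action: 'reject' }
    }
    // Walk the schedule, then stay on its last delay.
    const idx = Math.min(attempt, policy.retrySchedule.length - 1)
    return { action: 'retry', delayMs: policy.retrySchedule[idx] ?? 0 }
}

// Same value as Number.MAX_SAFE_INTEGER, which the dumper is said to use.
const UNBOUNDED_RETRIES = 9007199254740991

// The delay values here are made up for the example.
const dumperPolicy: RetryPolicy = {
    retryAttempts: UNBOUNDED_RETRIES,
    retrySchedule: [10, 100, 500, 2000],
}
```

With a policy like this, a -32503 that is classified as a connection error is never rejected by attempt count, which is why a durable upstream outage shows up as a stall rather than a crash.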

Falsification

  • If -32503 is persistent (e.g. dRPC stays out of balance and Alchemy stays down), the dumper will retry/stall instead of crash — progress halts and a writer-stall/no-progress alert fires. That's the intended degradation, but it means this code change does not, on its own, restore data flow when all upstream providers are durably down.
  • If after deploy the dumper still exits with a non-zero code on -32503 (rather than retrying), the classification is not taking effect.

Operator follow-up (out of scope for this PR)

The trigger is a provider-side degradation: dRPC reports User balance exceeded (billing depleted) and Alchemy is intermittently failing behind uniblock for chainId 998. Per policy a provider top-up/swap is an operator mitigation, not an autonomous PR — top up dRPC or repoint the hyperliquid-testnet upstream to a healthy provider. This PR makes the dumper survive the blip regardless.

… of crashing

Aggregating RPC providers (e.g. uniblock) return a -32503 'service
unavailable' error with the message 'Errors from the following providers
prevented the request from being fulfilled' when all of their upstream
providers momentarily fail. This is a transient availability error (the
HTTP 503 analog), but EvmRpcClient.isConnectionError did not recognise it,
so a single intermittent occurrence propagated out of eth_getBlockByNumber
and crashed the dumper process, producing a restart/crash-loop.

Treat it as a connection error so the existing retry+backoff machinery
(the EVM dumper retries connection errors with retryAttempts=MAX_SAFE_INTEGER)
rides over the blip, the same way rate-limit errors are already tolerated.
@tmcgroul

Contributor

change files are missing
