evm-rpc: retry transient upstream-unavailable (-32503) errors instead of crashing the dumper #501
Open
elina-chertova wants to merge 1 commit into
… of crashing

Aggregating RPC providers (e.g. uniblock) return a -32503 'service unavailable' error with the message 'Errors from the following providers prevented the request from being fulfilled' when all of their upstream providers momentarily fail. This is a transient availability error (the HTTP 503 analog), but EvmRpcClient.isConnectionError did not recognise it, so a single intermittent occurrence propagated out of eth_getBlockByNumber and crashed the dumper process, producing a restart/crash-loop.

Treat it as a connection error so the existing retry+backoff machinery (the EVM dumper retries connection errors with retryAttempts=MAX_SAFE_INTEGER) rides over the blip, the same way rate-limit errors are already tolerated.
Contributor
change files are missing
Cause (proven)

The dump-hyperliquid-testnet-0 dumper (namespace evm-archive, image subsquid/evm-dump:cf85ec9c) was in a restart/crash-loop: 96 restarts in ~9h. Crash logs (kubectl logs --previous) show the process exiting on an unhandled RPC error from the uniblock aggregator endpoint (api.uniblock.dev/uni/v1/json-rpc?chainId=998).

The error is intermittent: between crashes the dumper progresses normally (currently ingesting at ~13 blocks/sec), so the aggregator mostly succeeds and only occasionally returns -32503 when all of its upstream providers momentarily fail at once.

EvmRpcClient.isConnectionError (evm/evm-rpc/src/rpc-client.ts) did not classify -32503 as retryable: it isn't a rate-limit code, and it isn't -32000/-32603/"internal error". So the error escaped the retry machinery, propagated out of getBlocks → eth_getBlockByNumber, and crashed the process. This is the bug from the maintainer's earlier note, "why is it causing process crash if it's an intermittent error": the fatality is the defect, not the bad upstream response.

Fix (tested)
Recognise the aggregator's transient -32503 "service unavailable" (...prevented the request from being fulfilled) as a connection error, alongside rate-limit errors. The EVM dumper already retries connection errors indefinitely with backoff (evm/evm-dump/src/dumper.ts sets retryAttempts: Number.MAX_SAFE_INTEGER), so the dumper now rides over the blip instead of crash-looping, the same tolerance already applied to rate limits. The same predicate also gates isBatchRetryableError, so batch and single-call paths are both covered.

Mechanism verified by tracing the live crash: validateError → new RpcError({code:-32503}) → client.receiveResult reject → isConnectionError === false → permanent reject → process exit. With this change isConnectionError === true → re-enqueue + backoff.

Falsification
If -32503 is persistent (e.g. dRPC stays out of balance and Alchemy stays down), the dumper will retry/stall instead of crash: progress halts and a writer-stall/no-progress alert fires. That's the intended degradation, but it means this code change does not, on its own, restore data flow when all upstream providers are durably down.

If the dumper still crashes on -32503 (rather than retrying), the classification is not taking effect.

Operator follow-up (out of scope for this PR)
The trigger is a provider-side degradation: dRPC reports User balance exceeded (billing depleted) and Alchemy is intermittently failing behind uniblock for chainId 998. Per policy a provider top-up/swap is an operator mitigation, not an autonomous PR: top up dRPC or repoint the hyperliquid-testnet upstream to a healthy provider. This PR makes the dumper survive the blip regardless.
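
For reviewers without the diff at hand, the classification described under Fix might look roughly like the sketch below. It is not the code in evm/evm-rpc/src/rpc-client.ts: RpcErrorLike, isUpstreamUnavailable and classify are hypothetical names, and matching on both the code and the message fragment is an assumption, since the description does not say whether the real predicate checks one or both.

```typescript
// Hypothetical, simplified error shape; the real RpcError class may carry more fields.
interface RpcErrorLike {
    code: number
    message: string
}

// Error code and message fragment quoted in this PR description.
const UPSTREAM_UNAVAILABLE_CODE = -32503
const UPSTREAM_UNAVAILABLE_FRAGMENT = 'prevented the request from being fulfilled'

// Assumption: require both the code and the message fragment, so an unrelated
// use of -32503 by another provider is not silently retried forever.
function isUpstreamUnavailable(err: RpcErrorLike): boolean {
    return err.code === UPSTREAM_UNAVAILABLE_CODE
        && err.message.toLowerCase().includes(UPSTREAM_UNAVAILABLE_FRAGMENT)
}

// Stand-in for the decision taken after the connection-error check:
// 'retry' means re-enqueue with backoff, 'fatal' means reject permanently.
function classify(err: RpcErrorLike): 'retry' | 'fatal' {
    return isUpstreamUnavailable(err) ? 'retry' : 'fatal'
}
```

Under these assumptions, the aggregator error quoted in the commit message classifies as 'retry', while an unrelated JSON-RPC error such as -32602 stays 'fatal'.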