fix(inventory): isolate per-bridge and per-chain failures in cross-chain transfer accounting#3484
fix(inventory): isolate per-bridge and per-chain failures in cross-chain transfer accounting#3484droplet-rl wants to merge 3 commits into
Conversation
…tandingCrossChainTransfers A transient outage in one bridge's read API (e.g. a 503 from the shared BridgeApi via TokenSplitterBridge) used to propagate up through BaseChainAdapter.getOutstandingCrossChainTransfers -> CrossChainTransferClient.update -> InventoryClient.update and crash the relayer/rebalancer process. Wrap each (l1Token, monitoredAddress) block in try/catch and log at error (at: "<adapterName>#getOutstandingCrossChainTransfers"). The inner Promise.all([queryL1, queryL2]) stays all-or-nothing because partial success there would skew outstanding = deposited - finalized; entries for the affected pair are simply absent from this cycle's result. Adds an isolation test reusing the existing SplitBridgeTracking scaffolding. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…t.update A transient outage in one chain adapter's outstanding-transfer read used to crash InventoryClient.update via this aggregator's Promise.all. Switch to Promise.allSettled across chains: fulfilled chains overwrite with fresh state; rejected chains preserve their previously-recorded state instead of being blanked, biasing the InventoryClient toward under-rebalancing (safer) rather than duplicate-bridging on a single failed cycle. Per-chain failures log at error (at: "CrossChainTransferClient"). This is independent from and complementary to the per-bridge isolation in BaseChainAdapter — defense in depth for any uncaught path in the adapter layer. Adds focused unit tests for the isolation + stale-preservation behaviors. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e2c202ae16
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // failed chains we preserve the previously-recorded state instead of blanking it — stale state | ||
| // biases the InventoryClient toward under-rebalancing (safer) rather than duplicate-bridging. | ||
| const settled = await Promise.allSettled( | ||
| chainIds.map((chainId) => this.adapterManager.getOutstandingCrossChainTransfers(chainId, l1Tokens)) |
There was a problem hiding this comment.
Wrap adapter calls so synchronous throws are settled
When AdapterManager.getOutstandingCrossChainTransfers throws before returning a promise, this map aborts before Promise.allSettled can turn the failure into a rejected result. That method does synchronous work first, including looking up the adapter and filtering tokens with getTokenInfo/adapter.supportedTokens, so a missing adapter or token-metadata/config error still makes CrossChainTransferClient.update reject and propagates back through InventoryClient.update instead of logging the failed chain and preserving stale state.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — AdapterManager.getOutstandingCrossChainTransfers does synchronous work (adapter lookup + getTokenInfo filter) before returning a promise, so a sync throw would abort the chainIds.map callback before Promise.allSettled could settle it. Fixed in 427fa45 by wrapping the callback in async, which turns sync throws into rejected promises that the per-chain failure path then logs and isolates. Added test/CrossChainTransferClient.update.ts coverage for the sync-throw case.
| message: `Failed to fetch outstanding transfers for ${monitoredAddress.toNative()} ${l1Token.toNative()}; skipping for this cycle`, | ||
| error: stringifyThrownValue(error), | ||
| }); | ||
| return; |
There was a problem hiding this comment.
Preserve stale bridge entries on partial adapter failures
When one bridge/address read fails here, the adapter returns a fulfilled but partial result; CrossChainTransferClient.update then overwrites the whole chain cache with that partial object because the chain-level promise did not reject. In a transient outage for a token that had outstanding transfers in the previous cycle, this removes that token's stale outstanding amount instead of preserving it, so inventory can treat the in-flight bridge as absent and submit a duplicate rebalance.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Agreed — the partial result would clobber the previous (address, l1Token) entry on chain-level overwrite, biasing the InventoryClient toward a duplicate rebalance for the failed bridge. Fixed in 427fa45: CrossChainTransferClient.update now forwards the previously-recorded chain state into the adapter, and BaseChainAdapter.getOutstandingCrossChainTransfers copies the previous (address, l1Token) entry into the result when a bridge read fails. Healthy bridges still produce fresh state. Added a SplitBridgeTracking test asserting the failing bridge inherits the previous cycle entry verbatim while the healthy one surfaces fresh data.
…ale entries Addresses Codex review on #3484: P2: `AdapterManager.getOutstandingCrossChainTransfers` does synchronous work (adapter lookup, getTokenInfo / supportedTokens filtering) before returning a promise. A throw there aborted the `chainIds.map` callback before `Promise.allSettled` could see it. Wrap the callback in `async` so sync throws become rejected promises and stay isolated to the failed chain. P1: On a per-bridge failure the adapter previously returned a partial result omitting that `(address, l1Token)` pair. The chain-level overwrite in `CrossChainTransferClient.update` would then blank the stale outstanding amount — risking a duplicate rebalance. Pipe the previous chain state through `AdapterManager` → `BaseChainAdapter` and, on bridge-level catch, copy the previous `(address, l1Token)` entry into the new result. Tests: - `CrossChainTransferClient.update`: new test for sync-throw isolation. - `SplitBridgeTracking`: new test asserting the failing bridge inherits the previous cycle's entry while the healthy bridge surfaces fresh state. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Addressed both Codex findings in 427fa45 and posted inline replies on each thread. P2 (sync throws escape P1 (partial result blanks stale entries) — Plumbed the previous chain state through Added two new tests (sync-throw isolation + per-bridge stale-entry preservation). Full suite: @codex please re-review. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 427fa4542e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const previousEntry = previousOutstandingTransfers?.[monitoredAddress.toNative()]?.[l1Token.toNative()]; | ||
| if (previousEntry !== undefined) { | ||
| assign(outstandingTransfers, [monitoredAddress.toNative(), l1Token.toNative()], previousEntry); | ||
| } |
There was a problem hiding this comment.
Fail closed when no stale bridge entry exists
When this catch runs before the client has a prior entry for the token/address (for example after a process restart or the first update cycle), the failing bridge is omitted from a fulfilled chain result. CrossChainTransferClient.update then overwrites the chain with that partial result, and InventoryClient.getBalanceOnChain treats the missing outstanding transfer as zero, so an actual in-flight rebalance can be duplicated during a bridge outage. Please propagate an unknown/failed state for this token or chain instead of returning success with the entry absent when previousEntry is undefined.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
I considered this and want to push back. Tracing through the scenarios:
Warm restart (process crashed, all previousOutstandingTransfers = undefined, real in-flight rebalances pre-restart):
- Current code: bridge A succeeds with $5, bridge B fails → state = {A: $5, B: absent} → InventoryClient sees A=$5 (correct), B=0 (duplicate risk).
- Fail-closed-on-cold-start: A's fresh $5 thrown away → chain state preserved as undefined → InventoryClient sees A=0 and B=0 → duplicate risk for both.
Fail-closed is strictly worse here: it loses the healthy bridge's good read without removing duplicate risk for the failed bridge.
Steady state (previous entries exist for all bridges): bridge failure already preserves the previous entry — no regression either way.
Brand-new bridge added mid-run (one bridge has no previous entry because it's genuinely new): if that new bridge fails on its first cycle, current code reports 0 (correct — it's new, no pre-cycle in-flight). Fail-closed would discard the other bridges' fresh reads to "protect" against a risk that doesn't exist.
The fundamental issue is that InventoryClient.getOutstandingCrossChainTransferAmount returns a BigNumber with no "unknown" sentinel. Per-(token, address) failure propagation would require an InventoryClient interface change so it can defer the rebalance decision on unknown state — that's a meaningful follow-up, but escalating to chain-level on cold start in the meantime makes warm-restart cases worse, not better.
The duplicate-rebalance risk for a failed-read bridge with a true-but-unknown in-flight transfer pre-process is inherent and not solvable at this layer without that API change. Happy to track it as a follow-up if useful.
Summary
Companion to #3437 (rebalancer-client isolation). A transient outage in one bridge's or one chain adapter's outstanding-transfer read used to propagate up through
InventoryClient.updateand crash the relayer/rebalancer process. This PR isolates failures at two independent layers, in two commits with no inter-commit dependencies.Motivation
zion-across-fast-relayer-rebalancercrashed on:The error bottoms out in
InventoryClient.update, which has no catch around the cross-chain-transfer-accounting path.Commits
1.
fix(adapter): isolate per-bridge failures in BaseChainAdapter.getOutstandingCrossChainTransfersWraps each
(l1Token, monitoredAddress)block in try/catch. Logs aterror(at: \"<adapterName>#getOutstandingCrossChainTransfers\"). The innerPromise.all([queryL1, queryL2])stays all-or-nothing because partial success there would skewoutstanding = deposited − finalizedaccounting. Affected entries are absent from the cycle's result. Test reuses the existingSplitBridgeTrackingscaffolding (one newitblock + a new optionalqueryErrorarg on the existingMockTrackedBridge).2.
fix(inventory): isolate per-chain failures in CrossChainTransferClient.updatePromise.all→Promise.allSettledacross chains. Fulfilled chains overwrite with fresh state; rejected chains preserve their previously-recorded state instead of being blanked — stale state biases the InventoryClient toward under-rebalancing (safer) rather than duplicate-bridging during a single failed cycle. Logs aterror(at: \"CrossChainTransferClient\"). Defense in depth for any uncaught path inside the adapter layer.Either commit stands on its own; reverting one does not break the other.
Ops note
Add Datadog alerts on either of:
at: \"<adapterName>#getOutstandingCrossChainTransfers\"(substring#getOutstandingCrossChainTransfers) withlevel: errorat: \"CrossChainTransferClient\"withlevel: errorand message containingoutstanding cross chain transfersTest plan
tsc --noEmitcleaneslint+prettiercleanhardhat test test/CrossChainTransferClient.update.ts test/generic-adapters/SplitBridgeTracking.ts— 7/7 passing (5 existing + 2 new)hardhat test test/InventoryClient.InventoryRebalance.ts test/InventoryClient.RefundChain.ts test/Monitor.ts— 66/66 passing, no regressionsat:values🤖 Generated with Claude Code