Skip to content

fix(inventory): isolate per-bridge and per-chain failures in cross-chain transfer accounting#3484

Open
droplet-rl wants to merge 3 commits into
masterfrom
droplet/T90K0AL22-C03GHT4RV42-1781175656-993129
Open

fix(inventory): isolate per-bridge and per-chain failures in cross-chain transfer accounting#3484
droplet-rl wants to merge 3 commits into
masterfrom
droplet/T90K0AL22-C03GHT4RV42-1781175656-993129

Conversation

@droplet-rl

Copy link
Copy Markdown
Contributor

Summary

Companion to #3437 (rebalancer-client isolation). A transient outage in one bridge's or one chain adapter's outstanding-transfer read used to propagate up through InventoryClient.update and crash the relayer/rebalancer process. This PR isolates failures at two independent layers, in two commits with no inter-commit dependencies.

Motivation

zion-across-fast-relayer-rebalancer crashed on:

HttpError: HTTP 503: Service Unavailable
  at BridgeApiClient.getWithRetry
  at BridgeApi.queryL1BridgeInitiationEvents
  at TokenSplitterBridge.queryL1BridgeInitiationEvents
  at BaseChainAdapter (getOutstandingCrossChainTransfers)

The error bottoms out in InventoryClient.update, which has no catch around the cross-chain-transfer-accounting path.

Commits

1. fix(adapter): isolate per-bridge failures in BaseChainAdapter.getOutstandingCrossChainTransfers

Wraps each (l1Token, monitoredAddress) block in try/catch. Logs at error (at: \"<adapterName>#getOutstandingCrossChainTransfers\"). The inner Promise.all([queryL1, queryL2]) stays all-or-nothing because partial success there would skew outstanding = deposited − finalized accounting. Affected entries are absent from the cycle's result. Test reuses the existing SplitBridgeTracking scaffolding (one new it block + a new optional queryError arg on the existing MockTrackedBridge).

2. fix(inventory): isolate per-chain failures in CrossChainTransferClient.update

Promise.allPromise.allSettled across chains. Fulfilled chains overwrite with fresh state; rejected chains preserve their previously-recorded state instead of being blanked — stale state biases the InventoryClient toward under-rebalancing (safer) rather than duplicate-bridging during a single failed cycle. Logs at error (at: \"CrossChainTransferClient\"). Defense in depth for any uncaught path inside the adapter layer.

Either commit stands on its own; reverting one does not break the other.

Ops note

Add Datadog alerts on either of:

  • at: \"<adapterName>#getOutstandingCrossChainTransfers\" (substring #getOutstandingCrossChainTransfers) with level: error
  • at: \"CrossChainTransferClient\" with level: error and message containing outstanding cross chain transfers

Test plan

  • tsc --noEmit clean
  • eslint + prettier clean
  • hardhat test test/CrossChainTransferClient.update.ts test/generic-adapters/SplitBridgeTracking.ts — 7/7 passing (5 existing + 2 new)
  • hardhat test test/InventoryClient.InventoryRebalance.ts test/InventoryClient.RefundChain.ts test/Monitor.ts — 66/66 passing, no regressions
  • Operator: wire Datadog alerts on the two new error-log at: values

🤖 Generated with Claude Code

droplet-rl and others added 2 commits June 11, 2026 11:34
…tandingCrossChainTransfers

A transient outage in one bridge's read API (e.g. a 503 from the shared
BridgeApi via TokenSplitterBridge) used to propagate up through
BaseChainAdapter.getOutstandingCrossChainTransfers -> CrossChainTransferClient.update
-> InventoryClient.update and crash the relayer/rebalancer process. Wrap each
(l1Token, monitoredAddress) block in try/catch and log at error
(at: "<adapterName>#getOutstandingCrossChainTransfers"). The inner
Promise.all([queryL1, queryL2]) stays all-or-nothing because partial success
there would skew outstanding = deposited - finalized; entries for the affected
pair are simply absent from this cycle's result.

Adds an isolation test reusing the existing SplitBridgeTracking scaffolding.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…t.update

A transient outage in one chain adapter's outstanding-transfer read used to
crash InventoryClient.update via this aggregator's Promise.all. Switch to
Promise.allSettled across chains: fulfilled chains overwrite with fresh state;
rejected chains preserve their previously-recorded state instead of being
blanked, biasing the InventoryClient toward under-rebalancing (safer) rather
than duplicate-bridging on a single failed cycle. Per-chain failures log at
error (at: "CrossChainTransferClient").

This is independent from and complementary to the per-bridge isolation in
BaseChainAdapter — defense in depth for any uncaught path in the adapter
layer. Adds focused unit tests for the isolation + stale-preservation
behaviors.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e2c202ae16

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// failed chains we preserve the previously-recorded state instead of blanking it — stale state
// biases the InventoryClient toward under-rebalancing (safer) rather than duplicate-bridging.
const settled = await Promise.allSettled(
chainIds.map((chainId) => this.adapterManager.getOutstandingCrossChainTransfers(chainId, l1Tokens))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Wrap adapter calls so synchronous throws are settled

When AdapterManager.getOutstandingCrossChainTransfers throws before returning a promise, this map aborts before Promise.allSettled can turn the failure into a rejected result. That method does synchronous work first, including looking up the adapter and filtering tokens with getTokenInfo/adapter.supportedTokens, so a missing adapter or token-metadata/config error still makes CrossChainTransferClient.update reject and propagates back through InventoryClient.update instead of logging the failed chain and preserving stale state.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — AdapterManager.getOutstandingCrossChainTransfers does synchronous work (adapter lookup + getTokenInfo filter) before returning a promise, so a sync throw would abort the chainIds.map callback before Promise.allSettled could settle it. Fixed in 427fa45 by wrapping the callback in async, which turns sync throws into rejected promises that the per-chain failure path then logs and isolates. Added test/CrossChainTransferClient.update.ts coverage for the sync-throw case.

message: `Failed to fetch outstanding transfers for ${monitoredAddress.toNative()} ${l1Token.toNative()}; skipping for this cycle`,
error: stringifyThrownValue(error),
});
return;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve stale bridge entries on partial adapter failures

When one bridge/address read fails here, the adapter returns a fulfilled but partial result; CrossChainTransferClient.update then overwrites the whole chain cache with that partial object because the chain-level promise did not reject. In a transient outage for a token that had outstanding transfers in the previous cycle, this removes that token's stale outstanding amount instead of preserving it, so inventory can treat the in-flight bridge as absent and submit a duplicate rebalance.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — the partial result would clobber the previous (address, l1Token) entry on chain-level overwrite, biasing the InventoryClient toward a duplicate rebalance for the failed bridge. Fixed in 427fa45: CrossChainTransferClient.update now forwards the previously-recorded chain state into the adapter, and BaseChainAdapter.getOutstandingCrossChainTransfers copies the previous (address, l1Token) entry into the result when a bridge read fails. Healthy bridges still produce fresh state. Added a SplitBridgeTracking test asserting the failing bridge inherits the previous cycle entry verbatim while the healthy one surfaces fresh data.

…ale entries

Addresses Codex review on #3484:

P2: `AdapterManager.getOutstandingCrossChainTransfers` does synchronous work
(adapter lookup, getTokenInfo / supportedTokens filtering) before returning a
promise. A throw there aborted the `chainIds.map` callback before
`Promise.allSettled` could see it. Wrap the callback in `async` so sync throws
become rejected promises and stay isolated to the failed chain.

P1: On a per-bridge failure the adapter previously returned a partial result
omitting that `(address, l1Token)` pair. The chain-level overwrite in
`CrossChainTransferClient.update` would then blank the stale outstanding
amount — risking a duplicate rebalance. Pipe the previous chain state through
`AdapterManager` → `BaseChainAdapter` and, on bridge-level catch, copy the
previous `(address, l1Token)` entry into the new result.

Tests:
- `CrossChainTransferClient.update`: new test for sync-throw isolation.
- `SplitBridgeTracking`: new test asserting the failing bridge inherits the
  previous cycle's entry while the healthy bridge surfaces fresh state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@droplet-rl

Copy link
Copy Markdown
Contributor Author

Addressed both Codex findings in 427fa45 and posted inline replies on each thread.

P2 (sync throws escape Promise.allSettled)AdapterManager.getOutstandingCrossChainTransfers does sync work (adapter lookup + getTokenInfo filter) before returning a promise. Wrapped the chainIds.map callback in async so sync throws become rejected promises and route through the per-chain failure path.

P1 (partial result blanks stale entries) — Plumbed the previous chain state through CrossChainTransferClient.updateAdapterManagerBaseChainAdapter.getOutstandingCrossChainTransfers. On a bridge-level catch, the adapter now copies the previous (address, l1Token) entry into the result, so a transient failure on one bridge no longer drops the in-flight rebalance amount for that token. Healthy bridges still surface fresh state.

Added two new tests (sync-throw isolation + per-bridge stale-entry preservation). Full suite: tsc clean, prettier clean, hardhat test test/CrossChainTransferClient.update.ts test/generic-adapters/SplitBridgeTracking.ts 9/9, and test/InventoryClient.InventoryRebalance.ts test/InventoryClient.RefundChain.ts test/Monitor.ts 66/66.

@codex please re-review.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 427fa4542e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +585 to +588
const previousEntry = previousOutstandingTransfers?.[monitoredAddress.toNative()]?.[l1Token.toNative()];
if (previousEntry !== undefined) {
assign(outstandingTransfers, [monitoredAddress.toNative(), l1Token.toNative()], previousEntry);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Fail closed when no stale bridge entry exists

When this catch runs before the client has a prior entry for the token/address (for example after a process restart or the first update cycle), the failing bridge is omitted from a fulfilled chain result. CrossChainTransferClient.update then overwrites the chain with that partial result, and InventoryClient.getBalanceOnChain treats the missing outstanding transfer as zero, so an actual in-flight rebalance can be duplicated during a bridge outage. Please propagate an unknown/failed state for this token or chain instead of returning success with the entry absent when previousEntry is undefined.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered this and want to push back. Tracing through the scenarios:

Warm restart (process crashed, all previousOutstandingTransfers = undefined, real in-flight rebalances pre-restart):

  • Current code: bridge A succeeds with $5, bridge B fails → state = {A: $5, B: absent} → InventoryClient sees A=$5 (correct), B=0 (duplicate risk).
  • Fail-closed-on-cold-start: A's fresh $5 thrown away → chain state preserved as undefined → InventoryClient sees A=0 and B=0 → duplicate risk for both.

Fail-closed is strictly worse here: it loses the healthy bridge's good read without removing duplicate risk for the failed bridge.

Steady state (previous entries exist for all bridges): bridge failure already preserves the previous entry — no regression either way.

Brand-new bridge added mid-run (one bridge has no previous entry because it's genuinely new): if that new bridge fails on its first cycle, current code reports 0 (correct — it's new, no pre-cycle in-flight). Fail-closed would discard the other bridges' fresh reads to "protect" against a risk that doesn't exist.

The fundamental issue is that InventoryClient.getOutstandingCrossChainTransferAmount returns a BigNumber with no "unknown" sentinel. Per-(token, address) failure propagation would require an InventoryClient interface change so it can defer the rebalance decision on unknown state — that's a meaningful follow-up, but escalating to chain-level on cold start in the meantime makes warm-restart cases worse, not better.

The duplicate-rebalance risk for a failed-read bridge with a true-but-unknown in-flight transfer pre-process is inherent and not solvable at this layer without that API change. Happy to track it as a follow-up if useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant