Skip to content

fix(proxy): clear transfer latch after unsafe attempt#24979

Open
aptend wants to merge 3 commits into
matrixorigin:mainfrom
aptend:fix/proxy-transfer-intent-stuck
Open

fix(proxy): clear transfer latch after unsafe attempt#24979
aptend wants to merge 3 commits into
matrixorigin:mainfrom
aptend:fix/proxy-transfer-intent-stuck

Conversation

@aptend

@aptend aptend commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24942

What this PR does / why we need it:

When a proxy scaling transfer cannot start safely, for example because the session is still in a transaction, the async transfer attempt returned before finishTransfer was registered. Deliver had already marked the tunnel as inTransfer, so later drain polls skipped the tunnel forever and the CN drain could remain blocked until the connection closed.

This PR clears the transfer latch on that unsafe early-return path while preserving transferIntent, so later retries or intent-driven transfer can still happen. It also makes transfer intent metric updates idempotent and has sync transfer acquire the same transfer latch to avoid overlapping async/sync migration attempts.

Tests cover:

  • unsafe async transfer clears inTransfer and can be re-enqueued;
  • sync transfer skips while another transfer attempt owns the latch;
  • sync cannot-start releases the latch;
  • repeated transfer intent true/false updates do not skew the gauge.

Copilot AI review requested due to automatic review settings June 15, 2026 03:02
@qodo-code-review

Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a proxy tunnel transfer deadlock where Deliver() could mark a tunnel as inTransfer, but an unsafe async transfer attempt could return early before finishTransfer() cleared the latch—causing subsequent drain polls to skip the tunnel indefinitely and blocking CN draining. It also makes transfer-intent metric updates idempotent and ensures sync transfers respect the same inTransfer latch to prevent overlapping migration attempts.

Changes:

  • Make setTransferIntent idempotent via atomic swap to avoid double-inc/dec of the transfer-intent gauge.
  • Add explicit helpers to acquire/release the inTransfer latch for transfer attempts, and ensure unsafe async early-return clears the latch.
  • Ensure transferSync also acquires/releases the latch and add tests covering unsafe async retryability, sync latch behavior, and metric idempotency.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
pkg/proxy/tunnel.go Clears inTransfer on unsafe async early-return, adds sync transfer latch acquisition, and makes transfer-intent gauge updates idempotent.
pkg/proxy/tunnel_test.go Adds tests for retry behavior after unsafe transfer, sync latch behavior, and idempotent gauge updates.

Comment thread pkg/proxy/tunnel.go
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Something isn't working size/M Denotes a PR that changes [100,499] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants