
[MySQL] Fix idle keepalive LSN stall and multi-server-UUID GTID parsing #706

Draft
michaelbarnes wants to merge 2 commits into main from fix/mysql-idle-keepalive-and-gtid-set-parsing

Conversation

@michaelbarnes commented Jul 1, 2026


Fixes #704
Fixes #705

Background

Both fixes come out of a single PowerSync support case. A customer on Service 1.22.0 (MySQL source, MongoDB bucket storage) reported replication lag on a quiet test instance that never recovered until a service restart, with "Waiting before creating checkpoint" logged every ~30 seconds. One log line from their instance turned out to contain evidence of two separate bugs. Both were reproduced in our support workbench against the pinned 1.22.0 image and confirmed to still be present on main.

Fix 1: idle keepalive LSN stall (#704)

Heartbeat keepalives re-sent the LSN from the start of the last transaction, while checkpoints store the LSN from the end of the same transaction. Since LSNs are compared as plain strings, every idle keepalive sorted below last_checkpoint_lsn and checkpoint creation stayed blocked until the next real write.

  • New advanceCommitPosition() helper in BinLogListener advances currentGTID to the commit position and returns the commit LSN
  • Applied at all three commit sites: the Xid handler, the DDL auto-commit path, and the non-transactional query commit path (the latter two exhibit the same stall after a DDL statement on an idle server)
  • The constructor now copies startGTID.position instead of aliasing and mutating the caller's object
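
A minimal sketch of why the stale keepalive LSN blocks checkpoints under plain string comparison. The LSN layout, the `toLsn` helper, and the sample transaction id, file name, and offsets are all hypothetical and chosen only for illustration; the real format is whatever ReplicatedGTID.comparable produces.

```typescript
// Hypothetical LSN layout for illustration: "<16-digit txn id>|<binlog file>|<offset>".
// The real layout produced by ReplicatedGTID.comparable may differ in detail.
function toLsn(transactionId: number, binlogFile: string, offset: number): string {
  return `${transactionId.toString().padStart(16, '0')}|${binlogFile}|${offset}`;
}

// One transaction: it starts at offset 1100 and its commit ends at offset 1350.
const transactionStartLsn = toLsn(42, 'binlog.000007', 1100);
const commitLsn = toLsn(42, 'binlog.000007', 1350);

// The checkpoint stores the LSN from the END of the transaction.
const lastCheckpointLsn = commitLsn;

// Before the fix: the idle keepalive re-sends the LSN from the START of that transaction.
const staleKeepaliveLsn = transactionStartLsn;
const blockedBeforeFix = staleKeepaliveLsn < lastCheckpointLsn; // true: sorts below the checkpoint

// After the fix: every commit path advances the current position to the commit position,
// so the keepalive carries the commit LSN.
const advancedKeepaliveLsn = commitLsn;
const blockedAfterFix = advancedKeepaliveLsn < lastCheckpointLsn; // false: no longer behind

console.log({ blockedBeforeFix, blockedAfterFix });
```

Because both LSNs share the same transaction id and binlog file, the comparison is decided entirely by the position suffix, which is why the stall persists until a new transaction moves the position forward.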

Fix 2: multi-server-UUID GTID sets parsed into NaN LSNs (#705)

ReplicatedGTID.comparable assumed a single uuid:ranges value, but SHOW MASTER STATUS returns multi-UUID gtid_executed sets joined with ,\n on servers with failover or restore history. The second UUID was mis-parsed into a NaN transaction id, producing LSNs like 0000000000000NaN|.... On low-transaction-count servers this permanently blocked all checkpoints and was not recoverable by a restart.

  • comparable now parses full GTID sets: multiple UUIDs, whitespace and newline tolerant, multiple intervals per UUID, taking the maximum transaction id across the set, never producing NaN
  • Output format is byte-identical for previously-correct inputs, since persisted LSN strings in bucket storage must keep comparing correctly across the upgrade
  • Also fixes two adjacent parsing defects: multi-interval sets (uuid:1-5:11-18 previously parsed as 5 instead of 18) and MySQL 8.4 tagged GTIDs (previously NaN)
  • deserialize now validates the binlog offset and throws loudly instead of silently producing a NaN position
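
The parsing rules above can be sketched roughly as follows. This is not the code in this PR: `maxTransactionId` and `comparablePrefix` are hypothetical names, the UUIDs are made-up sample values, and the 16-character padding is inferred from the sample LSN "0000000000000NaN|...". Transaction ids are handled as plain numbers, which is an assumption made for brevity.

```typescript
// Sketch only: returns the highest transaction id found anywhere in a GTID set.
// Accepts entries of the form "uuid:interval[:interval...]" separated by commas,
// with optional whitespace or newlines around each entry.
function maxTransactionId(gtidSet: string): number {
  let max = 0;
  for (const entry of gtidSet.split(',')) {
    // Drop the leading UUID; everything after it is an interval or (on MySQL 8.4) a tag.
    const [, ...segments] = entry.trim().split(':');
    for (const segment of segments) {
      const match = /^(\d+)(?:-(\d+))?$/.exec(segment.trim());
      if (!match) continue; // skip tags and unparseable segments instead of producing NaN
      const start = Number(match[1]);
      const end = Number(match[2] ?? match[1]); // a single id "7" is the interval 7-7
      max = Math.max(max, start, end);
    }
  }
  return max;
}

// Zero-padded prefix, assumed to be 16 characters wide.
function comparablePrefix(gtidSet: string): string {
  return maxTransactionId(gtidSet).toString().padStart(16, '0');
}

const multiUuid =
  '3e11fa47-71ca-11e1-9e33-c80aa9429562:1-23,\n5b2c9f10-0000-11ee-8c99-0242ac120002:1-7';
const multiInterval = '3e11fa47-71ca-11e1-9e33-c80aa9429562:1-5:11-18';

console.log(comparablePrefix(multiUuid)); // "0000000000000023"
console.log(maxTransactionId(multiInterval)); // 18
```

Skipping segments that are not numeric intervals is what keeps a tag or a stray token from turning the whole result into NaN; an input with no parseable interval falls back to 0 in this sketch.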

Tests

  • New ReplicatedGTID.test.ts unit suite (13 tests, no database needed): multi-UUID sets in the exact customer shape, multi-interval sets, ZERO format stability, defensive handling of unparseable segments, serialization round-trips, and ordering pins that document how corrected LSNs compare against legacy NaN-poisoned values
  • New BinLogListener integration test asserting the keepalive LSN after a commit equals the commit LSN; verified to fail on main before the fix
  • Full module suite passes: 127/127 against MySQL 8.0 with both MongoDB and Postgres storage

Notes for reviewers

  • Intentionally out of scope, tracked for follow-up: COMMIT/ROLLBACK query events never reset isTransactionOpen (suppresses keepalives after writes to non-transactional engines), and the binlog offset in the comparable format is not zero-padded (cannot change without breaking comparisons against persisted LSNs)
  • Instances already carrying a poisoned NaN checkpoint LSN self-heal once their transaction id reaches 1000; below that a resync is required, which no forward fix can avoid
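
Both notes follow from LSNs being compared as plain strings, which the sketch below tries to make concrete. The `padTransactionId` helper and the 16-character width are assumptions inferred from the sample LSN "0000000000000NaN|..."; the offset strings are hypothetical values.

```typescript
// Assumed padding: transaction id left-padded with zeros to 16 characters.
const padTransactionId = (id: number): string => String(id).padStart(16, '0');

// What a NaN transaction id turns into once padded.
const poisonedPrefix = padTransactionId(NaN); // "0000000000000NaN"

// Ids up to 999 occupy the same three trailing characters as "NaN", and every
// digit sorts below the letter "N", so a corrected LSN still sorts below the
// poisoned checkpoint LSN.
const id999SortsBelow = padTransactionId(999) < poisonedPrefix; // true

// At 1000 the id reaches the fourth character from the end, where the poisoned
// value has "0", so the corrected LSN finally sorts above it.
const id1000SortsAbove = padTransactionId(1000) > poisonedPrefix; // true

// The unpadded binlog offset has the same string-ordering problem:
// offset 100 sorts below offset 99 because "1" < "9".
const unpaddedOffsetMisorders = '100' < '99'; // true

console.log({ poisonedPrefix, id999SortsBelow, id1000SortsAbove, unpaddedOffsetMisorders });
```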

🤖 AI disclosure: this pull request was generated by Claude (via Claude Code). The investigation, reproduction, fix, and tests were produced by Claude working from the customer's logs, directed and reviewed by @michaelbarnes.

michaelbarnes and others added 2 commits July 1, 2026 17:21
Heartbeat keepalives re-sent the LSN from the start of the last
transaction, while checkpoints store the LSN from the end of the same
transaction. On an idle server this blocked checkpoint creation
("Waiting before creating checkpoint" every ~30s) until the next
transaction arrived.

All commit paths (Xid, DDL auto-commit, non-transactional query) now
advance the current GTID position to the commit position, so keepalive
LSNs are never behind the last checkpoint LSN. The listener also no
longer mutates the caller's startGTID position object.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ReplicatedGTID.comparable assumed a single server UUID in the raw GTID.
A gtid_executed containing multiple server UUIDs (e.g. after a failover
or restore) was mis-parsed into a NaN transaction id, producing LSNs
like "0000000000000NaN|...". On servers with a low transaction count
this permanently blocked checkpoint creation, and the corrupted LSN
could not be recovered by a service restart.

The comparable LSN now parses full GTID sets (multiple UUIDs joined
with ",\n", multiple intervals per UUID) and uses the maximum
transaction id across the set. Unparseable segments are skipped instead
of poisoning the result. deserialize now validates the binlog offset
instead of silently producing NaN.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@changeset-bot (Bot) commented Jul 1, 2026


🦋 Changeset detected

Latest commit: 8e9420d

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 12 packages
Name Type
@powersync/service-module-mysql Patch
@powersync/service-schema Patch
@powersync/service-image Patch
@powersync/service-core Patch
@powersync/service-module-convex Patch
@powersync/service-module-core Patch
@powersync/service-module-mongodb-storage Patch
@powersync/service-module-mongodb Patch
@powersync/service-module-mssql Patch
@powersync/service-module-postgres-storage Patch
@powersync/service-module-postgres Patch
test-client Patch


@michaelbarnes michaelbarnes requested a review from Rentacookie July 1, 2026 23:51
@Rentacookie

Contributor

Hi @michaelbarnes
Thank you, these changes look great! I just want to investigate the multi-server-UUID GTID sets issue a bit more.
It's good that we parse the GTID sets correctly now, but I am actually not sure what effect it has for our replication consistency were we to process GTIDs from multiple servers 😬

@michaelbarnes

Author

@Rentacookie, when you mention "am actually not sure what effect it has for our replication consistency were we to process GTIDs from multiple servers"

Do you mean multiple servers connecting to a single PowerSync Service instance?

@Rentacookie commented Jul 2, 2026

Contributor

@michaelbarnes

"Do you mean multiple servers connecting to a single PowerSync Service instance?"

No, I mean that events from multiple MySQL servers can appear in the binlog of the MySQL server that PowerSync is actually connected to. I believe this can happen if the MySQL DB we are connecting to is itself a replica, replicating from other MySQL servers.
