[MySQL] Fix idle keepalive LSN stall and multi-server-UUID GTID parsing #706
michaelbarnes wants to merge 2 commits
Heartbeat keepalives re-sent the LSN from the start of the last
transaction, while checkpoints store the LSN from the end of the same
transaction. On an idle server this blocked checkpoint creation
("Waiting before creating checkpoint" every ~30s) until the next
transaction arrived.
All commit paths (Xid, DDL auto-commit, non-transactional query) now
advance the current GTID position to the commit position, so keepalive
LSNs are never behind the last checkpoint LSN. The listener also no
longer mutates the caller's startGTID position object.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
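
The stall described in this commit message comes down to a plain string comparison between two LSNs of the same transaction. The sketch below is illustrative only: the `comparableLsn` helper and the `<16-digit transaction id>|<binlog offset>` layout are assumptions made for the example, not the service's actual serialization.

```typescript
// Hypothetical LSN layout, assumed for illustration only:
//   <transaction id zero-padded to 16 digits>|<binlog offset>
function comparableLsn(transactionId: number, binlogOffset: number): string {
  return `${String(transactionId).padStart(16, "0")}|${binlogOffset}`;
}

// The same transaction seen at two binlog positions: its start and its commit.
const startOfTxn = comparableLsn(42, 1200);
const endOfTxn = comparableLsn(42, 1500);

// The checkpoint stores the end-of-transaction LSN.
const lastCheckpointLsn = endOfTxn;

// Before the fix: the keepalive re-sends the start-of-transaction LSN,
// which sorts below the checkpoint LSN, so checkpoint creation waits.
const keepaliveBefore = startOfTxn;
const blockedBefore = keepaliveBefore < lastCheckpointLsn;

// After the fix: the keepalive uses the commit position,
// so it is never behind the last checkpoint LSN.
const keepaliveAfter = endOfTxn;
const blockedAfter = keepaliveAfter < lastCheckpointLsn;
```

Here `blockedBefore` is `true` and `blockedAfter` is `false`, which mirrors the change from a repeated "Waiting before creating checkpoint" to a keepalive that can advance the checkpoint on an idle server.
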
ReplicatedGTID.comparable assumed a single server UUID in the raw GTID.
A gtid_executed containing multiple server UUIDs (e.g. after a failover
or restore) was mis-parsed into a NaN transaction id, producing LSNs
like "0000000000000NaN|...". On servers with a low transaction count
this permanently blocked checkpoint creation, and the corrupted LSN
could not be recovered by a service restart.
The comparable LSN now parses full GTID sets (multiple UUIDs joined
with ",\n", multiple intervals per UUID) and uses the maximum
transaction id across the set. Unparseable segments are skipped instead
of poisoning the result. deserialize now validates the binlog offset
instead of silently producing NaN.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
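
The parsing rule described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's `ReplicatedGTID` code: `maxTransactionId` and `toComparablePrefix` are hypothetical names, and treating any non-numeric part (such as a MySQL 8.4 tag) as skippable is an assumption of the sketch.

```typescript
// Sketch only: hypothetical helpers, not the actual ReplicatedGTID implementation.

// Returns the highest transaction id found anywhere in a GTID set,
// or 0 when nothing in the input can be parsed.
function maxTransactionId(gtidSet: string): number {
  let max = 0;
  // UUID entries are separated by commas, optionally followed by newlines.
  for (const entry of gtidSet.split(",")) {
    // The first part is the server UUID; the rest are intervals
    // (or, assumed here, a tag on MySQL 8.4).
    const [, ...parts] = entry.trim().split(":");
    for (const part of parts) {
      // Accept "n" or "n-m"; anything else is skipped rather than
      // being allowed to turn the result into NaN.
      const match = /^(\d+)(?:-(\d+))?$/.exec(part.trim());
      if (match === null) continue;
      const end = Number(match[2] ?? match[1]);
      if (Number.isSafeInteger(end) && end > max) max = end;
    }
  }
  return max;
}

// Zero-pads the transaction id to 16 digits, matching the width of the
// "0000000000000NaN" prefix quoted in the commit message.
function toComparablePrefix(gtidSet: string): string {
  return String(maxTransactionId(gtidSet)).padStart(16, "0");
}

// Usage with made-up UUIDs: two servers, one of them with two intervals.
const exampleSet =
  "3e11fa47-71ca-11e1-9e33-c80aa9429562:1-5:11-18,\n" +
  "9b0f5c2e-0000-11ee-be56-0242ac120002:1-7";
const examplePrefix = toComparablePrefix(exampleSet);
```

Taking the maximum over every interval is what keeps a set such as `uuid:1-5:11-18` from being read as 5, and skipping unmatched parts keeps one odd segment from poisoning the whole LSN.
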
🦋 Changeset detected
Latest commit: 8e9420d
The changes in this PR will be included in the next version bump. This PR includes changesets to release 12 packages.
Hi @michaelbarnes
@Rentacookie, when you mention "am actually not sure what effect it has for our replication consistency were we to process GTIDs from multiple servers", do you mean multiple servers connecting to a single PowerSync Service instance?
No, I mean that events from multiple MySQL servers can appear in the binlog of the MySQL server that PowerSync is actually connected to. I believe this can happen if the MySQL database we are connecting to is itself a replica, replicating from other MySQL servers.
Fixes #704
Fixes #705
Background
Both fixes come out of a single PowerSync support case: a customer on Service 1.22.0 (MySQL source, MongoDB bucket storage) reported replication lag on a quiet test instance that never recovered until a service restart, with `Waiting before creating checkpoint` logged every ~30 seconds. One log line from their instance turned out to contain evidence of two separate bugs, both reproduced in our support workbench against the pinned 1.22.0 image and verified unchanged on `main`.
Fix 1: idle keepalive LSN stall (#704)
Heartbeat keepalives re-sent the LSN from the start of the last transaction, while checkpoints store the LSN from the end of the same transaction. Since LSNs are compared as plain strings, every idle keepalive sorted below `last_checkpoint_lsn` and checkpoint creation stayed blocked until the next real write.
- `advanceCommitPosition()` helper in `BinLogListener` advances `currentGTID` to the commit position and returns the commit LSN
- The listener copies `startGTID.position` instead of aliasing and mutating the caller's object
Fix 2: multi-server-UUID GTID sets parsed into NaN LSNs (#705)
`ReplicatedGTID.comparable` assumed a single `uuid:ranges` value, but `SHOW MASTER STATUS` returns multi-UUID `gtid_executed` sets joined with `,\n` on servers with failover or restore history. The second UUID was mis-parsed into a `NaN` transaction id, producing LSNs like `0000000000000NaN|...`. On low-transaction-count servers this permanently blocked all checkpoints and was not recoverable by a restart.
- `comparable` now parses full GTID sets: multiple UUIDs, whitespace and newline tolerant, multiple intervals per UUID, taking the maximum transaction id across the set, never producing `NaN`
- Multi-interval sets (`uuid:1-5:11-18` previously parsed as 5 instead of 18) and MySQL 8.4 tagged GTIDs (previously `NaN`) are now handled
- `deserialize` now validates the binlog offset and throws an error instead of silently producing a `NaN` position
Tests
- `ReplicatedGTID.test.ts` unit suite (13 tests, no database needed): multi-UUID sets in the exact customer shape, multi-interval sets, ZERO format stability, defensive handling of unparseable segments, serialization round-trips, and ordering pins that document how corrected LSNs compare against legacy NaN-poisoned values
- `BinLogListener` integration test asserting that the keepalive LSN after a commit equals the commit LSN; verified to fail on `main` before the fix
Notes for reviewers
- `COMMIT`/`ROLLBACK` query events never reset `isTransactionOpen` (suppresses keepalives after writes to non-transactional engines), and the binlog offset in the comparable format is not zero-padded (cannot change without breaking comparisons against persisted LSNs)
- Instances with a persisted `NaN` checkpoint LSN self-heal once their transaction id reaches 1000; below that a resync is required, which no forward fix can avoid
🤖 AI disclosure: this pull request was generated by Claude (via Claude Code). The investigation, reproduction, fix, and tests were produced by Claude working from the customer's logs, directed and reviewed by @michaelbarnes.
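
The self-heal threshold of 1000 in the reviewer notes follows from how a 16-character zero-padded id compares, as a string, against the poisoned prefix `0000000000000NaN`. The sketch below assumes that only this padded prefix decides the ordering; `pad` is a hypothetical helper used for illustration.

```typescript
// Hypothetical helper: zero-pad a transaction id to 16 characters.
const pad = (id: number): string => String(id).padStart(16, "0");

// The prefix persisted by the old parser for a multi-UUID GTID set.
const poisoned = pad(Number.NaN);

// 999 pads to "0000000000000999": at the first differing character, "9"
// sorts before "N", so a corrected LSN still sorts below the poisoned one.
const passesAt999 = pad(999) > poisoned;

// 1000 pads to "0000000000001000": the extra digit puts a "1" where the
// poisoned value still has a "0", so the corrected LSN now sorts above it.
const passesAt1000 = pad(1000) > poisoned;
```

Here `passesAt999` is `false` and `passesAt1000` is `true`, which is why instances below 1000 transactions cannot move past a persisted `NaN` checkpoint without a resync.
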