wip feat: Request Checkpoints #696

Draft
stevensJourney wants to merge 6 commits into main from client-checkpoints

@stevensJourney stevensJourney commented Jun 29, 2026

Collaborator

Overview

This adds the PowerSync service component for requestCheckpoint, as mentioned in these proposals:

This is related to the following open PRs:

This PR adds a /sync/checkpoint-request route which clients can use to create Checkpoint requests.

Checkpoint requests currently flow through the standard write checkpoint flow in the sync protocol. The implementation here uses the existing collections/tables for write checkpoints for general checkpoint requests.

Collections

Using the same collections for the previous and current checkpoint requests has a few advantages.

Sync protocol

Checkpoint requests currently flow through the existing write_checkpoint marker in Checkpoint started events. We use the existing lastWriteCheckpoint logic for this. This works regardless of the checkpointing method used by the client.

Migrations

Client and PowerSync service version migrations (upgrades and downgrades) are compatible by default. If an existing client has a current write checkpoint record, its client_id is preserved and future requests use monotonically increasing IDs (there are some exceptions to this, covered later).

Cleanup

One of the Current Issues in https://github.com/orgs/powersync-ja/discussions/317 is:

If an app has many anonymous/temporary users, or regularly creates new temporary databases with unique client ids, it may end up with many write checkpoints on the service. We can never clean these up, since we don't know whether a client would ever connect again and need that write checkpoint. While each individual write checkpoint is small, this can add up over time when you have hundreds of thousands of unique users/clients.

The goal is for requested checkpoint records to be temporary, so that they can be deleted after a period of time. Current write checkpoint records can never be cleaned up. Using the same collection allows us to update/mark existing write checkpoints as requested, which makes them eligible for deletion.

Details

The Postgres and MongoDB bucket storage implementations have been updated to support both the current auto-incrementing write-checkpoint2.json endpoint behaviour and the new ability to specify a requested checkpoint ID.

The storage update behaviour diverges based on whether a requested checkpoint ID has been supplied.

No-ops

https://github.com/orgs/powersync-ja/discussions/317 mentions that checkpoint requests should be no-ops when the request_id is unchanged. This PR takes this slightly further and also requires the requested checkpoint ID to be larger than the currently stored value. The PowerSync service returns the larger of the two values as part of the /sync/checkpoint-request response. Clients can use this information to correct for certain edge cases.

No-ops in this case also prevent the replication head from advancing if no changes were made in a checkpoint batch. For write-checkpoint2.json requests, we always make a change and always advance the replication head. For requested checkpoints, if we received only duplicate requests, we attempt to skip emitting a replication event. More details are in the code comments.
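As a rough illustration of the no-op rule described above, here is a minimal sketch. It is not the code in this PR: the `CheckpointRequestResult` shape and the function names `resolveCheckpointRequest` and `batchShouldAdvance` are hypothetical, and it assumes a rejected or duplicate request simply reports the stored ID back.

```typescript
// Hypothetical result shape, for illustration only; the PR's actual types may differ.
interface CheckpointRequestResult {
  /** Largest request ID known after the call; this is what the client gets back. */
  checkpointRequestId: bigint;
  /** True only if the stored record was actually updated. */
  changed: boolean;
}

/**
 * Decide what to do with a requested checkpoint ID, given the currently
 * stored ID for the same user/client (null if no record exists).
 */
function resolveCheckpointRequest(stored: bigint | null, requested: bigint): CheckpointRequestResult {
  if (stored !== null && requested <= stored) {
    // Duplicate or stale request: keep the stored value and report no change.
    return { checkpointRequestId: stored, changed: false };
  }
  // New, larger ID: store it and report the change.
  return { checkpointRequestId: requested, changed: true };
}

/** A batch only needs to advance the replication head if something changed. */
function batchShouldAdvance(results: CheckpointRequestResult[]): boolean {
  return results.some((r) => r.changed);
}
```

With this shape, a batch made up entirely of duplicate requests yields `batchShouldAdvance(...) === false`, which corresponds to skipping the replication event.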

Client Migrations

The client needs to track and manage an increasing checkpoint_request_id. For existing users, this means they might have an existing write checkpoint record. The client should start its request sequence at or above this value in order to prevent setting a target_op below a consistency boundary.

Clients typically also re-issue checkpoint requests on connect (due to their temporary nature). If a newly migrated client does not have the checkpoint request ID seeded, it is free to attempt a checkpoint request at 1. If a record exists, the service will reject this ID and return the current largest request ID, which the client can detect and use to re-seed its sequence.

Note: This has one large caveat if we delete checkpoint request records. If a client does not have a seeded checkpoint request value and the service deleted the record - the client would have to start from 1. This could be acceptable due to the following:

  • If the client was using the older write checkpoint method and just migrated:
    • If it was pending a target op, the write checkpoint record should still be present in the DB and can be seeded back to the client
  • If the client was not pending any target (which should be the case after a disconnectAndClear), the next write checkpoint could theoretically start at 1 and resolve correctly (I believe)

One important factor here is how we track the checkpoint_request_id on the client. If we store a single value per session (and clear it in disconnectAndClear), we could reset the sequence very often if the service record has been deleted. We could theoretically store a persisted table of sequences per user_id/client_id, but that would require extracting the user_id somehow, which could be more complicated.
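The re-seeding behaviour described in this section could be sketched on the client roughly as follows. This is an assumption-laden illustration, not SDK code: `CheckpointRequestSequence`, `next` and `acknowledge` are hypothetical names, and it assumes the service response carries the largest request ID it knows about.

```typescript
// Hypothetical client-side tracker for checkpoint_request_id (illustration only).
class CheckpointRequestSequence {
  private last: bigint;

  // An unseeded client starts at 0, so its first attempt is 1.
  constructor(seed: bigint = 0n) {
    this.last = seed;
  }

  /** The ID the client should attempt next. */
  next(): bigint {
    return this.last + 1n;
  }

  /**
   * Apply the service response for an attempted ID.
   * Returns true if the attempt was accepted. Returns false if the service
   * reported a larger stored ID, in which case the sequence is re-seeded.
   */
  acknowledge(attempted: bigint, reported: bigint): boolean {
    if (reported > attempted) {
      // The service already holds a larger ID: re-seed so next() goes above it.
      this.last = reported;
      return false;
    }
    // Accepted (or an exact duplicate): never move the sequence backwards.
    this.last = attempted > this.last ? attempted : this.last;
    return true;
  }
}
```

For example, an unseeded client attempts 1, the service reports 42, and the client's following attempt is 43. Whether the client retries immediately after a rejection is left open here.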


AI Disclosure: The following was implemented by first doing a basic implementation by hand - then various changes were assisted by Claude Opus and Codex 5.5.

@changeset-bot

changeset-bot Bot commented Jun 29, 2026


⚠️ No Changeset found

Latest commit: 6d03cf3

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@rkistner

Contributor

Looks good so far!

Some minor issues picked up by Codex listed below. I didn't double-check these, but these seem like plausible issues at a glance.

1. checkpoint_request_id accepts negative and unbounded values

The route accepts codecs.bigint directly at packages/service-core/src/routes/endpoints/checkpointing.ts:13, and the codec accepts any signed integer string at libs/lib-services/src/codec/codecs.ts:53. A first request with -1 can create a negative managed write checkpoint, and a huge value passes route validation but fails later in Postgres at the ::int8 cast in modules/module-postgres-storage/src/storage/checkpoints/PostgresWriteCheckpointAPI.ts:109. This should be validated at the API boundary as a positive int64-compatible checkpoint id.
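A boundary check along these lines might look like the sketch below. It is only an illustration: `validateCheckpointRequestId` is a hypothetical helper, not an existing function in the codebase, and it assumes the codec has already produced a `bigint`.

```typescript
// Largest value that fits in a signed 64-bit integer (Postgres int8 / bigint).
const INT64_MAX = 9223372036854775807n;

/**
 * Hypothetical validator: accept only positive IDs that survive an ::int8 cast.
 * Throws a RangeError otherwise, so the route can reject the request up front.
 */
function validateCheckpointRequestId(value: bigint): bigint {
  if (value <= 0n) {
    throw new RangeError("checkpoint_request_id must be a positive integer");
  }
  if (value > INT64_MAX) {
    throw new RangeError("checkpoint_request_id must fit in a signed 64-bit integer");
  }
  return value;
}
```

Rejecting at the route keeps the failure mode the same for both storage implementations, instead of surfacing as a cast error from Postgres only.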

2. Performance: Postgres turns every supplied request into a source marker, including already-processed duplicates

modules/module-postgres-storage/src/storage/checkpoints/PostgresWriteCheckpointAPI.ts:170 intentionally sets shouldAdvance true for every supplied checkpoint because Postgres storage does not track processed state. That means stale or duplicate /sync/checkpoint-request calls still execute the keepalive/logical marker path at modules/module-postgres/src/api/PostgresRouteAPIAdapter.ts:253. This undercuts the PR’s no-op goal and can add avoidable WAL/source writes under reconnect storms or repeated client retries. Not necessarily a correctness blocker, but it is a real performance implication to address or explicitly accept.

  async createManagedWriteCheckpoints(
    checkpoints: storage.ManagedWriteCheckpointOptions[]
- ): Promise<Map<string, bigint>> {
+ ): Promise<storage.CreateManagedWriteCheckpointsResult> {

Collaborator Author

One potential issue with this implementation is that managed write checkpoints appear to store only a single checkpoint association per full user/client id. That means a newer checkpoint request can replace the LSN association for an older pending request.

This can delay write_checkpoint emission for checkpoint requests. Unlike normal write checkpoints tied to a client-side target_op barrier, (ideally) checkpoint requests should not prevent the client from applying incoming changes while waiting for a later source position.

For example:

  • replication is lagging
  • the client sends checkpoint request 42, associated with source LSN A
  • before replication reaches A, the same client sends checkpoint request 43, associated with later source LSN B
  • storage updates the single record to 43 -> B
  • when replication reaches A, there is no longer a stored 42 -> A association to emit
  • the client only receives 43 once replication reaches B, so the earlier request is delayed by unrelated later work
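The sequence above can be reproduced with a toy model of a store that keeps one association per client. This is purely illustrative: `SingleRecordStore` is a hypothetical class, and plain numbers stand in for request IDs and source LSNs (A = 100, B = 200).

```typescript
// One (request ID -> LSN) association, as assumed for a single user/client id.
interface Association {
  requestId: number;
  lsn: number;
}

// Toy model: a newer request replaces the previous association outright.
class SingleRecordStore {
  private record: Association | null = null;

  put(requestId: number, lsn: number): void {
    this.record = { requestId, lsn };
  }

  /**
   * Returns the request ID that can be emitted as write_checkpoint once
   * replication has reached `replicatedLsn`, or null if nothing is ready.
   */
  emitAt(replicatedLsn: number): number | null {
    if (this.record !== null && this.record.lsn <= replicatedLsn) {
      return this.record.requestId;
    }
    return null;
  }
}

const LSN_A = 100;
const LSN_B = 200;

const store = new SingleRecordStore();
store.put(42, LSN_A); // request 42 -> A
store.put(43, LSN_B); // request 43 -> B replaces 42 -> A

const atA = store.emitAt(LSN_A); // nothing to emit: 42 -> A is gone
const atB = store.emitAt(LSN_B); // only now is 43 emitted
```

In this model `atA` is `null` and `atB` is `43`, which matches the delay described in the list.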

Contributor

This should not be an issue for the original use case of blocking sync until the latest checkpoint request is acknowledged: Once the client sends checkpoint request 43, request 42 has no use on the client anymore.

But I guess this changes when we implement explicit checkpoint requests as proposed in https://github.com/orgs/powersync-ja/discussions/324?

Collaborator Author

But I guess this changes when we implement explicit checkpoint requests as proposed in

Yup, exactly correct. If both use cases use the same underlying write_checkpoint record, then the extra delay could occur.

Contributor

I think that's fine for now.

This only has an effect if all of the following are true:

  1. The upload queue is empty.
  2. There is some replication lag.
  3. The client requests checkpoints at an interval shorter than the replication lag.

I don't think it's worth changing the storage format to cater for that case right now, but we can consider addressing that when we do future storage changes.
