Skip to content

fix(sync-rules): upper()/lower() should be ASCII-only to match SQLite#663

Closed
sravan27 wants to merge 1 commit into
powersync-ja:mainfrom
sravan27:fix-sync-streams-ascii-upper-lower
Closed

fix(sync-rules): upper()/lower() should be ASCII-only to match SQLite#663
sravan27 wants to merge 1 commit into
powersync-ja:mainfrom
sravan27:fix-sync-streams-ascii-upper-lower

Conversation

@sravan27

@sravan27 sravan27 commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Same silent-data-loss class as the merged Sync Streams fixes #644 / #645 / #646 / #647 and the #565 JOIN loud-error PR #662 (open).

The bug

sql_functions.ts implemented upper() / lower() with String.prototype.toUpperCase() / .toLowerCase():

const upper: DocumentedSqlFunction = {
  call(value: SqliteValue) {
    const text = castAsText(value);
    return text?.toUpperCase() ?? null;
  },
  ...
};

JS's case methods are Unicode-aware and length-changing:

  • 'straße'.toUpperCase()'STRASSE' (ß becomes SS — length 6 → 7)
  • 'file'.toUpperCase()'FILE' (fi ligature becomes FI — length 3 → 4)
  • 'İstanbul'.toLowerCase() → adds a combining dot above

SQLite's upper() / lower() are ASCII-only by default: only a-zA-Z is mapped; every other character (including all of the above) passes through unchanged. The PowerSync client runs SQLite, so the server's bucket evaluator and the client's query disagree on the result of upper(name) the moment name contains a non-ASCII letter:

Source value Server upper() (JS) Client upper() (SQLite) Bucket key match?
"straße" "STRASSE" "STRAßE" ❌ silent miss
"file" "FILE" "fiLE" ❌ silent miss
"İstanbul" depends on locale, length-changed unchanged ❌ silent miss
"hello" "HELLO" "HELLO"

Effect: bucket-parameter expressions using UPPER(...) / LOWER(...) silently route the affected rows to the wrong bucket (or no bucket at all). No error, no warning — just missing rows in the client's local DB.

Fix

Two small helpers in sql_functions.ts that walk the string and only map a-zA-Z (the SQLite default), used by upper and lower:

function asciiToUpper(text: string): string {
  let out = '';
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    out += code >= 97 && code <= 122 ? String.fromCharCode(code - 32) : text[i];
  }
  return out;
}

The detail strings on upper and lower were updated so the registry doc reflects the actual behaviour.

Tests

New syncTest('upper / lower use ASCII-only semantics (matches SQLite default)') in sqlite_semantics.test.ts:

  • ASCII control ("hello", "Hello World") — unchanged behaviour
  • "straße", "file", "İstanbul" — non-ASCII letters preserved instead of length-changed
  • Length-preservation invariant ('straße'.upper.length == 6, not 7) — pins the silent-data-loss surface explicitly

The existing transforming things with upper-case functions test in sync_rules.test.ts keeps passing (it uses ASCII-only data — 'TEST', 'USER1'). Full sync_rules.test.ts: 40/40.

Compatibility note

If a user is already relying on the Unicode-aware behaviour (e.g. their bucket keys are currently "STRASSE"-shaped), this is a behaviour change. The previous behaviour was already broken in practice — the bucket keys never matched what the client computes — so any working setup is already ASCII-only or has been silently missing rows. The fix brings the server side in line with the actually-working comparison side.

@changeset-bot

changeset-bot Bot commented Jun 6, 2026

Copy link
Copy Markdown

🦋 Changeset detected

Latest commit: f044aca

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 17 packages
Name Type
@powersync/service-sync-rules Patch
@powersync/service-core Patch
@powersync/lib-services-framework Patch
@powersync/service-module-convex Patch
@powersync/service-module-mongodb-storage Patch
@powersync/service-module-mongodb Patch
@powersync/service-module-mssql Patch
@powersync/service-module-mysql Patch
@powersync/service-module-postgres-storage Patch
@powersync/service-module-postgres Patch
@powersync/service-module-core Patch
@powersync/service-image Patch
test-client Patch
@powersync/service-rsocket-router Patch
@powersync/lib-service-mongodb Patch
@powersync/lib-service-postgres Patch
@powersync/service-schema Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@CLAassistant

CLAassistant commented Jun 6, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@simolus3

simolus3 commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

the bucket keys never matched what the client computes

Realistically, this is not an issue. Bucket names are internal, so the only way this could cause a mismatch is if clients would use lower() / upper() as an SQL function to generate Sync Stream parameters to subscribe to. Most users would use JS functions on the client though, so this might actually make things worse.

There's also nothing that stops users from building SQLite with ICU extensions enabled, which would align client behavior with the current implementation in the Service. IMO, being Unicode aware is better than only supporting ASCII, and since we can't match all SQLite installations I think the current implementation is better.

In fact, we might want to override upper() / lower() in the SQLite expression engine to use the unicode-aware JavaScript implementation for consistency.

sravan27 pushed a commit to sravan27/powersync-service that referenced this pull request Jun 6, 2026
…QLite

JS String.prototype.length counts UTF-16 code units. SQLite length() returns characters (code points). For any non-BMP code point (emoji 😀, CJK Extension B-G, ancient scripts) the two diverge: 2 units vs 1 character. Bucket-key expressions like length(name) silently routed rows to the wrong bucket.

Same silent-failure class as merged powersync-ja#644 / powersync-ja#645 / powersync-ja#646 / powersync-ja#647, open powersync-ja#565 JOIN PR powersync-ja#662 and ASCII upper/lower PR powersync-ja#663.
sravan27 pushed a commit to sravan27/powersync-service that referenced this pull request Jun 6, 2026
…QLite

JS String.prototype.length counts UTF-16 code units. SQLite length() returns characters (code points). For any non-BMP code point (emoji 😀, CJK Extension B-G, ancient scripts) the two diverge: 2 units vs 1 character. Bucket-key expressions like length(name) silently routed rows to the wrong bucket.

Same silent-failure class as merged powersync-ja#644 / powersync-ja#645 / powersync-ja#646 / powersync-ja#647, open powersync-ja#565 JOIN PR powersync-ja#662 and ASCII upper/lower PR powersync-ja#663.
sravan27 pushed a commit to sravan27/powersync-service that referenced this pull request Jun 6, 2026
Previously length() on text returned String.prototype.length, which counts UTF-16 code units. JavaScript counts a non-BMP code point (emoji like 😀, CJK Extension B-G, ancient scripts, etc.) as 2 code units, but SQLite's length() returns the character count (1 code point = 1 character).

Effect: a bucket-key expression like length(name) computed a different integer on the server vs the SQLite client for any row containing such characters - rows ended up routed to the wrong bucket (or no bucket at all).

Same silent-data-loss class as the merged powersync-ja#644 / powersync-ja#645 / powersync-ja#646 / powersync-ja#647 and the open powersync-ja#565 JOIN loud-error PR powersync-ja#662 and the ASCII upper/lower fix PR powersync-ja#663.

Test: new regression in sqlite_semantics.test.ts covering ASCII (control), BMP characters (ß stays 6), and non-BMP code points (emoji, U+10000) which now correctly count as 1. Existing 40/40 sync_rules.test.ts still pass.
sravan27 pushed a commit to sravan27/powersync-service that referenced this pull request Jun 6, 2026
…QLite

length() on text returned String.prototype.length, which counts UTF-16 code units. JavaScript counts non-BMP code points (emoji 😀, CJK Extension B-G, ancient scripts) as 2 code units, but SQLite's length() returns characters (1 code point = 1 character).

Effect: bucket-key expressions like length(name) computed different integers server vs client for any row with such characters - rows silently routed to wrong buckets.

Same class as merged powersync-ja#644 / powersync-ja#645 / powersync-ja#646 / powersync-ja#647 and the open powersync-ja#565 JOIN PR powersync-ja#662 + ASCII upper/lower PR powersync-ja#663.
sravan27 pushed a commit to sravan27/powersync-service that referenced this pull request Jun 6, 2026
…QLite

JS String.prototype.length counts UTF-16 code units. SQLite length() returns characters (code points). For any non-BMP code point (emoji 😀, CJK Extension B-G, ancient scripts) the two diverge: 2 units vs 1 character. Bucket-key expressions like length(name) silently routed rows to the wrong bucket.

Same silent-failure class as merged powersync-ja#644 / powersync-ja#645 / powersync-ja#646 / powersync-ja#647, open powersync-ja#565 JOIN PR powersync-ja#662 and ASCII upper/lower PR powersync-ja#663.
sravan27 pushed a commit to sravan27/powersync-service that referenced this pull request Jun 6, 2026
substring() used String.prototype.substring and .length, which count UTF-16 code units. For non-BMP code points (emoji 😀, CJK Extension B-G, ancient scripts) slicing in the middle of a surrogate pair returned a broken unpaired surrogate. Server-side substring output silently disagreed with SQLite client output.

Same silent-failure class as merged powersync-ja#644 / powersync-ja#645 / powersync-ja#646 / powersync-ja#647 and open PRs powersync-ja#662 (JOIN) / powersync-ja#663 (ASCII upper/lower) / powersync-ja#664 (length code points).
sravan27 pushed a commit to sravan27/powersync-service that referenced this pull request Jun 6, 2026
…QLite

length() on text returned String.prototype.length, which counts UTF-16 code units. JavaScript counts non-BMP code points (emoji 😀, CJK Extension B-G, ancient scripts) as 2 code units, but SQLite's length() returns characters (1 code point = 1 character).

Effect: bucket-key expressions like length(name) computed different integers server vs client for any row with such characters - rows silently routed to wrong buckets.

Same class as merged powersync-ja#644 / powersync-ja#645 / powersync-ja#646 / powersync-ja#647 and the open powersync-ja#565 JOIN PR powersync-ja#662 + ASCII upper/lower PR powersync-ja#663.
@sravan27 sravan27 force-pushed the fix-sync-streams-ascii-upper-lower branch 2 times, most recently from 7a5016c to 374e087 Compare June 6, 2026 16:14
Previously the server used String.prototype.toUpperCase()/.toLowerCase(), which are Unicode-aware and perform length-changing case folds (ß -> SS, fi -> FI). SQLite's default is ASCII-only, so server-side bucket keys silently disagreed with client-side parameter values for any non-ASCII letter.

Same silent-failure class as merged powersync-ja#644 / powersync-ja#645 / powersync-ja#646 / powersync-ja#647 and open PR powersync-ja#662 (powersync-ja#565 JOIN loud-error).
@sravan27 sravan27 force-pushed the fix-sync-streams-ascii-upper-lower branch from 374e087 to f044aca Compare June 7, 2026 09:12
sravan27 added a commit to sravan27/powersync-service that referenced this pull request Jun 7, 2026
…QLite

JS String.prototype.length counts UTF-16 code units. SQLite length() returns characters (code points). For any non-BMP code point (emoji 😀, CJK Extension B-G, ancient scripts) the two diverge: 2 units vs 1 character. Bucket-key expressions like length(name) silently routed rows to the wrong bucket.

Same silent-failure class as merged powersync-ja#644 / powersync-ja#645 / powersync-ja#646 / powersync-ja#647, open powersync-ja#565 JOIN PR powersync-ja#662 and ASCII upper/lower PR powersync-ja#663.
sravan27 added a commit to sravan27/powersync-service that referenced this pull request Jun 7, 2026
substring() used String.prototype.substring and .length, which count UTF-16 code units. For non-BMP code points (emoji 😀, CJK Extension B-G, ancient scripts) slicing in the middle of a surrogate pair returned a broken unpaired surrogate. Server-side substring output silently disagreed with SQLite client output.

Same silent-failure class as merged powersync-ja#644 / powersync-ja#645 / powersync-ja#646 / powersync-ja#647 and open PRs powersync-ja#662 (JOIN) / powersync-ja#663 (ASCII upper/lower) / powersync-ja#664 (length code points).
@sravan27

sravan27 commented Jun 7, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the pushback. Two follow-ups since you make the call:

  1. Internal-bucket-key framing: Agree the JS server / SQLite client mismatch on upper() / lower() is unobservable when both ends are the same engine. The case I had in mind is actually narrower than I wrote it - someone composing a sync-stream parameter on the client with upper(input) (in a stream definition) and then writing it as a bucket key gets STRA+ß+E on the server while a client that re-evaluates upper("straße") against an ICU-enabled SQLite gets STRASSE. As you say, ICU-enabled SQLite builds align that, so the divergence only bites the default ICU-disabled build crossed with a JS engine that does Unicode folding.

  2. JS functions on the client: Also fair - most clients filter in JS where .toUpperCase() does the same thing. The non-aligned case is parameter-construction or bucket-key derivation done via SQL upper() / lower() rather than a JS expression.

Given that, happy to close this and move on, unless you would still take a docs-only change to call out that SQL upper() / lower() over non-ASCII data is a footgun and recommend either (a) pre-folding in JS or (b) building SQLite with ICU. I will close otherwise.

The length() (#664) and substring() (#665) cases are unrelated and not subject to this objection - they diverge from SQLite for the default build too, no ICU involved. Curious whether those fit the same threshold.

@sravan27

sravan27 commented Jun 7, 2026

Copy link
Copy Markdown
Contributor Author

recheck

@sravan27

sravan27 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

On the direction you flagged for follow-up:

we might want to override upper() / lower() in the SQLite expression engine to use the unicode-aware JavaScript implementation for consistency

If that's worth scoping, I have the wa-sqlite hook surfaced (SQLiteAPI.create_function is exposed by @journeyapps/wa-sqlite 1.7.0; the natural insertion point is right after open_v2 in RawSqliteConnection.init so the override applies before any user query) and op-sqlite has updateHook for the native side. Open to taking it as a fixed-scope sprint — $1k flat, covers web (wa-sqlite) + op-sqlite + node + the matching test pass that asserts client-side upper()/lower()/length()/substr() now match what the Service evaluator produces over the same inputs. Same terms as the May sprint with Kobie.

Happy to scope to web-only first if that's cleaner; or close this entirely and move it to a Proposal Discussion if that's the better thread for it. Either way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants