[SPARK-57507][SQL] Fix `UTF8String.reverse` reading past the end of a truncated trailing UTF-8 sequence by LuciferYang · Pull Request #56569 · apache/spark

LuciferYang · 2026-06-17T14:54:57Z

What changes were proposed in this pull request?

UTF8String.reverse() reverses a string one UTF-8 character at a time, using numBytesForFirstByte(getByte(i)) to determine the width of the character at byte position i. The per-character copy length was clamped to numBytes (the total length) instead of numBytes - i (the bytes that actually remain). When the last bytes of the string are a truncated multi-byte sequence (a leader byte whose declared width exceeds the remaining bytes), copyMemory read past the end of the string into adjacent memory. This PR changes the clamp to numBytes - i.

Why are the changes needed?

UTF8String can hold malformed UTF-8 (for example, bytes produced by binary coercion or truncated input). For such a string ending in an incomplete multi-byte sequence, reverse() performed an out-of-bounds read and produced a wrong result. This PR clamps the per-character copy length to the bytes that actually remain, so the read stays in bounds. Well-formed UTF-8 is unaffected, since a complete sequence never exceeds the remaining bytes.

Note: other helpers that take a character width from numBytesForFirstByte without clamping (codePointFrom, and trimLeft/trimRight via copyUTF8String) can over-read on the same truncated input. This PR is scoped to reverse(); the siblings are tracked in SPARK-57520.

Does this PR introduce any user-facing change?

Yes, it fixes incorrect results. The SQL reverse() function on a string value that contains malformed UTF-8 ending in a truncated multi-byte sequence no longer reads past the end of the value; only previously-incorrect results change.

How was this patch tested?

Added cases to UTF8StringSuite#reverse() for truncated trailing 2-, 3-, and 4-byte leaders, and for a complete multi-byte character followed by an orphan leader. Each uses a sliced backing array with a trailing sentinel byte so the previous over-read produces a deterministically wrong value; the cases fail on the old code and pass with the fix.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

…runcated trailing UTF-8 sequence ### What changes were proposed in this pull request? `UTF8String.reverse()` reverses a string one UTF-8 character at a time, using `numBytesForFirstByte(getByte(i))` to determine the width of the character at byte position `i`. The per-character copy length was clamped to `numBytes` (the total length) instead of `numBytes - i` (the bytes that actually remain). When the last bytes of the string are a truncated multi-byte sequence (a leader byte whose declared width exceeds the remaining bytes), `copyMemory` read past the end of the string into adjacent memory. This changes the clamp to `numBytes - i`. ### Why are the changes needed? `UTF8String` can hold malformed UTF-8 (for example, bytes produced by binary coercion or truncated input). For such a string ending in an incomplete multi-byte sequence, `reverse()` performed an out-of-bounds read and produced a wrong result. Every other byte scan in the class already bounds by the remaining bytes; this brings `reverse()` in line. Well-formed UTF-8 is unaffected, since a complete sequence never exceeds the remaining bytes. ### Does this PR introduce _any_ user-facing change? Yes, it fixes incorrect results. The SQL `reverse()` function on a string value that contains malformed UTF-8 ending in a truncated multi-byte sequence no longer reads past the end of the value; only previously-incorrect results change. ### How was this patch tested? Added cases to `UTF8StringSuite#reverse()` for truncated trailing 2-, 3-, and 4-byte leaders, and for a complete multi-byte character followed by an orphan leader. Each uses a sliced backing array with a trailing sentinel byte so the previous over-read produces a deterministically wrong value; the cases fail on the old code and pass with the fix.

uros-b

Thank you for working on this @LuciferYang! This is an interesting fix for malformed UTF-8 strings ending in a truncated multi-byte sequence. Historically, we have had quite a few similar issues across the codebase. Adding @mkaravel @srielau @cloud-fan to take a look at this.

dongjoon-hyun

+1, LGTM. Thank you, @LuciferYang .

cloud-fan

0 blocking, 1 non-blocking, 0 nits.
Correct, minimal, well-tested bounds fix for the reverse() over-read.

Correctness (1)

UTF8String.java:1160: fix is correct, but the description's "every other byte scan in the class already bounds by the remaining bytes" isn't accurate — see inline

Verification

Traced the fix: with len = min(width, numBytes - i) the copyMemory read stays within [offset, offset+numBytes) for every i, and well-formed UTF-8 is unaffected since a complete sequence never exceeds the remaining bytes. The four new UTF8StringSuite#reverse() cases (truncated 2-/3-/4-byte leaders and a complete char + orphan leader) each carry a trailing sentinel byte beyond the slice, so the old over-read produces a deterministically wrong value — I traced all four and they fail on the old code and pass with the fix.

cloud-fan · 2026-06-17T21:37:08Z

    int i = 0; // position in byte
    while (i < numBytes) {
-      int len = Math.min(numBytesForFirstByte(getByte(i)), numBytes);
+      int len = Math.min(numBytesForFirstByte(getByte(i)), numBytes - i);


The fix here is correct. One note on the rationale: the PR description's claim that "every other byte scan in the class already bounds by the remaining bytes" isn't quite accurate. trimLeft (L1052), trimRight (L1124/L1135, via copyUTF8String), and codePointFrom (L736-742) all derive a per-character width from numBytesForFirstByte(...) and read that many bytes without clamping to the remaining bytes — copyUTF8String does an unclamped copyMemory (L944-946) and getByte is an unchecked Platform.getByte. So on the same truncated-trailing-sequence input that motivates this PR, those peers over-read identically. Worth softening the description, or filing a follow-up for the siblings (cf. @uros-b's note about similar issues across the codebase). Not blocking — the reverse() change itself is clean and well-tested.

Thanks @cloud-fan, good catch. That sentence was inaccurate: codePointFrom, trimLeft, and trimRight (the latter two via copyUTF8String) all take the width from numBytesForFirstByte without clamping to the remaining bytes, so they over-read on the same truncated-trailing-sequence input. I've updated the description to drop that claim and keep this PR scoped to reverse(), and filed SPARK-57520 to handle the siblings (and the unclamped copyUTF8String).

cloud-fan · 2026-06-18T04:35:42Z

do we need to backport to 4.2? this is a data corruption issue, cc @huaxingao @dongjoon-hyun

LuciferYang · 2026-06-18T04:49:30Z

CI passed

LuciferYang · 2026-06-18T04:51:15Z

do we need to backport to 4.2?

I think we should

uros-b approved these changes Jun 17, 2026

View reviewed changes

uros-b requested a review from cloud-fan June 17, 2026 16:29

dongjoon-hyun changed the title ~~[SPARK-57507][SQL] Fix UTF8String.reverse reading past the end of a truncated trailing UTF-8 sequence~~ [SPARK-57507][SQL] Fix UTF8String.reverse reading past the end of a truncated trailing UTF-8 sequence Jun 17, 2026

dongjoon-hyun approved these changes Jun 17, 2026

View reviewed changes

sarutak approved these changes Jun 17, 2026

View reviewed changes

cloud-fan reviewed Jun 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-57507][SQL] Fix `UTF8String.reverse` reading past the end of a truncated trailing UTF-8 sequence#56569

[SPARK-57507][SQL] Fix `UTF8String.reverse` reading past the end of a truncated trailing UTF-8 sequence#56569
LuciferYang wants to merge 1 commit into
apache:masterfrom
LuciferYang:utf8string-reverse-truncated-fix

LuciferYang commented Jun 17, 2026 •

edited

Loading

Uh oh!

uros-b left a comment

Uh oh!

dongjoon-hyun left a comment

Uh oh!

cloud-fan left a comment

Uh oh!

cloud-fan Jun 17, 2026

Uh oh!

LuciferYang Jun 18, 2026

Uh oh!

cloud-fan commented Jun 18, 2026

Uh oh!

LuciferYang commented Jun 18, 2026

Uh oh!

LuciferYang commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

LuciferYang commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

uros-b left a comment

Choose a reason for hiding this comment

Uh oh!

dongjoon-hyun left a comment

Choose a reason for hiding this comment

Uh oh!

cloud-fan left a comment

Choose a reason for hiding this comment

Correctness (1)

Verification

Uh oh!

cloud-fan Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

LuciferYang Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

cloud-fan commented Jun 18, 2026

Uh oh!

LuciferYang commented Jun 18, 2026

Uh oh!

LuciferYang commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

LuciferYang commented Jun 17, 2026 •

edited

Loading