Skip to content

[SPARK-57507][SQL] Fix UTF8String.reverse reading past the end of a truncated trailing UTF-8 sequence#56569

Open
LuciferYang wants to merge 1 commit into
apache:masterfrom
LuciferYang:utf8string-reverse-truncated-fix
Open

[SPARK-57507][SQL] Fix UTF8String.reverse reading past the end of a truncated trailing UTF-8 sequence#56569
LuciferYang wants to merge 1 commit into
apache:masterfrom
LuciferYang:utf8string-reverse-truncated-fix

Conversation

@LuciferYang

@LuciferYang LuciferYang commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

UTF8String.reverse() reverses a string one UTF-8 character at a time, using numBytesForFirstByte(getByte(i)) to determine the width of the character at byte position i. The per-character copy length was clamped to numBytes (the total length) instead of numBytes - i (the bytes that actually remain). When the last bytes of the string are a truncated multi-byte sequence (a leader byte whose declared width exceeds the remaining bytes), copyMemory read past the end of the string into adjacent memory. This PR changes the clamp to numBytes - i.

Why are the changes needed?

UTF8String can hold malformed UTF-8 (for example, bytes produced by binary coercion or truncated input). For such a string ending in an incomplete multi-byte sequence, reverse() performed an out-of-bounds read and produced a wrong result. This PR clamps the per-character copy length to the bytes that actually remain, so the read stays in bounds. Well-formed UTF-8 is unaffected, since a complete sequence never exceeds the remaining bytes.

Note: other helpers that take a character width from numBytesForFirstByte without clamping (codePointFrom, and trimLeft/trimRight via copyUTF8String) can over-read on the same truncated input. This PR is scoped to reverse(); the siblings are tracked in SPARK-57520.

Does this PR introduce any user-facing change?

Yes, it fixes incorrect results. The SQL reverse() function on a string value that contains malformed UTF-8 ending in a truncated multi-byte sequence no longer reads past the end of the value; only previously-incorrect results change.

How was this patch tested?

Added cases to UTF8StringSuite#reverse() for truncated trailing 2-, 3-, and 4-byte leaders, and for a complete multi-byte character followed by an orphan leader. Each uses a sliced backing array with a trailing sentinel byte so the previous over-read produces a deterministically wrong value; the cases fail on the old code and pass with the fix.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

…runcated trailing UTF-8 sequence

### What changes were proposed in this pull request?

`UTF8String.reverse()` reverses a string one UTF-8 character at a time, using
`numBytesForFirstByte(getByte(i))` to determine the width of the character at byte
position `i`. The per-character copy length was clamped to `numBytes` (the total
length) instead of `numBytes - i` (the bytes that actually remain). When the last
bytes of the string are a truncated multi-byte sequence (a leader byte whose
declared width exceeds the remaining bytes), `copyMemory` read past the end of the
string into adjacent memory. This changes the clamp to `numBytes - i`.

### Why are the changes needed?

`UTF8String` can hold malformed UTF-8 (for example, bytes produced by binary
coercion or truncated input). For such a string ending in an incomplete multi-byte
sequence, `reverse()` performed an out-of-bounds read and produced a wrong result.
Every other byte scan in the class already bounds by the remaining bytes; this
brings `reverse()` in line. Well-formed UTF-8 is unaffected, since a complete
sequence never exceeds the remaining bytes.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes incorrect results. The SQL `reverse()` function on a string value
that contains malformed UTF-8 ending in a truncated multi-byte sequence no longer
reads past the end of the value; only previously-incorrect results change.

### How was this patch tested?

Added cases to `UTF8StringSuite#reverse()` for truncated trailing 2-, 3-, and
4-byte leaders, and for a complete multi-byte character followed by an orphan
leader. Each uses a sliced backing array with a trailing sentinel byte so the
previous over-read produces a deterministically wrong value; the cases fail on the
old code and pass with the fix.

@uros-b uros-b left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this @LuciferYang! This is an interesting fix for malformed UTF-8 strings ending in a truncated multi-byte sequence. Historically, we have had quite a few similar issues across the codebase. Adding @mkaravel @srielau @cloud-fan to take a look at this.

@uros-b uros-b requested a review from cloud-fan June 17, 2026 16:29
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-57507][SQL] Fix UTF8String.reverse reading past the end of a truncated trailing UTF-8 sequence [SPARK-57507][SQL] Fix UTF8String.reverse reading past the end of a truncated trailing UTF-8 sequence Jun 17, 2026

@dongjoon-hyun dongjoon-hyun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @LuciferYang .

@cloud-fan cloud-fan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 blocking, 1 non-blocking, 0 nits.
Correct, minimal, well-tested bounds fix for the reverse() over-read.

Correctness (1)

  • UTF8String.java:1160: fix is correct, but the description's "every other byte scan in the class already bounds by the remaining bytes" isn't accurate — see inline

Verification

Traced the fix: with len = min(width, numBytes - i) the copyMemory read stays within [offset, offset+numBytes) for every i, and well-formed UTF-8 is unaffected since a complete sequence never exceeds the remaining bytes. The four new UTF8StringSuite#reverse() cases (truncated 2-/3-/4-byte leaders and a complete char + orphan leader) each carry a trailing sentinel byte beyond the slice, so the old over-read produces a deterministically wrong value — I traced all four and they fail on the old code and pass with the fix.

int i = 0; // position in byte
while (i < numBytes) {
int len = Math.min(numBytesForFirstByte(getByte(i)), numBytes);
int len = Math.min(numBytesForFirstByte(getByte(i)), numBytes - i);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix here is correct. One note on the rationale: the PR description's claim that "every other byte scan in the class already bounds by the remaining bytes" isn't quite accurate. trimLeft (L1052), trimRight (L1124/L1135, via copyUTF8String), and codePointFrom (L736-742) all derive a per-character width from numBytesForFirstByte(...) and read that many bytes without clamping to the remaining bytes — copyUTF8String does an unclamped copyMemory (L944-946) and getByte is an unchecked Platform.getByte. So on the same truncated-trailing-sequence input that motivates this PR, those peers over-read identically. Worth softening the description, or filing a follow-up for the siblings (cf. @uros-b's note about similar issues across the codebase). Not blocking — the reverse() change itself is clean and well-tested.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cloud-fan, good catch. That sentence was inaccurate: codePointFrom, trimLeft, and trimRight (the latter two via copyUTF8String) all take the width from numBytesForFirstByte without clamping to the remaining bytes, so they over-read on the same truncated-trailing-sequence input. I've updated the description to drop that claim and keep this PR scoped to reverse(), and filed SPARK-57520 to handle the siblings (and the unclamped copyUTF8String).

@cloud-fan

Copy link
Copy Markdown
Contributor

do we need to backport to 4.2? this is a data corruption issue, cc @huaxingao @dongjoon-hyun

@LuciferYang

Copy link
Copy Markdown
Contributor Author
image

CI passed

@LuciferYang

Copy link
Copy Markdown
Contributor Author

do we need to backport to 4.2?

I think we should

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants