[SPARK-57507][SQL] Fix UTF8String.reverse reading past the end of a truncated trailing UTF-8 sequence #56569
…runcated trailing UTF-8 sequence

### What changes were proposed in this pull request?

`UTF8String.reverse()` reverses a string one UTF-8 character at a time, using `numBytesForFirstByte(getByte(i))` to determine the width of the character at byte position `i`. The per-character copy length was clamped to `numBytes` (the total length) instead of `numBytes - i` (the bytes that actually remain). When the last bytes of the string are a truncated multi-byte sequence (a leader byte whose declared width exceeds the remaining bytes), `copyMemory` read past the end of the string into adjacent memory. This changes the clamp to `numBytes - i`.

### Why are the changes needed?

`UTF8String` can hold malformed UTF-8 (for example, bytes produced by binary coercion or truncated input). For such a string ending in an incomplete multi-byte sequence, `reverse()` performed an out-of-bounds read and produced a wrong result. Every other byte scan in the class already bounds by the remaining bytes; this brings `reverse()` in line. Well-formed UTF-8 is unaffected, since a complete sequence never exceeds the remaining bytes.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes incorrect results. The SQL `reverse()` function on a string value that contains malformed UTF-8 ending in a truncated multi-byte sequence no longer reads past the end of the value; only previously-incorrect results change.

### How was this patch tested?

Added cases to `UTF8StringSuite#reverse()` for truncated trailing 2-, 3-, and 4-byte leaders, and for a complete multi-byte character followed by an orphan leader. Each uses a sliced backing array with a trailing sentinel byte so the previous over-read produces a deterministically wrong value; the cases fail on the old code and pass with the fix.
uros-b
left a comment
Thank you for working on this @LuciferYang! This is an interesting fix for malformed UTF-8 strings ending in a truncated multi-byte sequence. Historically, we have had quite a few similar issues across the codebase. Adding @mkaravel @srielau @cloud-fan to take a look at this.
dongjoon-hyun
left a comment
+1, LGTM. Thank you, @LuciferYang .
cloud-fan
left a comment
0 blocking, 1 non-blocking, 0 nits.
Correct, minimal, well-tested bounds fix for the reverse() over-read.
Correctness (1)
- UTF8String.java:1160: fix is correct, but the description's "every other byte scan in the class already bounds by the remaining bytes" isn't accurate — see inline
Verification
Traced the fix: with len = min(width, numBytes - i) the copyMemory read stays within [offset, offset+numBytes) for every i, and well-formed UTF-8 is unaffected since a complete sequence never exceeds the remaining bytes. The four new UTF8StringSuite#reverse() cases (truncated 2-/3-/4-byte leaders and a complete char + orphan leader) each carry a trailing sentinel byte beyond the slice, so the old over-read produces a deterministically wrong value — I traced all four and they fail on the old code and pass with the fix.
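The bounds argument above can be reproduced outside Spark with a plain-Java sketch. This is not the `UTF8String` implementation: `ReverseClampSketch`, `widthOfLeadByte`, and the `clampToRemaining` flag are hypothetical names, the width function is a simplified stand-in for `numBytesForFirstByte`, and `System.arraycopy` replaces the unchecked `copyMemory`. Because Java arrays are bounds-checked, the old clamp surfaces here as an `IndexOutOfBoundsException` rather than a silent over-read.

```java
import java.util.Arrays;

final class ReverseClampSketch {
    // Simplified stand-in for numBytesForFirstByte (an assumption, not Spark's table):
    // width declared by a UTF-8 lead byte, with 1 for continuation or invalid bytes.
    static int widthOfLeadByte(byte b) {
        int v = b & 0xFF;
        if (v < 0x80) return 1;
        if (v >= 0xC0 && v < 0xE0) return 2;
        if (v >= 0xE0 && v < 0xF0) return 3;
        if (v >= 0xF0 && v < 0xF8) return 4;
        return 1;
    }

    // Reverses the slice backing[offset, offset + numBytes) one character at a time.
    // clampToRemaining = false models the old clamp (numBytes);
    // clampToRemaining = true models the fixed clamp (numBytes - i).
    static byte[] reverse(byte[] backing, int offset, int numBytes, boolean clampToRemaining) {
        byte[] result = new byte[numBytes];
        int i = 0; // position in bytes within the slice
        while (i < numBytes) {
            int width = widthOfLeadByte(backing[offset + i]);
            int len = Math.min(width, clampToRemaining ? numBytes - i : numBytes);
            // With the fixed clamp, i + len <= numBytes, so both the source range
            // and the destination index result.length - i - len stay in bounds.
            System.arraycopy(backing, offset + i, result, result.length - i - len, len);
            i += len;
        }
        return result;
    }

    public static void main(String[] args) {
        // A complete 2-byte character (C3 A9), an orphan 3-byte leader (E4),
        // and a sentinel byte (58) that lies beyond the 3-byte slice.
        byte[] backing = {(byte) 0xC3, (byte) 0xA9, (byte) 0xE4, 0x58};
        System.out.println(Arrays.toString(reverse(backing, 0, 3, true)));
        try {
            reverse(backing, 0, 3, false);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("old clamp went out of bounds");
        }
    }
}
```

With the fixed clamp the orphan leader is copied as a single byte, so the sentinel never enters the result.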
      int i = 0; // position in byte
      while (i < numBytes) {
    -   int len = Math.min(numBytesForFirstByte(getByte(i)), numBytes);
    +   int len = Math.min(numBytesForFirstByte(getByte(i)), numBytes - i);
The fix here is correct. One note on the rationale: the PR description's claim that "every other byte scan in the class already bounds by the remaining bytes" isn't quite accurate. trimLeft (L1052), trimRight (L1124/L1135, via copyUTF8String), and codePointFrom (L736-742) all derive a per-character width from numBytesForFirstByte(...) and read that many bytes without clamping to the remaining bytes — copyUTF8String does an unclamped copyMemory (L944-946) and getByte is an unchecked Platform.getByte. So on the same truncated-trailing-sequence input that motivates this PR, those peers over-read identically. Worth softening the description, or filing a follow-up for the siblings (cf. @uros-b's note about similar issues across the codebase). Not blocking — the reverse() change itself is clean and well-tested.
Thanks @cloud-fan, good catch. That sentence was inaccurate: codePointFrom, trimLeft, and trimRight (the latter two via copyUTF8String) all take the width from numBytesForFirstByte without clamping to the remaining bytes, so they over-read on the same truncated-trailing-sequence input. I've updated the description to drop that claim and keep this PR scoped to reverse(), and filed SPARK-57520 to handle the siblings (and the unclamped copyUTF8String).
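For context on which inputs trigger the shared pattern, here is a minimal sketch that finds the first lead byte whose declared width exceeds the bytes remaining in a slice. It is illustrative only: `TruncatedTailSketch`, `declaredWidth`, and `findTruncatedLead` are hypothetical names, the width function is a simplified stand-in for `numBytesForFirstByte`, and the scan does not validate continuation bytes. It says nothing about how SPARK-57520 will address the siblings.

```java
final class TruncatedTailSketch {
    // Simplified stand-in for numBytesForFirstByte (an assumption, not Spark's table).
    static int declaredWidth(byte lead) {
        int v = lead & 0xFF;
        if (v < 0x80) return 1;
        if (v >= 0xC0 && v < 0xE0) return 2;
        if (v >= 0xE0 && v < 0xF0) return 3;
        if (v >= 0xF0 && v < 0xF8) return 4;
        return 1; // continuation or invalid byte: advance by one
    }

    // Returns the slice-relative index of the first lead byte whose declared
    // width exceeds the remaining bytes, or -1 if no such byte exists.
    static int findTruncatedLead(byte[] bytes, int offset, int numBytes) {
        int i = 0;
        while (i < numBytes) {
            int width = declaredWidth(bytes[offset + i]);
            int remaining = numBytes - i;
            if (width > remaining) {
                return i; // reading `width` bytes from here would leave the slice
            }
            i += width;
        }
        return -1;
    }
}
```

Any reader that takes `width` bytes at the returned index without clamping to `numBytes - i` would step outside the slice, which is the condition the review comment describes.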
do we need to backport to 4.2? this is a data corruption issue, cc @huaxingao @dongjoon-hyun
I think we should

### What changes were proposed in this pull request?

`UTF8String.reverse()` reverses a string one UTF-8 character at a time, using `numBytesForFirstByte(getByte(i))` to determine the width of the character at byte position `i`. The per-character copy length was clamped to `numBytes` (the total length) instead of `numBytes - i` (the bytes that actually remain). When the last bytes of the string are a truncated multi-byte sequence (a leader byte whose declared width exceeds the remaining bytes), `copyMemory` read past the end of the string into adjacent memory. This PR changes the clamp to `numBytes - i`.

### Why are the changes needed?

`UTF8String` can hold malformed UTF-8 (for example, bytes produced by binary coercion or truncated input). For such a string ending in an incomplete multi-byte sequence, `reverse()` performed an out-of-bounds read and produced a wrong result. This PR clamps the per-character copy length to the bytes that actually remain, so the read stays in bounds. Well-formed UTF-8 is unaffected, since a complete sequence never exceeds the remaining bytes.

Note: other helpers that take a character width from `numBytesForFirstByte` without clamping (`codePointFrom`, and `trimLeft`/`trimRight` via `copyUTF8String`) can over-read on the same truncated input. This PR is scoped to `reverse()`; the siblings are tracked in SPARK-57520.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes incorrect results. The SQL `reverse()` function on a string value that contains malformed UTF-8 ending in a truncated multi-byte sequence no longer reads past the end of the value; only previously-incorrect results change.

### How was this patch tested?

Added cases to `UTF8StringSuite#reverse()` for truncated trailing 2-, 3-, and 4-byte leaders, and for a complete multi-byte character followed by an orphan leader. Each uses a sliced backing array with a trailing sentinel byte so the previous over-read produces a deterministically wrong value; the cases fail on the old code and pass with the fix.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)