[SPARK-57520][SQL] Fix UTF8String.codePointFrom and copyUTF8String reading past the end of a truncated trailing UTF-8 sequence by LuciferYang · Pull Request #56585 · apache/spark

LuciferYang · 2026-06-18T05:29:30Z

What changes were proposed in this pull request?

UTF8String.codePointFrom decodes a code point by reading the continuation bytes implied by numBytesForFirstByte(leader), and copyUTF8String copies end - start + 1 bytes. Neither bounds the read by the bytes that actually remain, so when a string ends in a truncated multi-byte sequence (a leader byte whose declared width exceeds the remaining bytes), both read past the end of the backing memory. trimLeft/trimRight build their search character through copyUTF8String, so they over-read too.

This PR:

codePointFrom reads continuation bytes through a small continuationByte helper that returns 0 once the index passes the end of the string.
copyUTF8String clamps the copy length to numBytes - start.
Once copyUTF8String stops over-reading, trimRight needs a matching accounting fix: it decremented trimEnd by the leader's declared width, which overshoots a truncated trailing character, so it now uses the actual (clamped) byte count, as trimLeft already does.
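The two bounds checks described above can be illustrated with a minimal sketch. It works on a plain byte[] rather than Spark's base/offset memory model, and the class name, the `end` parameter, `declaredWidth`, and `copyClamped` are hypothetical stand-ins for illustration, not the code in this PR. The width table is also simplified.

```java
import java.util.Arrays;

// Illustrative sketch only: models a string as buf[0, end) inside a possibly larger array.
final class BoundedUtf8Sketch {
  // Simplified declared width of a sequence, taken from its leader byte.
  // Assumption: anything that is not a valid leader counts as width 1.
  static int declaredWidth(byte leader) {
    int b = leader & 0xFF;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 1;
  }

  // Payload bits of a continuation byte, or 0 once the index passes the end of the string.
  static int continuationByte(byte[] buf, int index, int end) {
    return index < end ? (buf[index] & 0x3F) : 0;
  }

  // Decodes the code point whose leader sits at `index`, never reading at or past `end`.
  static int codePointFrom(byte[] buf, int index, int end) {
    if (index < 0 || index >= end) {
      throw new IndexOutOfBoundsException("index " + index);
    }
    int b = buf[index] & 0xFF;
    switch (declaredWidth(buf[index])) {
      case 2:
        return ((b & 0x1F) << 6)
            | continuationByte(buf, index + 1, end);
      case 3:
        return ((b & 0x0F) << 12)
            | (continuationByte(buf, index + 1, end) << 6)
            | continuationByte(buf, index + 2, end);
      case 4:
        return ((b & 0x07) << 18)
            | (continuationByte(buf, index + 1, end) << 12)
            | (continuationByte(buf, index + 2, end) << 6)
            | continuationByte(buf, index + 3, end);
      default:
        return b;
    }
  }

  // Copies [start, endInclusive], clamping the length to the numBytes - start bytes that remain.
  static byte[] copyClamped(byte[] buf, int start, int endInclusive, int numBytes) {
    int len = Math.min(endInclusive - start + 1, numBytes - start);
    return Arrays.copyOfRange(buf, start, start + len);
  }
}
```

In this sketch, a string whose three logical bytes are 'a', 0xE4, 0xB8 decodes the truncated 3-byte sequence at index 1 as 0x4E00, because the missing third byte contributes 0 instead of whatever byte follows in the backing array.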

Why are the changes needed?

UTF8String can hold malformed UTF-8 (for example, bytes from binary coercion or truncated input). For a string ending in an incomplete multi-byte sequence, these methods read out of bounds and produced wrong results: codePointFrom assembled a code point from adjacent memory, and trimRight could drop valid preceding characters. Well-formed UTF-8 is unaffected, since a complete sequence never exceeds the remaining bytes.
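How the trimRight accounting can drop valid characters is easiest to see in a simplified model. The sketch below is an assumption-laden illustration on a plain byte[], not Spark's implementation: the class, `trimRightLength`, the single-character trim set, and the `clamp` flag (which switches between declared and actual byte counts) are all hypothetical.

```java
import java.util.Arrays;

// Illustrative model of right-trimming one "character" (byte sequence) from buf[0, numBytes).
final class TrimRightAccountingSketch {
  // Simplified declared width from the leader byte (assumption: invalid leaders count as 1).
  static int declaredWidth(byte leader) {
    int b = leader & 0xFF;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 1;
  }

  // Returns how many bytes remain after trimming trailing occurrences of trimChar.
  // clamp = true subtracts the bytes actually present; false subtracts the declared width.
  static int trimRightLength(byte[] buf, int numBytes, byte[] trimChar, boolean clamp) {
    int[] pos = new int[numBytes];
    int[] declared = new int[numBytes];
    int numChars = 0;
    // Forward pass: record where each character starts and the width its leader declares.
    for (int i = 0; i < numBytes; ) {
      pos[numChars] = i;
      declared[numChars] = declaredWidth(buf[i]);
      i += declared[numChars];
      numChars++;
    }
    int trimEnd = numBytes - 1;
    // Backward pass: peel matching characters off the end.
    while (numChars > 0) {
      int start = pos[numChars - 1];
      int actual = Math.min(declared[numChars - 1], numBytes - start);
      byte[] ch = Arrays.copyOfRange(buf, start, start + actual);
      if (!Arrays.equals(ch, trimChar)) break;
      trimEnd -= clamp ? actual : declared[numChars - 1];
      numChars--;
    }
    return Math.max(trimEnd + 1, 0);
  }
}
```

For the bytes 'a', 'b', 0xE4 with the lone 0xE4 as the trim character, the clamped accounting keeps 2 bytes ("ab"), while subtracting the declared width of 3 leaves 0 bytes and discards the valid prefix.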

This is a follow-up to SPARK-57507, which fixed the same kind of over-read in reverse().

Does this PR introduce any user-facing change?

Yes, it fixes incorrect results on malformed input. String operations that reach these methods (such as trimming or code-point access) no longer read past the end of a value that ends in a truncated multi-byte sequence; only previously-incorrect results change. Well-formed strings behave exactly as before.

How was this patch tested?

Added cases to UTF8StringSuite:

testCodePointFrom: truncated trailing 2-, 3-, and 4-byte leaders, including a 4-byte leader with only the last continuation byte missing.
copyUTF8StringClampsToRemainingBytes: an end one past the last byte, with a non-zero start so the clamp must use numBytes - start.
trimTruncatedTrailingSequence: trimming a truncated trailing leader keeps the valid preceding character.

Each uses a sliced backing array with a trailing sentinel byte, so the previous over-read produces a deterministically wrong value; the cases fail on the old code and pass with the fix. build/sbt 'unsafe/testOnly *UTF8StringSuite' passes (51 tests).
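The sentinel technique can be sketched independently of UTF8StringSuite. This is a hypothetical illustration, not the test code added in this PR: the logical string is a prefix of a larger backing array, and the byte just past the logical end is a known sentinel, so an unbounded read yields a predictable wrong value instead of depending on whatever happens to sit in adjacent memory.

```java
// Illustrative sketch: decode a 3-byte sequence with and without a bound on the logical end.
final class SentinelSliceSketch {
  // Reads the payload bits at index i; when bounded, bytes at or past logicalEnd read as 0.
  static int read(byte[] backing, int i, int logicalEnd, boolean bounded) {
    if (bounded && i >= logicalEnd) return 0;
    return backing[i] & 0x3F;
  }

  // Assembles a code point from a 3-byte leader at `index` and its two continuation bytes.
  static int decode3(byte[] backing, int index, int logicalEnd, boolean bounded) {
    int lead = backing[index] & 0x0F;
    int c1 = read(backing, index + 1, logicalEnd, bounded);
    int c2 = read(backing, index + 2, logicalEnd, bounded);
    return (lead << 12) | (c1 << 6) | c2;
  }
}
```

With a backing array of 0xE4, 0xB8, 0xBF and a logical length of 2, the bounded decode returns 0x4E00, while the unbounded decode folds the sentinel's low six bits in and returns 0x4E3F, so an assertion on the result separates the two behaviors deterministically.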

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

…ading past the end of a truncated trailing UTF-8 sequence

`codePointFrom` read the declared number of continuation bytes (from `numBytesForFirstByte`) without checking they exist, and `copyUTF8String` copied `end - start + 1` bytes without clamping to what remains. When a string ends in a truncated multi-byte sequence (a leader byte whose width exceeds the remaining bytes), both read past the end of the backing memory. `trimLeft`/`trimRight` build their search character through `copyUTF8String`, so they over-read too.

`codePointFrom` now reads continuation bytes through a helper that returns 0 past the end, and `copyUTF8String` clamps the copy length to the bytes that remain.

Once `copyUTF8String` stops over-reading, `trimRight` needs a matching accounting fix: it decremented `trimEnd` by the declared character width, which overshoots for a truncated trailing character, so it now uses the actual (clamped) byte count, as `trimLeft` already does.

Well-formed UTF-8 is unaffected. Follow-up of SPARK-57507.

LuciferYang · 2026-06-18T05:33:33Z

cc @cloud-fan

MaxGekk · 2026-06-18T16:15:36Z

cc @uros-b

LuciferYang wants to merge 1 commit into apache:master from LuciferYang:SPARK-57520-utf8-overread
