Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -1157,7 +1157,7 @@ public UTF8String reverse() {

int i = 0; // position in byte
while (i < numBytes) {
int len = Math.min(numBytesForFirstByte(getByte(i)), numBytes);
int len = Math.min(numBytesForFirstByte(getByte(i)), numBytes - i);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix here is correct. One note on the rationale: the PR description's claim that "every other byte scan in the class already bounds by the remaining bytes" isn't quite accurate. trimLeft (L1052), trimRight (L1124/L1135, via copyUTF8String), and codePointFrom (L736-742) all derive a per-character width from numBytesForFirstByte(...) and read that many bytes without clamping to the remaining bytes — copyUTF8String does an unclamped copyMemory (L944-946) and getByte is an unchecked Platform.getByte. So on the same truncated-trailing-sequence input that motivates this PR, those peers over-read identically. Worth softening the description, or filing a follow-up for the siblings (cf. @uros-b's note about similar issues across the codebase). Not blocking — the reverse() change itself is clean and well-tested.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cloud-fan, good catch. That sentence was inaccurate: codePointFrom, trimLeft, and trimRight (the latter two via copyUTF8String) all take the width from numBytesForFirstByte without clamping to the remaining bytes, so they over-read on the same truncated-trailing-sequence input. I've updated the description to drop that claim and keep this PR scoped to reverse(), and filed SPARK-57520 to handle the siblings (and the unclamped copyUTF8String).

int targetOffset = Math.max(result.length - i - len, 0);
copyMemory(this.base, this.offset + i, result,
BYTE_ARRAY_OFFSET + targetOffset, len);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -325,6 +325,30 @@ public void reverse() {
assertEquals(EMPTY_UTF8, EMPTY_UTF8.reverse());
assertEquals(fromString("者行孙"), fromString("孙行者").reverse());
assertEquals(fromString("者行孙 olleh"), fromString("hello 孙行者").reverse());
// Malformed UTF-8: a truncated trailing multi-byte sequence must be reversed as orphan
// bytes without reading past the end of the string. The backing arrays carry an extra
// trailing byte so a regression that over-reads would produce a deterministically wrong
// result rather than reading uninitialized memory.
// 'A' followed by an incomplete 2-byte leader (0xCE).
byte[] truncated2 = new byte[]{0x41, (byte) 0xCE, 0x42};
assertEquals(
fromBytes(new byte[]{(byte) 0xCE, 0x41}),
fromBytes(truncated2, 0, 2).reverse());
// 'A' followed by an incomplete 3-byte leader (0xE4 0xB8).
byte[] truncated3 = new byte[]{0x41, (byte) 0xE4, (byte) 0xB8, 0x42};
assertEquals(
fromBytes(new byte[]{(byte) 0xE4, (byte) 0xB8, 0x41}),
fromBytes(truncated3, 0, 3).reverse());
// 'A' followed by an incomplete 4-byte leader (0xF0 0x90).
byte[] truncated4 = new byte[]{0x41, (byte) 0xF0, (byte) 0x90, 0x42};
assertEquals(
fromBytes(new byte[]{(byte) 0xF0, (byte) 0x90, 0x41}),
fromBytes(truncated4, 0, 3).reverse());
// A complete 3-byte character (U+4E16) followed by an incomplete 2-byte leader (0xCE).
byte[] truncatedMid = new byte[]{(byte) 0xE4, (byte) 0xB8, (byte) 0x96, (byte) 0xCE, 0x42};
assertEquals(
fromBytes(new byte[]{(byte) 0xCE, (byte) 0xE4, (byte) 0xB8, (byte) 0x96}),
fromBytes(truncatedMid, 0, 4).reverse());
}

@Test
Expand Down