Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -1519,7 +1519,7 @@ public static UTF8String trimRight(
if (charIndex == src.length()) {
return srcString;
}
if (lastNonSpacePosition == srcString.numChars()) {
if (lastNonSpacePosition == src.length()) {
return UTF8String.fromString(src.substring(0, charIndex));
}
return UTF8String.fromString(
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -3647,6 +3647,32 @@ public void testStringTrimRight() throws SparkException {
assertStringTrimRight(UTF8_LCASE, "𝔸", "a", "𝔸");
assertStringTrimRight(UNICODE, "𝔸", "a", "𝔸");
assertStringTrimRight(UNICODE_CI, "𝔸", "a", "");
// RTRIM-modifier collations ignore trailing spaces while matching the trim characters, then
// re-append them. The behaviour must agree across the UTF8_BINARY (binaryTrimRight),
// UTF8_LCASE (lowercaseTrimRight), and ICU (trimRight) paths. The supplementary-character
// cases below (trailing-space count == supplementary code-point count) regressed on the ICU
// path before SPARK-57506, which compared a Java-char index against a code-point count.
assertStringTrimRight("UTF8_BINARY_RTRIM", "x ", "x", " ");
assertStringTrimRight("UTF8_LCASE_RTRIM", "x ", "x", " ");
assertStringTrimRight("UNICODE_RTRIM", "x ", "x", " ");
assertStringTrimRight("UTF8_BINARY_RTRIM", " ", "x", " ");
assertStringTrimRight("UTF8_LCASE_RTRIM", " ", "x", " ");
assertStringTrimRight("UNICODE_RTRIM", " ", "x", " ");
assertStringTrimRight("UTF8_BINARY_RTRIM", "𝔸 ", "𝔸", " ");
assertStringTrimRight("UTF8_LCASE_RTRIM", "𝔸 ", "𝔸", " ");
assertStringTrimRight("UNICODE_RTRIM", "𝔸 ", "𝔸", " ");
assertStringTrimRight("UTF8_BINARY_RTRIM", "𝔸 ", "𝔸", " ");
assertStringTrimRight("UTF8_LCASE_RTRIM", "𝔸 ", "𝔸", " ");
assertStringTrimRight("UNICODE_RTRIM", "𝔸 ", "𝔸", " ");
assertStringTrimRight("UTF8_BINARY_RTRIM", "𝔸𝔸 ", "𝔸", " ");
assertStringTrimRight("UTF8_LCASE_RTRIM", "𝔸𝔸 ", "𝔸", " ");
assertStringTrimRight("UNICODE_RTRIM", "𝔸𝔸 ", "𝔸", " ");
// Case-folding interacts with space preservation per path: only UTF8_LCASE folds B to b, so
// only it trims the trailing 'B' and re-appends the space; binary and (case-sensitive) ICU
// leave the input unchanged. This exercises the lcase space-preservation branch on its own.
assertStringTrimRight("UTF8_BINARY_RTRIM", "xB ", "b", "xB ");
assertStringTrimRight("UTF8_LCASE_RTRIM", "xB ", "b", "x ");
assertStringTrimRight("UNICODE_RTRIM", "xB ", "b", "xB ");
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're already here, can we please add a few test cases with other RTRIM collations (e.g. UTF8_BINARY_RTRIM and UTF8_LCASE_RTRIM) - just to lock down the behaviour across the board.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — added UTF8_BINARY_RTRIM and UTF8_LCASE_RTRIM next to UNICODE_RTRIM for each input, grouped per input so the cross-collation agreement is visible. These two dispatch to binaryTrimRight / lowercaseTrimRight (not the ICU trimRight this PR fixes); both index by code point throughout, so they were already correct and the new cases pass. Also added a case-folding case (xB / b) where UTF8_LCASE diverges from binary/ICU, so the lcase path can't silently regress to the binary one.


/**
Expand Down