[SPARK-57506][SQL] Fix RTRIM-collation trimRight dropping trailing spaces for strings with supplementary characters#56567
Conversation
…aces for strings with supplementary characters ### What changes were proposed in this pull request? `CollationAwareUTF8String.trimRight` (the ICU path used by RTRIM-modifier collations) compared a Java-String (UTF-16) index against a Unicode code-point count: `lastNonSpacePosition == srcString.numChars()`. `lastNonSpacePosition` is a UTF-16 index into `src = srcString.toValidString()` (initialized to `src.length()` and decremented via `src.charAt`), so the sentinel must be compared against `src.length()`. This changes that one comparison to `src.length()`. ### Why are the changes needed? For RTRIM-style collations, trailing spaces are ignored while matching the trim characters but must be re-appended to the result. With the wrong unit, the "no trailing spaces were skipped" sentinel fired spuriously when the number of supplementary characters equaled the number of trailing spaces, dropping the preserved trailing spaces. For example, under UNICODE_RTRIM, right-trimming a supplementary character (such as U+1D538) followed by a single space returned an empty string instead of the preserved space. ### Does this PR introduce _any_ user-facing change? Yes, it fixes incorrect results. Under RTRIM-modifier collations, right-trim on a string that contains supplementary characters followed by trailing spaces (when the count of supplementary characters equals the count of trailing spaces) no longer drops the preserved trailing spaces. Only previously-incorrect results are affected. ### How was this patch tested? Added cases to `CollationSupportSuite#testStringTrimRight` for RTRIM-modifier collations with supplementary characters and trailing spaces: the two coincidence cases that previously returned the wrong value, plus a BMP control, a non-coincidence control, and the all-spaces early-return path. The coincidence cases fail on the old code and pass with the fix.
dongjoon-hyun
left a comment
There was a problem hiding this comment.
+1, LGTM.
Like you mentioned in the PR description, could you add a correctness label to the JIRA issue, @LuciferYang ?
Yes, it fixes incorrect results.
| assertStringTrimRight("UNICODE_RTRIM", "𝔸 ", "𝔸", " "); | ||
| assertStringTrimRight("UNICODE_RTRIM", "𝔸 ", "𝔸", " "); | ||
| assertStringTrimRight("UNICODE_RTRIM", "𝔸𝔸 ", "𝔸", " "); | ||
| } |
There was a problem hiding this comment.
While we're already here, can we please add a few test cases with other RTRIM collations (e.g. UTF8_BINARY_RTRIM and UTF8_LCASE_RTRIM) - just to lock down the behaviour across the board.
There was a problem hiding this comment.
Done — added UTF8_BINARY_RTRIM and UTF8_LCASE_RTRIM next to UNICODE_RTRIM for each input, grouped per input so the cross-collation agreement is visible. These two dispatch to binaryTrimRight / lowercaseTrimRight (not the ICU trimRight this PR fixes); both index by code point throughout, so they were already correct and the new cases pass. Also added a case-folding case (xB / b) where UTF8_LCASE diverges from binary/ICU, so the lcase path can't silently regress to the binary one.
uros-b
left a comment
There was a problem hiding this comment.
Thank you for working on this @LuciferYang, since this is a correctness fix - I think we should consider backporting accordingly. Also, adding @cloud-fan @mkaravel @srielau to take a look at these changes.
cloud-fan
left a comment
There was a problem hiding this comment.
0 blocking, 0 non-blocking, 0 nits.
Clean, surgical correctness fix. LGTM.
The sentinel at CollationAwareUTF8String.java:1522 compared the UTF-16 index lastNonSpacePosition against the code-point count srcString.numChars(); changing it to src.length() puts both sides in the same index space and matches the charIndex == src.length() check directly above.
Verification
Traced regression safety and fix completeness. Regression safety: in the "no spaces skipped" state (lastNonSpacePosition == src.length()) the old code reached the same result via appending src.substring(src.length()) (empty), so only the previously-incorrect coincidence cases change — consistent with the user-facing-change claim. Fix completeness: the two sibling variants binaryTrimRight (1325/1328) and lowercaseTrimRight (1425/1428) index entirely in code-point space and compare against numChars(), so they neither share this bug nor need the same change. Test coverage is well-targeted: the new 𝔸 /𝔸𝔸 coincidence cases exercise the fixed false-branch, and the pre-existing UNICODE_CI, "𝔸", "a", "" case covers its true-branch.
+1 to @uros-b's request to add parallel UTF8_BINARY_RTRIM / UTF8_LCASE_RTRIM cases — those paths don't share this bug, so it's defensive coverage, but worth locking down across the board.
…M in testStringTrimRight Per review feedback, add the UTF8_BINARY_RTRIM and UTF8_LCASE_RTRIM RTRIM collations alongside UNICODE_RTRIM for each input, grouped per input. These dispatch to binaryTrimRight / lowercaseTrimRight, which index by code point throughout and were already correct. Also add a case-folding case where UTF8_LCASE diverges from binary/ICU, so the lcase path can't silently regress.
done |
What changes were proposed in this pull request?
CollationAwareUTF8String.trimRight— the ICU path used by RTRIM-modifier collations — compared a Java-String (UTF-16) index against a Unicode code-point count:lastNonSpacePosition == srcString.numChars().lastNonSpacePositionis a UTF-16 index intosrc = srcString.toValidString()(it is initialized tosrc.length()and decremented viasrc.charAt), so the sentinel must be compared againstsrc.length(), matching thecharIndex == src.length()check immediately above it. This PR changes that one comparison tosrc.length().Why are the changes needed?
For RTRIM-style collations, trailing spaces are ignored while matching the trim characters but must be re-appended to the result. With the code-point count, the "no trailing spaces were skipped" sentinel fired spuriously whenever the number of supplementary characters equaled the number of trailing spaces, so the preserved trailing spaces were dropped. For example, under
UNICODE_RTRIM, right-trimming a supplementary character (such as U+1D538) followed by a single space returned an empty string instead of a single space.Does this PR introduce any user-facing change?
Yes, it fixes incorrect results. Under RTRIM-modifier collations, right-trim on a string containing supplementary characters followed by trailing spaces (when the count of supplementary characters equals the count of trailing spaces) no longer drops the preserved trailing spaces. Only previously-incorrect results change.
How was this patch tested?
Added cases to
CollationSupportSuite#testStringTrimRightcovering RTRIM-modifier collations across all three trim paths —UTF8_BINARY_RTRIM,UTF8_LCASE_RTRIM, andUNICODE_RTRIM. For each input: supplementary characters with trailing spaces (including the two coincidence cases that previously returned the wrong value on the ICU path), a BMP control, a non-coincidence control, and the all-spaces early-return path. A separate case-folding case (xB/b) confirms onlyUTF8_LCASEtrims the trailing letter while preserving the space, pinning that path independently. The coincidence cases fail on the old code and pass with the fix.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.8)