[SPARK-57506][SQL] Fix RTRIM-collation trimRight dropping trailing spaces for strings with supplementary characters by LuciferYang · Pull Request #56567 · apache/spark

LuciferYang · 2026-06-17T14:00:59Z

What changes were proposed in this pull request?

CollationAwareUTF8String.trimRight — the ICU path used by RTRIM-modifier collations — compared a Java-String (UTF-16) index against a Unicode code-point count: lastNonSpacePosition == srcString.numChars(). lastNonSpacePosition is a UTF-16 index into src = srcString.toValidString() (it is initialized to src.length() and decremented via src.charAt), so the sentinel must be compared against src.length(), matching the charIndex == src.length() check immediately above it. This PR changes that one comparison to src.length().
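
To make the two index units concrete, here is a minimal standalone sketch (not the Spark source). It assumes plain `java.lang.String` in place of `srcString.toValidString()`, and uses `String.codePointCount` as a stand-in for `numChars()`. The class and method names are hypothetical.

```java
final class TrimIndexUnits {
    /** UTF-16 index just past the last non-space char, found by walking back with charAt. */
    static int lastNonSpacePosition(String src) {
        int pos = src.length();
        while (pos > 0 && src.charAt(pos - 1) == ' ') {
            pos--;
        }
        return pos;
    }

    /** Code-point count; an assumed stand-in for what the description calls numChars(). */
    static int codePointCount(String src) {
        return src.codePointCount(0, src.length());
    }

    public static void main(String[] args) {
        // U+1D538 is one code point but two UTF-16 code units (a surrogate pair).
        String a = new String(Character.toChars(0x1D538));
        String src = a + " ";
        System.out.println(src.length());              // 3 (UTF-16 units)
        System.out.println(codePointCount(src));       // 2 (code points)
        System.out.println(lastNonSpacePosition(src)); // 2 (UTF-16 index)
    }
}
```

Here the UTF-16 index (2) happens to equal the code-point count (2) even though one trailing space was skipped, which is the mismatch the one-line change removes.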

Why are the changes needed?

For RTRIM-style collations, trailing spaces are ignored while matching the trim characters but must be re-appended to the result. With the code-point count, the "no trailing spaces were skipped" sentinel fired spuriously whenever the number of supplementary characters equaled the number of trailing spaces, so the preserved trailing spaces were dropped. For example, under UNICODE_RTRIM, right-trimming a supplementary character (such as U+1D538) followed by a single space returned an empty string instead of a single space.
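
The coincidence can be reproduced with a simplified model of the re-append step. This is a sketch under stated assumptions, not the actual ICU path: it matches trim characters by exact code point rather than through a collator, and the `compareAgainstCodePointCount` flag is a hypothetical switch between the old and the fixed sentinel.

```java
final class RtrimSentinelSketch {
    /** Simplified stand-in for an RTRIM-aware right trim; not Spark's implementation. */
    static String trimRight(String src, String trimChars, boolean compareAgainstCodePointCount) {
        // 1. Skip trailing spaces; this is a UTF-16 index into src.
        int lastNonSpacePosition = src.length();
        while (lastNonSpacePosition > 0 && src.charAt(lastNonSpacePosition - 1) == ' ') {
            lastNonSpacePosition--;
        }
        // 2. Trim trailing code points that appear in trimChars.
        int end = lastNonSpacePosition;
        while (end > 0) {
            int cp = src.codePointBefore(end);
            if (trimChars.indexOf(cp) < 0) {
                break;
            }
            end -= Character.charCount(cp);
        }
        String trimmed = src.substring(0, end);
        // 3. Sentinel: "no trailing spaces were skipped".
        int sentinel = compareAgainstCodePointCount
            ? src.codePointCount(0, src.length()) // old comparison (wrong unit)
            : src.length();                       // fixed comparison (same unit)
        if (lastNonSpacePosition == sentinel) {
            return trimmed;
        }
        // Re-append the preserved trailing spaces.
        return trimmed + src.substring(lastNonSpacePosition);
    }

    public static void main(String[] args) {
        String a = new String(Character.toChars(0x1D538));
        System.out.println("[" + trimRight(a + " ", a, true) + "]");  // [] (space dropped)
        System.out.println("[" + trimRight(a + " ", a, false) + "]"); // [ ] (space preserved)
    }
}
```

In this model the old comparison misfires exactly when the number of supplementary characters equals the number of trailing spaces, because the UTF-16 length minus the trailing spaces then equals the UTF-16 length minus the surrogate surplus.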

Does this PR introduce any user-facing change?

Yes, it fixes incorrect results. Under RTRIM-modifier collations, right-trim on a string containing supplementary characters followed by trailing spaces (when the count of supplementary characters equals the count of trailing spaces) no longer drops the preserved trailing spaces. Only previously-incorrect results change.

How was this patch tested?

Added cases to CollationSupportSuite#testStringTrimRight covering RTRIM-modifier collations across all three trim paths: UTF8_BINARY_RTRIM, UTF8_LCASE_RTRIM, and UNICODE_RTRIM. Each input is run under all three collations. The inputs cover supplementary characters with trailing spaces (including the two coincidence cases that previously returned the wrong value on the ICU path), a BMP control, a non-coincidence control, and the all-spaces early-return path. A separate case-folding case (xB/b) confirms that only UTF8_LCASE trims the trailing letter while preserving the space, pinning that path independently. The coincidence cases fail on the old code and pass with the fix.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

…aces for strings with supplementary characters

### What changes were proposed in this pull request?

`CollationAwareUTF8String.trimRight` (the ICU path used by RTRIM-modifier collations) compared a Java-String (UTF-16) index against a Unicode code-point count: `lastNonSpacePosition == srcString.numChars()`. `lastNonSpacePosition` is a UTF-16 index into `src = srcString.toValidString()` (initialized to `src.length()` and decremented via `src.charAt`), so the sentinel must be compared against `src.length()`. This changes that one comparison to `src.length()`.

### Why are the changes needed?

For RTRIM-style collations, trailing spaces are ignored while matching the trim characters but must be re-appended to the result. With the wrong unit, the "no trailing spaces were skipped" sentinel fired spuriously when the number of supplementary characters equaled the number of trailing spaces, dropping the preserved trailing spaces. For example, under UNICODE_RTRIM, right-trimming a supplementary character (such as U+1D538) followed by a single space returned an empty string instead of the preserved space.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes incorrect results. Under RTRIM-modifier collations, right-trim on a string that contains supplementary characters followed by trailing spaces (when the count of supplementary characters equals the count of trailing spaces) no longer drops the preserved trailing spaces. Only previously-incorrect results are affected.

### How was this patch tested?

Added cases to `CollationSupportSuite#testStringTrimRight` for RTRIM-modifier collations with supplementary characters and trailing spaces: the two coincidence cases that previously returned the wrong value, plus a BMP control, a non-coincidence control, and the all-spaces early-return path. The coincidence cases fail on the old code and pass with the fix.

dongjoon-hyun

+1, LGTM.

As you mentioned in the PR description ("Yes, it fixes incorrect results."), could you add a correctness label to the JIRA issue, @LuciferYang?

uros-b · 2026-06-17T16:43:22Z

+    assertStringTrimRight("UNICODE_RTRIM", "𝔸 ", "𝔸", " ");
+    assertStringTrimRight("UNICODE_RTRIM", "𝔸  ", "𝔸", "  ");
+    assertStringTrimRight("UNICODE_RTRIM", "𝔸𝔸  ", "𝔸", "  ");
  }


While we're already here, can we please add a few test cases with other RTRIM collations (e.g. UTF8_BINARY_RTRIM and UTF8_LCASE_RTRIM) - just to lock down the behaviour across the board.

Done — added UTF8_BINARY_RTRIM and UTF8_LCASE_RTRIM next to UNICODE_RTRIM for each input, grouped per input so the cross-collation agreement is visible. These two dispatch to binaryTrimRight / lowercaseTrimRight (not the ICU trimRight this PR fixes); both index by code point throughout, so they were already correct and the new cases pass. Also added a case-folding case (xB / b) where UTF8_LCASE diverges from binary/ICU, so the lcase path can't silently regress to the binary one.

uros-b

Thank you for working on this @LuciferYang, since this is a correctness fix - I think we should consider backporting accordingly. Also, adding @cloud-fan @mkaravel @srielau to take a look at these changes.

cloud-fan

0 blocking, 0 non-blocking, 0 nits.
Clean, surgical correctness fix. LGTM.

The sentinel at CollationAwareUTF8String.java:1522 compared the UTF-16 index lastNonSpacePosition against the code-point count srcString.numChars(); changing it to src.length() puts both sides in the same index space and matches the charIndex == src.length() check directly above.

Verification

Traced regression safety and fix completeness. Regression safety: in the "no spaces skipped" state (lastNonSpacePosition == src.length()) the old code reached the same result by appending src.substring(src.length()) (empty), so only the previously-incorrect coincidence cases change, consistent with the user-facing-change claim. Fix completeness: the two sibling variants binaryTrimRight (1325/1328) and lowercaseTrimRight (1425/1428) index entirely in code-point space and compare against numChars(), so they neither share this bug nor need the same change. Test coverage is well-targeted: the new "𝔸 " / "𝔸𝔸  " coincidence cases exercise the fixed false-branch, and the pre-existing UNICODE_CI, "𝔸", "a", "" case covers its true-branch.

+1 to @uros-b's request to add parallel UTF8_BINARY_RTRIM / UTF8_LCASE_RTRIM cases — those paths don't share this bug, so it's defensive coverage, but worth locking down across the board.
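
The regression-safety argument in the review above can be checked with a small sketch. It assumes the final step has the shape "early return on the sentinel, otherwise append `src.substring(lastNonSpacePosition)`", as described in this thread. The `finish` helper and its parameters are hypothetical and are not taken from the Spark source.

```java
final class SentinelRegressionCheck {
    /** Hypothetical model of the last step: early return on the sentinel, else re-append. */
    static String finish(String trimmed, String src, int lastNonSpacePosition, int sentinel) {
        if (lastNonSpacePosition == sentinel) {
            return trimmed; // nothing was skipped, nothing to re-append
        }
        return trimmed + src.substring(lastNonSpacePosition);
    }

    public static void main(String[] args) {
        String a = new String(Character.toChars(0x1D538));
        String src = "x" + a;   // supplementary character, no trailing spaces
        int pos = src.length(); // "no spaces skipped" state: pos == src.length() == 3

        // Old sentinel (code-point count, 2) misses the early return, but the
        // fall-through appends src.substring(3), which is the empty string.
        String oldResult = finish(src, src, pos, src.codePointCount(0, src.length()));
        // Fixed sentinel (UTF-16 length, 3) takes the early return.
        String newResult = finish(src, src, pos, src.length());

        System.out.println(oldResult.equals(newResult)); // true
    }
}
```

In this model both sentinels give the same answer whenever no trailing spaces were skipped, so the behavior only differs in the coincidence cases.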

…M in testStringTrimRight

Per review feedback, add the UTF8_BINARY_RTRIM and UTF8_LCASE_RTRIM collations alongside UNICODE_RTRIM for each input, grouped per input. These dispatch to binaryTrimRight / lowercaseTrimRight, which index by code point throughout and were already correct. Also add a case-folding case where UTF8_LCASE diverges from binary/ICU, so the lcase path can't silently regress.

LuciferYang · 2026-06-18T06:45:17Z

Done, added the correctness label to the JIRA issue.

dongjoon-hyun approved these changes Jun 17, 2026

uros-b reviewed Jun 17, 2026

uros-b approved these changes Jun 17, 2026

cloud-fan approved these changes Jun 17, 2026

HyukjinKwon approved these changes Jun 17, 2026

LuciferYang wants to merge 2 commits into apache:master from LuciferYang:collation-trimright-rtrim-fix
