Skip to content

[SPARK-57506][SQL] Fix RTRIM-collation trimRight dropping trailing spaces for strings with supplementary characters#56567

Open
LuciferYang wants to merge 2 commits into
apache:masterfrom
LuciferYang:collation-trimright-rtrim-fix
Open

[SPARK-57506][SQL] Fix RTRIM-collation trimRight dropping trailing spaces for strings with supplementary characters#56567
LuciferYang wants to merge 2 commits into
apache:masterfrom
LuciferYang:collation-trimright-rtrim-fix

Conversation

@LuciferYang

@LuciferYang LuciferYang commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

CollationAwareUTF8String.trimRight — the ICU path used by RTRIM-modifier collations — compared a Java-String (UTF-16) index against a Unicode code-point count: lastNonSpacePosition == srcString.numChars(). lastNonSpacePosition is a UTF-16 index into src = srcString.toValidString() (it is initialized to src.length() and decremented via src.charAt), so the sentinel must be compared against src.length(), matching the charIndex == src.length() check immediately above it. This PR changes that one comparison to src.length().

Why are the changes needed?

For RTRIM-style collations, trailing spaces are ignored while matching the trim characters but must be re-appended to the result. With the code-point count, the "no trailing spaces were skipped" sentinel fired spuriously whenever the number of supplementary characters equaled the number of trailing spaces, so the preserved trailing spaces were dropped. For example, under UNICODE_RTRIM, right-trimming a supplementary character (such as U+1D538) followed by a single space returned an empty string instead of a single space.

Does this PR introduce any user-facing change?

Yes, it fixes incorrect results. Under RTRIM-modifier collations, right-trim on a string containing supplementary characters followed by trailing spaces (when the count of supplementary characters equals the count of trailing spaces) no longer drops the preserved trailing spaces. Only previously-incorrect results change.

How was this patch tested?

Added cases to CollationSupportSuite#testStringTrimRight covering RTRIM-modifier collations across all three trim paths — UTF8_BINARY_RTRIM, UTF8_LCASE_RTRIM, and UNICODE_RTRIM. For each input: supplementary characters with trailing spaces (including the two coincidence cases that previously returned the wrong value on the ICU path), a BMP control, a non-coincidence control, and the all-spaces early-return path. A separate case-folding case (xB/b) confirms only UTF8_LCASE trims the trailing letter while preserving the space, pinning that path independently. The coincidence cases fail on the old code and pass with the fix.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

…aces for strings with supplementary characters

### What changes were proposed in this pull request?

`CollationAwareUTF8String.trimRight` (the ICU path used by RTRIM-modifier
collations) compared a Java-String (UTF-16) index against a Unicode code-point
count: `lastNonSpacePosition == srcString.numChars()`. `lastNonSpacePosition` is a
UTF-16 index into `src = srcString.toValidString()` (initialized to `src.length()`
and decremented via `src.charAt`), so the sentinel must be compared against
`src.length()`. This changes that one comparison to `src.length()`.

### Why are the changes needed?

For RTRIM-style collations, trailing spaces are ignored while matching the trim
characters but must be re-appended to the result. With the wrong unit, the "no
trailing spaces were skipped" sentinel fired spuriously when the number of
supplementary characters equaled the number of trailing spaces, dropping the
preserved trailing spaces. For example, under UNICODE_RTRIM, right-trimming a
supplementary character (such as U+1D538) followed by a single space returned an
empty string instead of the preserved space.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes incorrect results. Under RTRIM-modifier collations, right-trim on a
string that contains supplementary characters followed by trailing spaces (when
the count of supplementary characters equals the count of trailing spaces) no
longer drops the preserved trailing spaces. Only previously-incorrect results are
affected.

### How was this patch tested?

Added cases to `CollationSupportSuite#testStringTrimRight` for RTRIM-modifier
collations with supplementary characters and trailing spaces: the two coincidence
cases that previously returned the wrong value, plus a BMP control, a
non-coincidence control, and the all-spaces early-return path. The coincidence
cases fail on the old code and pass with the fix.

@dongjoon-hyun dongjoon-hyun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

Like you mentioned in the PR description, could you add a correctness label to the JIRA issue, @LuciferYang ?

Yes, it fixes incorrect results.

assertStringTrimRight("UNICODE_RTRIM", "𝔸 ", "𝔸", " ");
assertStringTrimRight("UNICODE_RTRIM", "𝔸 ", "𝔸", " ");
assertStringTrimRight("UNICODE_RTRIM", "𝔸𝔸 ", "𝔸", " ");
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're already here, can we please add a few test cases with other RTRIM collations (e.g. UTF8_BINARY_RTRIM and UTF8_LCASE_RTRIM) - just to lock down the behaviour across the board.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — added UTF8_BINARY_RTRIM and UTF8_LCASE_RTRIM next to UNICODE_RTRIM for each input, grouped per input so the cross-collation agreement is visible. These two dispatch to binaryTrimRight / lowercaseTrimRight (not the ICU trimRight this PR fixes); both index by code point throughout, so they were already correct and the new cases pass. Also added a case-folding case (xB / b) where UTF8_LCASE diverges from binary/ICU, so the lcase path can't silently regress to the binary one.

@uros-b uros-b left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this @LuciferYang, since this is a correctness fix - I think we should consider backporting accordingly. Also, adding @cloud-fan @mkaravel @srielau to take a look at these changes.

@cloud-fan cloud-fan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 blocking, 0 non-blocking, 0 nits.
Clean, surgical correctness fix. LGTM.

The sentinel at CollationAwareUTF8String.java:1522 compared the UTF-16 index lastNonSpacePosition against the code-point count srcString.numChars(); changing it to src.length() puts both sides in the same index space and matches the charIndex == src.length() check directly above.

Verification

Traced regression safety and fix completeness. Regression safety: in the "no spaces skipped" state (lastNonSpacePosition == src.length()) the old code reached the same result via appending src.substring(src.length()) (empty), so only the previously-incorrect coincidence cases change — consistent with the user-facing-change claim. Fix completeness: the two sibling variants binaryTrimRight (1325/1328) and lowercaseTrimRight (1425/1428) index entirely in code-point space and compare against numChars(), so they neither share this bug nor need the same change. Test coverage is well-targeted: the new 𝔸 /𝔸𝔸 coincidence cases exercise the fixed false-branch, and the pre-existing UNICODE_CI, "𝔸", "a", "" case covers its true-branch.

+1 to @uros-b's request to add parallel UTF8_BINARY_RTRIM / UTF8_LCASE_RTRIM cases — those paths don't share this bug, so it's defensive coverage, but worth locking down across the board.

…M in testStringTrimRight

Per review feedback, add the UTF8_BINARY_RTRIM and UTF8_LCASE_RTRIM RTRIM
collations alongside UNICODE_RTRIM for each input, grouped per input. These
dispatch to binaryTrimRight / lowercaseTrimRight, which index by code point
throughout and were already correct. Also add a case-folding case where
UTF8_LCASE diverges from binary/ICU, so the lcase path can't silently regress.
@LuciferYang

Copy link
Copy Markdown
Contributor Author

+1, LGTM.

Like you mentioned in the PR description, could you add a correctness label to the JIRA issue, @LuciferYang ?

Yes, it fixes incorrect results.

done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants