Fix charset misdetection causing UnmappableCharacterException for CJK…#7651

Open
XiaoSK wants to merge 3 commits into
openrewrite:mainfrom
XiaoSK:bugfix/file-encoding-mismatch-causing-editing-failure

@XiaoSK XiaoSK commented May 11, 2026


When EncodingDetectingInputStream misdetects a file as Windows-1252 because of a stray invalid UTF-8 byte, even though the file actually contains CJK content in UTF-8, the decoded text contains U+FFFD characters (produced by byte values that Windows-1252 leaves undefined). U+FFFD cannot be re-encoded as Windows-1252, so writing the file back throws UnmappableCharacterException.
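
The failure described above can be reproduced with the JDK alone. The following is a minimal sketch, not code from this PR: the class `MisdetectionRepro` and its helper methods are hypothetical names, and it assumes the JDK's `windows-1252` charset decodes its undefined byte values (such as 0x81) to U+FFFD. The character U+7801 is used because its UTF-8 encoding (E7 A0 81) contains one of those bytes.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical reproduction class; not part of OpenRewrite.
class MisdetectionRepro {
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Decode bytes with the wrong charset, as a misdetection would.
    static String decodeAsWindows1252(byte[] utf8Bytes) {
        return new String(utf8Bytes, WINDOWS_1252);
    }

    // Returns true when a strict re-encode to Windows-1252 fails.
    // A fresh encoder reports unmappable characters by throwing
    // UnmappableCharacterException, a CharacterCodingException subclass.
    static boolean writeBackFails(String decoded) {
        try {
            WINDOWS_1252.newEncoder().encode(CharBuffer.wrap(decoded));
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // U+7801 encodes to E7 A0 81 in UTF-8; 0x81 is undefined in Windows-1252.
        byte[] utf8 = "\u7801".getBytes(StandardCharsets.UTF_8);
        String decoded = decodeAsWindows1252(utf8);
        System.out.println("contains U+FFFD: " + (decoded.indexOf('\uFFFD') >= 0));
        System.out.println("write-back fails: " + writeBackFails(decoded));
    }
}
```

Text that Windows-1252 can represent, such as "café", round-trips through the same helpers without error, which is why the problem only surfaces for files whose bytes hit the undefined positions.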

Add a canEncode check in readFully(): after decoding with the detected charset, if the result cannot be re-encoded with that charset, re-decode the bytes as UTF-8 instead. This catches misdetections at the decoding stage, where the raw bytes are still available, and because readFully() is the shared entry point, all parsers benefit automatically.
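
The check described above might look roughly like the following. This is a sketch under stated assumptions rather than the PR's actual diff: `CanEncodeFallbackSketch` and `decodeWithFallback` are hypothetical names, and the real `readFully()` works on a stream instead of a byte array. Only the standard `CharsetEncoder.canEncode(CharSequence)` call is taken as given.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the idea; not the actual readFully().
class CanEncodeFallbackSketch {

    // Decode with the detected charset, then verify the result can be
    // re-encoded with that same charset. If it cannot, assume the
    // detection was wrong and decode the raw bytes as UTF-8 instead.
    static String decodeWithFallback(byte[] bytes, Charset detected) {
        String decoded = new String(bytes, detected);
        // Some charsets are decode-only, so guard the newEncoder() call.
        if (detected.canEncode() && !detected.newEncoder().canEncode(decoded)) {
            return new String(bytes, StandardCharsets.UTF_8);
        }
        return decoded;
    }

    public static void main(String[] args) {
        Charset windows1252 = Charset.forName("windows-1252");
        // U+7F16 U+7801 in UTF-8 is E7 BC 96 E7 A0 81; 0x81 triggers the fallback.
        byte[] cjk = "\u7f16\u7801".getBytes(StandardCharsets.UTF_8);
        String result = decodeWithFallback(cjk, windows1252);
        System.out.println("recovered CJK text: " + result.equals("\u7f16\u7801"));
    }
}
```

One limitation of this kind of check: it only fires when the misdecoded text contains a byte that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D). UTF-8 content that avoids those bytes would still decode as mojibake without being caught here.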

… content (openrewrite#7636)

@timtebeek timtebeek self-requested a review June 10, 2026 13:05
Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

EncodingDetectingInputStream misdetects UTF-8 files with CJK content as Windows-1252, causing UnmappableCharacterException on write

2 participants