fix(REN-5): raise MTParseErrorInvalidCharacter for non-ASCII literal input#233
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new parse error, MTParseErrorInvalidCharacter, to detect and report unrecognized non-ASCII literal characters (such as π or ×) in MTMathListBuilder rather than silently ignoring them, and adds corresponding test cases. The reviewer identified a gap where non-ASCII characters are still silently ignored when they appear in positions processed by skipSpaces (which skips characters greater than 0x7E). To fully resolve this, the reviewer recommended updating skipSpaces and MTAssertNotSpace to prevent non-ASCII characters from being skipped, and suggested adding additional test cases to cover these scenarios.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if (ch > 0x7E) { | ||
| // Non-ASCII literal characters are not supported — report an error instead of | ||
| // silently dropping the character. Callers should use the corresponding LaTeX | ||
| // command (e.g. \pi instead of π, \times instead of ×). | ||
| [self setError:MTParseErrorInvalidCharacter | ||
| message:[NSString stringWithFormat:@"Unknown character U+%04X ('%C') is not a recognized LaTeX input character. Use the corresponding LaTeX command instead.", ch, ch]]; | ||
| return nil; | ||
| } |
There was a problem hiding this comment.
While this change successfully catches non-ASCII characters in the main parsing loop, there is a gap: non-ASCII characters will still be silently ignored if they appear in positions where skipSpaces is called (for example, before delimiters, environments, colors, or expected characters like { or }).
This is because skipSpaces (at line 581) still silently skips any character > 0x7E:
- (void) skipSpaces
{
while ([self hasCharacters]) {
unichar ch = [self getNextCharacter];
if (ch < 0x21 || ch > 0x7E) {
// skip non ascii characters and spaces
continue;
} ...To fully resolve REN-5 and prevent non-ASCII characters from being silently dropped in all contexts, skipSpaces should be updated to only skip ASCII control characters and spaces (i.e., ch < 0x21). Additionally, the MTAssertNotSpace macro (at line 595) should be updated to not assert ch <= 0x7E, as non-ASCII characters are not spaces and should be allowed to propagate to the parser's validation logic:
- (void) skipSpaces
{
while ([self hasCharacters]) {
unichar ch = [self getNextCharacter];
if (ch < 0x21) {
// skip ascii control characters and spaces
continue;
} else {
[self unlookCharacter];
return;
}
}
}
#define MTAssertNotSpace(ch) NSAssert((ch) >= 0x21, @"Expected non space character %C", (ch));| // REN-5: non-ASCII literal characters should produce MTParseErrorInvalidCharacter | ||
| @[@"π", @(MTParseErrorInvalidCharacter)], // π (U+03C0) | ||
| @[@"3 × 4", @(MTParseErrorInvalidCharacter)], // 3 × 4 | ||
| @[@"x ≤ y", @(MTParseErrorInvalidCharacter)], // x ≤ y |
There was a problem hiding this comment.
Once the skipSpaces issue is resolved, it would be highly beneficial to add test cases that verify non-ASCII characters are not silently ignored when they appear before delimiters or environments. For example:
@[@"\\left π (", @(MTParseErrorInvalidCharacter)],
@[@"\\begin π {matrix}", @(MTParseErrorInvalidCharacter)],Currently, these cases will either parse successfully (silently dropping the character) or fail with a different error because skipSpaces silently consumes the non-ASCII character.
|
EM-REVIEW v1 Verdict: No blocking issues. Reviewed independently against the diff and surrounding source. Recommend the existing fix; do not approve/merge here. REN-5 fix (non-ASCII literals were silently dropped):
The error message also helpfully points users at the LaTeX-command alternative (\pi, \times). LGTM. |
…input Previously, non-ASCII characters (e.g. π, ×, ≤) typed directly into a LaTeX string were silently dropped by the builder — the rendered output was missing the character and the caller's error: out-param stayed nil. Adds MTParseErrorInvalidCharacter to the MTParseErrors enum and replaces the silent continue in MTMathListBuilder.m with a setError:/return nil call (scoped to ch > 0x7E) that matches the existing error-reporting model used by MTParseErrorInvalidCommand. ASCII specials (space, $, %, etc.) continue to be silently ignored as before. Closes REN-5. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extends the non-ASCII handling to cover all special characters, matching real LaTeX behaviour in math mode: - ~ renders as a space (LaTeX non-breaking space). - % # $ now raise MTParseErrorInvalidCharacter (a comment, a macro parameter, and a math-mode toggle respectively - none valid here) instead of being silently dropped. - Whitespace (space, tab, newline, CR) remains silently ignored. - Non-ASCII literals continue to raise MTParseErrorInvalidCharacter. Adds builder/error test cases for ~ and for % # $. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cbef1d8 to
79a5f1e
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for the tilde (~) character as a non-breaking space and adds validation to trigger an MTParseErrorInvalidCharacter error when encountering invalid characters in math mode, such as non-ASCII literals or special characters. The review feedback points out a potential issue where formatting the invalid character in the error message using %C can result in unpaired surrogates if the character is a UTF-16 high surrogate, which could lead to crashes. A code suggestion is provided to safely handle surrogate pairs and reconstruct the full Unicode codepoint.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| [self setError:MTParseErrorInvalidCharacter | ||
| message:[NSString stringWithFormat:@"Unknown character U+%04X ('%C') is not a valid LaTeX input character in math mode. Use the corresponding LaTeX command instead.", ch, ch]]; | ||
| return nil; |
There was a problem hiding this comment.
When formatting the error message for an invalid character, if the character ch is a UTF-16 high surrogate (e.g., for characters outside the Basic Multilingual Plane like mathematical symbols), formatting it alone with %C creates an unpaired surrogate. This results in an invalid NSString which can cause crashes or rendering issues when displayed, logged, or serialized.
We should detect if ch is a high surrogate, consume the corresponding low surrogate to form a complete surrogate pair, and format the full Unicode codepoint and character string safely. If the next character is not a valid low surrogate, ensure we restore the lookahead character by calling unlookCharacter.
uint32_t codePoint = ch;
NSString *displayStr = [NSString stringWithFormat:@"%C", ch];
if (ch >= 0xD800 && ch <= 0xDBFF && [self hasCharacters]) {
unichar low = [self getNextCharacter];
if (low >= 0xDC00 && low <= 0xDFFF) {
codePoint = ((ch - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
displayStr = [NSString stringWithFormat:@"%C%C", ch, low];
} else {
[self unlookCharacter];
}
}
[self setError:MTParseErrorInvalidCharacter
message:[NSString stringWithFormat:@"Unknown character U+%04X ('%@') is not a valid LaTeX input character in math mode. Use the corresponding LaTeX command instead.", (unsigned int)codePoint, displayStr]];
return nil;References
- When parsing tokens, avoid silently consuming invalid or non-ASCII characters. Ensure that lookahead characters are properly restored (e.g., by calling
unlookCharacter) before breaking the parsing loop.
…ixes - Decode UTF-16 surrogate pairs in the MTParseErrorInvalidCharacter message so above-BMP literals (e.g. 𝑎) report the real scalar U+1D44E instead of a lone surrogate U+D835. - Silently ignore NUL (catcode 9), the one whitespace-like character TeX actually discards that we were missing. Keep form feed (\par) and vertical tab (catcode "other") as errors — they are not spaces in TeX. - Update stale atomForCharacter: comments that said these characters are "skipped"/"not supported"; the builder now decides (ignore vs. error). - Add tests: astral-literal error, surrogate-pair message decoding, and a whitespace/NUL silent-ignore regression guard. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The character is already known to be invalid, so the message doesn't need to render the glyph. Drop the surrogate-pair decoding (only needed for the glyph) and just report the UTF-16 code unit via %04X, which is plain integer formatting with no crash risk. An above-BMP character reports its leading surrogate, which is acceptable for an error string. Removes the now-obsolete surrogate-decoding message test; the astral-literal error-code case and the whitespace/NUL ignore test remain. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Addressed code review feedback in #2 — Surrogate-pair error message. Above-BMP literals (e.g. #3 — Stale factory comments. Updated both comments in #4 — TeX whitespace. Verified the catcodes rather than implementing the suggestion as-is: VT (0x0B) and FF (0x0C) are not spaces in TeX (FF is Tests added: astral-literal error case in the parse-error table, and
|
Summary
Fixes REN-5 (see issues.md#L152): non-ASCII literal characters typed directly into a LaTeX string (e.g.
π,×,≤) were silently dropped by the parser. The rendered output was missing the character and the caller'serror:out-param stayednil.MTParseErrorInvalidCharacterto theMTParseErrorsenum inMTMathListBuilder.h(appended at the end to keep existing raw values stable).continueinMTMathListBuilder.m(line ~352) with asetError:MTParseErrorInvalidCharacter/return nilcall, scoped toch > 0x7E(non-ASCII only). ASCII specials (space,$,%,#, etc.) that also returnednilfromatomForCharacter:continue to be silently ignored as before, preserving existing behaviour.\piinstead ofπ).MTParseErrorInvalidCommandfor unknown backslash commands — uses the identicalsetError:/return nilpattern.Test plan
getTestDataParseErrors()inMTMathListBuilderTest.massertingMTParseErrorInvalidCharacterforπ,3 × 4, andx ≤ yswift test: 292 tests, 0 failures after fixswift build: Build complete, no warnings\pi,\times,\leqcommand-based tests continue to pass unchanged🤖 Generated with Claude Code