
fix(REN-5): raise MTParseErrorInvalidCharacter for non-ASCII literal input #233

Merged
kostub merged 4 commits into master from em/2026-06-11-issues/t9 on Jun 28, 2026

@kostub (Owner) commented on Jun 12, 2026

Summary

Fixes REN-5 (see issues.md#L152): non-ASCII literal characters typed directly into a LaTeX string (e.g. π, ×, ≤) were silently dropped by the parser. The rendered output was missing the character, and the caller's error: out-param stayed nil.

  • Adds MTParseErrorInvalidCharacter to the MTParseErrors enum in MTMathListBuilder.h (appended at the end to keep existing raw values stable).
  • Replaces the silent continue in MTMathListBuilder.m (line ~352) with a setError:MTParseErrorInvalidCharacter / return nil call, scoped to ch > 0x7E (non-ASCII only). ASCII specials (space, $, %, #, etc.) that also returned nil from atomForCharacter: continue to be silently ignored as before, preserving existing behaviour.
  • The error message names the offending code point and suggests the corresponding LaTeX command (e.g. \pi instead of π).
  • Directly parallels MTParseErrorInvalidCommand for unknown backslash commands — uses the identical setError: / return nil pattern.
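
The guard described in the bullets above can be sketched in plain C. This is only an illustration of the `ch > 0x7E` scoping as the summary states it; the enum and function names are hypothetical and do not appear in iosMath.

```c
#include <stdint.h>

/* Hypothetical outcome for a UTF-16 code unit that produced no atom.
   These names are illustrative only and are not part of iosMath. */
typedef enum {
    CHAR_SILENTLY_IGNORED,   /* keeps the earlier "continue" behaviour */
    CHAR_INVALID_CHARACTER   /* would map to MTParseErrorInvalidCharacter */
} unhandled_action;

/* Guard as described in the summary: only code units above 0x7E are
   reported; everything at or below 0x7E stays silently ignored. */
static unhandled_action classify_unhandled(uint16_t ch) {
    return (ch > 0x7E) ? CHAR_INVALID_CHARACTER : CHAR_SILENTLY_IGNORED;
}
```

Under this sketch `classify_unhandled(0x03C0)` (π) reports an error while `classify_unhandled('$')` is still ignored. Note that the comparison also catches DEL (0x7F), which is ASCII but lies above 0x7E.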

Test plan

  • Added three table-driven entries to getTestDataParseErrors() in MTMathListBuilderTest.m asserting MTParseErrorInvalidCharacter for π, 3 × 4, and x ≤ y
  • Confirmed tests failed before implementation (list was non-nil, no error)
  • swift test: 292 tests, 0 failures after fix
  • swift build: Build complete, no warnings
  • Existing \pi, \times, \leq command-based tests continue to pass unchanged

🤖 Generated with Claude Code

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new parse error, MTParseErrorInvalidCharacter, to detect and report unrecognized non-ASCII literal characters (such as π or ×) in MTMathListBuilder rather than silently ignoring them, and adds corresponding test cases. The reviewer identified a gap where non-ASCII characters are still silently ignored when they appear in positions processed by skipSpaces (which skips characters greater than 0x7E). To fully resolve this, the reviewer recommended updating skipSpaces and MTAssertNotSpace to prevent non-ASCII characters from being skipped, and suggested adding additional test cases to cover these scenarios.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread iosMath/lib/MTMathListBuilder.m Outdated
Comment on lines +351 to +358
if (ch > 0x7E) {
    // Non-ASCII literal characters are not supported: report an error instead of
    // silently dropping the character. Callers should use the corresponding LaTeX
    // command (e.g. \pi instead of π, \times instead of ×).
    [self setError:MTParseErrorInvalidCharacter
           message:[NSString stringWithFormat:@"Unknown character U+%04X ('%C') is not a recognized LaTeX input character. Use the corresponding LaTeX command instead.", ch, ch]];
    return nil;
}

high

While this change successfully catches non-ASCII characters in the main parsing loop, there is a gap: non-ASCII characters will still be silently ignored if they appear in positions where skipSpaces is called (for example, before delimiters, environments, colors, or expected characters like { or }).

This is because skipSpaces (at line 581) still silently skips any character > 0x7E:

- (void) skipSpaces
{
    while ([self hasCharacters]) {
        unichar ch = [self getNextCharacter];
        if (ch < 0x21 || ch > 0x7E) {
            // skip non ascii characters and spaces
            continue;
        } ...

To fully resolve REN-5 and prevent non-ASCII characters from being silently dropped in all contexts, skipSpaces should be updated to only skip ASCII control characters and spaces (i.e., ch < 0x21). Additionally, the MTAssertNotSpace macro (at line 595) should be updated to not assert ch <= 0x7E, as non-ASCII characters are not spaces and should be allowed to propagate to the parser's validation logic:

- (void) skipSpaces
{
    while ([self hasCharacters]) {
        unichar ch = [self getNextCharacter];
        if (ch < 0x21) {
            // skip ascii control characters and spaces
            continue;
        } else {
            [self unlookCharacter];
            return;
        }
    }
}

#define MTAssertNotSpace(ch) NSAssert((ch) >= 0x21, @"Expected non space character %C", (ch));
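
To make the gap concrete, here is a small plain-C sketch that compares the two skipping rules on the same input. The `cursor` struct and both function names are hypothetical stand-ins for the builder's read position. The sketch peeks at the current code unit rather than reading it and calling unlookCharacter, so it mirrors the rule, not the exact control flow.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the builder's read position over UTF-16 input. */
typedef struct {
    const uint16_t *chars;
    size_t length;
    size_t index;
} cursor;

/* Rule before the suggestion: skip everything outside 0x21..0x7E,
   which includes non-ASCII code units. */
static void skip_spaces_old(cursor *c) {
    while (c->index < c->length) {
        uint16_t ch = c->chars[c->index];
        if (ch >= 0x21 && ch <= 0x7E) {
            return;             /* stop at the first printable ASCII unit */
        }
        c->index++;
    }
}

/* Suggested rule: skip only code units below 0x21, so a non-ASCII
   unit stays in the stream for the parser to validate. */
static void skip_spaces_new(cursor *c) {
    while (c->index < c->length && c->chars[c->index] < 0x21) {
        c->index++;
    }
}
```

For the input `' '`, `π`, `'('`, the old rule stops at `(` (index 2) and so consumes π, while the suggested rule stops at π (index 1).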

Comment on lines +1474 to +1477
// REN-5: non-ASCII literal characters should produce MTParseErrorInvalidCharacter
@[@"π", @(MTParseErrorInvalidCharacter)], // π (U+03C0)
@[@"3 × 4", @(MTParseErrorInvalidCharacter)], // 3 × 4
@[@"x ≤ y", @(MTParseErrorInvalidCharacter)], // x ≤ y

medium

Once the skipSpaces issue is resolved, it would be highly beneficial to add test cases that verify non-ASCII characters are not silently ignored when they appear before delimiters or environments. For example:

@[@"\\left π (", @(MTParseErrorInvalidCharacter)],
@[@"\\begin π {matrix}", @(MTParseErrorInvalidCharacter)],

Currently, these cases will either parse successfully (silently dropping the character) or fail with a different error because skipSpaces silently consumes the non-ASCII character.

@kostub (Owner, Author) commented on Jun 12, 2026

EM-REVIEW v1

Verdict: No blocking issues. Reviewed independently against the diff and surrounding source. Recommend the existing fix; do not approve/merge here.

REN-5 fix (non-ASCII literals were silently dropped):

  • Guard scope ch > 0x7E is correct. atomForCharacter: returns nil for three distinct cases: ch < 0x21 || ch > 0x7E (whitespace + non-ASCII), the ASCII special characters ($ % # & ~ ^ _ { } \ etc.), and a final special set. The PR narrows the new error to the non-ASCII literal subset (ch > 0x7E). Broadening to < 0x21 would regress whitespace handling (spaces and newlines must stay silently ignored), and ASCII specials intentionally keep their silent continue behavior.

  • Enum case appended, no renumbering. MTParseErrorInvalidCharacter is added to the end of the MTParseErrors NS_ENUM in MTMathListBuilder.h after MTParseErrorInvalidLimits, so existing error-code values are unchanged — no ABI/code-value breakage.

  • Error raised with no state leak. The fix replaces the silent continue (~MTMathListBuilder.m:352) with setError:MTParseErrorInvalidCharacter + return nil. return nil propagates out of buildInternal:; the top-level build returns nil whenever _error is set, and no partial atom is appended. Clean failure path.

  • Existing ASCII/space tests preserved. The narrowing leaves _spacesAllowed handling and the silent-ignore behavior for ASCII specials untouched; the existing passing tests continue to pass (292 total).

  • New tests are meaningful. Three parse-error fixtures added to the error table — π (U+03C0), 3 × 4, x ≤ y — each asserting MTParseErrorInvalidCharacter, covering a standalone non-ASCII char and non-ASCII embedded among valid ASCII input.

The error message also helpfully points users at the LaTeX-command alternative (\pi, \times). LGTM.
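
The point about appending the enum case can be illustrated with a small C sketch. The case names below are hypothetical and abbreviated; they are not the real MTParseErrors list, and the starting value of 1 is an assumption made for the example.

```c
/* Version 1 of a hypothetical error enum (names are illustrative only). */
enum { V1_MISMATCHED_BRACES = 1, V1_INVALID_COMMAND, V1_INVALID_LIMITS };

/* Version 2 with the new case appended: earlier raw values are unchanged. */
enum {
    V2_MISMATCHED_BRACES = 1,
    V2_INVALID_COMMAND,
    V2_INVALID_LIMITS,
    V2_INVALID_CHARACTER      /* appended last, takes the next free value */
};

/* Counter-example: inserting the new case in the middle shifts every
   later raw value by one, which breaks callers that stored the numbers. */
enum {
    BAD_MISMATCHED_BRACES = 1,
    BAD_INVALID_CHARACTER,
    BAD_INVALID_COMMAND,
    BAD_INVALID_LIMITS
};
```

In the appended version `V2_INVALID_LIMITS` still equals `V1_INVALID_LIMITS`, whereas in the counter-example `BAD_INVALID_COMMAND` no longer equals `V1_INVALID_COMMAND`.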

kostub and others added 2 commits June 26, 2026 00:55
…input

Previously, non-ASCII characters (e.g. π, ×, ≤) typed directly into a
LaTeX string were silently dropped by the builder — the rendered output
was missing the character and the caller's error: out-param stayed nil.

Adds MTParseErrorInvalidCharacter to the MTParseErrors enum and replaces
the silent continue in MTMathListBuilder.m with a setError:/return nil call
(scoped to ch > 0x7E) that matches the existing error-reporting model used
by MTParseErrorInvalidCommand. ASCII specials (space, $, %, etc.) continue
to be silently ignored as before.

Closes REN-5.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Extends the non-ASCII handling to cover all special characters, matching
real LaTeX behaviour in math mode:

- ~ renders as a space (LaTeX non-breaking space).
- % # $ now raise MTParseErrorInvalidCharacter (a comment, a macro
  parameter, and a math-mode toggle respectively - none valid here)
  instead of being silently dropped.
- Whitespace (space, tab, newline, CR) remains silently ignored.
- Non-ASCII literals continue to raise MTParseErrorInvalidCharacter.

Adds builder/error test cases for ~ and for % # $.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
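
The per-character policy listed in the commit message above could be written as a single decision function. This is a sketch under the assumption that the four outcomes can be modelled as an enum; the names are hypothetical, and characters not mentioned in the commit message fall into a catch-all `ACT_OTHER` bucket because the message does not say how they are handled.

```c
#include <stdint.h>

/* Hypothetical outcomes; these names are not taken from iosMath. */
typedef enum {
    ACT_IGNORE,      /* whitespace: dropped without an error */
    ACT_SPACE_ATOM,  /* ~ : rendered as a space */
    ACT_ERROR,       /* would raise MTParseErrorInvalidCharacter */
    ACT_OTHER        /* not covered by the commit message */
} char_action;

static char_action action_for(uint16_t ch) {
    switch (ch) {
        case ' ': case '\t': case '\n': case '\r':
            return ACT_IGNORE;
        case '~':
            return ACT_SPACE_ATOM;
        case '%': case '#': case '$':
            return ACT_ERROR;
        default:
            break;
    }
    /* Non-ASCII literals keep raising the error. */
    return (ch > 0x7E) ? ACT_ERROR : ACT_OTHER;
}
```

Because `~` is 0x7E, it is matched by the switch and never reaches the `ch > 0x7E` comparison.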
@kostub kostub force-pushed the em/2026-06-11-issues/t9 branch from cbef1d8 to 79a5f1e Compare June 25, 2026 19:33
@kostub (Owner, Author) commented on Jun 25, 2026

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for the tilde (~) character as a non-breaking space and adds validation to trigger an MTParseErrorInvalidCharacter error when encountering invalid characters in math mode, such as non-ASCII literals or special characters. The review feedback points out a potential issue where formatting the invalid character in the error message using %C can result in unpaired surrogates if the character is a UTF-16 high surrogate, which could lead to crashes. A code suggestion is provided to safely handle surrogate pairs and reconstruct the full Unicode codepoint.

Comment on lines +375 to +377
[self setError:MTParseErrorInvalidCharacter
       message:[NSString stringWithFormat:@"Unknown character U+%04X ('%C') is not a valid LaTeX input character in math mode. Use the corresponding LaTeX command instead.", ch, ch]];
return nil;

medium

When formatting the error message for an invalid character, if the character ch is a UTF-16 high surrogate (e.g., for characters outside the Basic Multilingual Plane like mathematical symbols), formatting it alone with %C creates an unpaired surrogate. This results in an invalid NSString which can cause crashes or rendering issues when displayed, logged, or serialized.

We should detect if ch is a high surrogate, consume the corresponding low surrogate to form a complete surrogate pair, and format the full Unicode codepoint and character string safely. If the next character is not a valid low surrogate, ensure we restore the lookahead character by calling unlookCharacter.

                uint32_t codePoint = ch;
                NSString *displayStr = [NSString stringWithFormat:@"%C", ch];
                if (ch >= 0xD800 && ch <= 0xDBFF && [self hasCharacters]) {
                    unichar low = [self getNextCharacter];
                    if (low >= 0xDC00 && low <= 0xDFFF) {
                        codePoint = ((ch - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
                        displayStr = [NSString stringWithFormat:@"%C%C", ch, low];
                    } else {
                        [self unlookCharacter];
                    }
                }
                [self setError:MTParseErrorInvalidCharacter
                       message:[NSString stringWithFormat:@"Unknown character U+%04X ('%@') is not a valid LaTeX input character in math mode. Use the corresponding LaTeX command instead.", (unsigned int)codePoint, displayStr]];
                return nil;
References
  1. When parsing tokens, avoid silently consuming invalid or non-ASCII characters. Ensure that lookahead characters are properly restored (e.g., by calling unlookCharacter) before breaking the parsing loop.
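
The surrogate arithmetic in the suggestion can be checked in isolation with a plain-C helper. This is a sketch, not the builder's code: `decode_utf16` is a hypothetical name, and it indexes a buffer instead of using getNextCharacter / unlookCharacter.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the Unicode scalar that starts at units[i] (caller guarantees
   i < length). *consumed receives the number of code units used.
   An unpaired surrogate is returned as-is with *consumed == 1. */
static uint32_t decode_utf16(const uint16_t *units, size_t length,
                             size_t i, size_t *consumed) {
    uint16_t hi = units[i];
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < length) {
        uint16_t lo = units[i + 1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            *consumed = 2;
            return (((uint32_t)hi - 0xD800) << 10)
                 + ((uint32_t)lo - 0xDC00) + 0x10000;
        }
    }
    *consumed = 1;
    return hi;
}
```

Worked example: 𝑎 is stored as the pair 0xD835 0xDC4E, and (0x35 << 10) + 0x4E + 0x10000 = 0x1D44E, which matches the scalar U+1D44E.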

kostub and others added 2 commits June 28, 2026 01:35
…ixes

- Decode UTF-16 surrogate pairs in the MTParseErrorInvalidCharacter message
  so above-BMP literals (e.g. 𝑎) report the real scalar U+1D44E instead of a
  lone surrogate U+D835.
- Silently ignore NUL (catcode 9), the one whitespace-like character TeX
  actually discards that we were missing. Keep form feed (\par) and vertical
  tab (catcode "other") as errors — they are not spaces in TeX.
- Update stale atomForCharacter: comments that said these characters are
  "skipped"/"not supported"; the builder now decides (ignore vs. error).
- Add tests: astral-literal error, surrogate-pair message decoding, and a
  whitespace/NUL silent-ignore regression guard.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The character is already known to be invalid, so the message doesn't need
to render the glyph. Drop the surrogate-pair decoding (only needed for the
glyph) and just report the UTF-16 code unit via %04X, which is plain integer
formatting with no crash risk. An above-BMP character reports its leading
surrogate, which is acceptable for an error string.

Removes the now-obsolete surrogate-decoding message test; the astral-literal
error-code case and the whitespace/NUL ignore test remain.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
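
The simplification described in the commit message above amounts to formatting one integer. A minimal C sketch follows, assuming a hypothetical message text and function name (the real message wording is not shown in this thread):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Formats a hypothetical error message from a single UTF-16 code unit.
   Only integer formatting is involved, so no string containing an
   unpaired surrogate is ever built. Returns snprintf's result. */
static int format_invalid_character(char *buf, size_t size,
                                    uint16_t code_unit) {
    return snprintf(buf, size, "Invalid character U+%04X in math mode",
                    (unsigned int)code_unit);
}
```

For π the message contains U+03C0. For an above-BMP literal such as 𝑎 only the leading surrogate is seen, so the message contains U+D835, which is the trade-off the commit message accepts.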
@kostub (Owner, Author) commented on Jun 28, 2026

Addressed code review feedback in ad020ca and d0e4b57:

#2 — Surrogate-pair error message. Above-BMP literals (e.g. 𝑎, U+1D44E) were already correctly raising MTParseErrorInvalidCharacter via their leading surrogate, but the message reported the lone surrogate. Since the character is already known invalid, rather than decode the pair just to render the glyph, the message now drops the glyph entirely and reports the code unit via %04X (plain integer formatting, no crash risk). Net simplification.

#3 — Stale factory comments. Updated both comments in atomForCharacter: that said these characters are "skipped"/"not supported" — the builder now decides (ignore whitespace vs. raise MTParseErrorInvalidCharacter), and & ~ ' are consumed earlier in the loop.

#4 — TeX whitespace. Verified the catcodes rather than implementing the suggestion as-is: VT (0x0B) and FF (0x0C) are not spaces in TeX (FF is \par, VT is catcode-"other"), so keeping them as errors is the TeX-faithful behavior. The one genuinely-discarded character we were missing is NUL (0x00, catcode 9), which is now silently ignored alongside space/tab/newline/CR.

Tests added: astral-literal error case in the parse-error table, and testIgnoredWhitespaceCharacters guarding that tab/newline/CR/NUL still parse without error.

swift build clean, full suite (335 tests) passing.
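
The whitespace decision from #4 can be summarised as a membership test. This sketch only answers whether a code unit is in the silently ignored set named in the comment above; the function name is hypothetical, and how control characters other than VT and FF are treated is not stated in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* True for the characters the comment above lists as silently ignored:
   NUL, tab, newline, carriage return, and space. */
static bool is_ignored_whitespace(uint16_t ch) {
    return ch == 0x00 || ch == 0x09 || ch == 0x0A
        || ch == 0x0D || ch == 0x20;
}
```

With this test, VT (0x0B) and FF (0x0C) are not in the ignored set, which is consistent with keeping them as errors.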

@kostub kostub merged commit 3516101 into master Jun 28, 2026
1 check passed
@kostub kostub deleted the em/2026-06-11-issues/t9 branch June 28, 2026 11:24