Experiment: LALR(1) parser from official MySQL grammar #429
Draft
JanJakes wants to merge 7 commits into
🤖 Lexer benchmark
Changes to lexer-related files were detected and triggered a benchmark:
Note: Hosted runners are noisy, and absolute numbers vary. Treat the results with caution and verify them locally. To reproduce locally:
Add a new monorepo package that builds a MySQL parser from the official MySQL 8.4 LTS Bison grammar. This commit sets up the package metadata, the build-workspace gitignore, and a README describing the compile pipeline and layout. Source and tooling follow in subsequent commits.
Bring the MySQL lexer and the generic parse-tree primitives (token and node classes) over from mysql-on-sqlite unchanged, as the starting point for this package. The lexer is adapted to the official grammar's token vocabulary in a later commit; this commit is a pure copy so that adaptation is reviewable as a focused diff.
Compile the MySQL parser's grammar from the official sources:
- fetch-mysql-grammar.sh pulls sql_yacc.yy and lex.h from the pinned
mysql-server tag (MYSQL_TAG, default mysql-8.4.3).
- run-bison.sh runs Bison 3.8.2 (the version MySQL 8.4 builds with) in a
pinned Docker image so the automaton is reproducible on any host.
- generate-parse-table.php converts Bison's --xml automaton into a
displacement-packed (comb-vector) ACTION/GOTO table.
- generate-token-map.php derives the lexer-id to Bison-token-number map by
resolving each terminal by name against the automaton, from the lexer's
keyword table crossed with lex.h. It also emits the few terminals the
scanner injects directly (end markers and the WITH ROLLUP contraction).
- bin/build-grammar runs the whole pipeline end to end.
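The displacement packing step above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the package's PHP code. The names pack_rows and lookup, the first-fit search, and the choice to record the owning state in the check array are assumptions made for illustration; the layout that generate-parse-table.php actually emits may differ.

```python
from typing import Dict, List, Optional, Tuple


def pack_rows(rows: List[Dict[int, int]]) -> Tuple[List[int], List[int], List[int]]:
    """Pack sparse rows (state -> {symbol: action}) into one shared vector.

    Returns (base, table, check). Row `s` lives at table[base[s] + symbol],
    and check[...] records which row owns each slot (-1 means free).
    """
    base: List[int] = []
    table: List[int] = []
    check: List[int] = []

    for state, row in enumerate(rows):
        if not row:
            base.append(0)  # an empty row owns no slot, so every lookup misses
            continue
        cols = sorted(row)
        disp = -cols[0]  # smallest displacement that keeps every index >= 0
        # First fit: slide the row right until none of its entries collide.
        while any(disp + c < len(check) and check[disp + c] != -1 for c in cols):
            disp += 1
        needed = disp + cols[-1] + 1
        if needed > len(check):
            grow = needed - len(check)
            table.extend([0] * grow)
            check.extend([-1] * grow)
        for c in cols:
            table[disp + c] = row[c]
            check[disp + c] = state
        base.append(disp)
    return base, table, check


def lookup(base: List[int], table: List[int], check: List[int],
           state: int, symbol: int) -> Optional[int]:
    """Return the packed entry, or None when the row has no entry for symbol."""
    idx = base[state] + symbol
    if 0 <= idx < len(check) and check[idx] == state:
        return table[idx]
    return None


# Made-up example rows: four states over five symbols.
ROWS: List[Dict[int, int]] = [{0: 5, 3: 7}, {1: 9}, {}, {0: 2, 1: 4}]
BASE, TABLE, CHECK = pack_rows(ROWS)
```

Interleaving rows this way stores only the occupied cells plus some slack, at the cost of one extra comparison per lookup to confirm slot ownership.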
Commit the artifacts produced by bin/build-grammar from the MySQL 8.4.3 grammar: a 5584-state LALR(1) parse table (zero reduce/reduce conflicts, all 59 shift/reduce conflicts resolved by Bison's precedence) and the 849-entry lexer-token to Bison-token map plus the scanner-injected terminals. Regenerate with 'composer run build-grammar'.
Rework remaining_tokens() to produce tokens in the vocabulary the generated parse table expects (Bison token numbers), reconciling the three places the lexer's token model differs from MySQL's grammar in a single pass:
- split "@name" into '@' IDENT and "@@" into '@' '@', since the grammar treats "@" as its own terminal;
- contract "WITH ROLLUP" into a single WITH_ROLLUP_SYM terminal;
- terminate the stream with END_OF_INPUT followed by Bison's end marker.
Token numbers come entirely from the generated map, so nothing is hard-coded. On invalid input the partial, unterminated stream is returned, preserving the lexer's native behaviour.
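A minimal sketch of that single reconciliation pass, written in Python for illustration rather than taken from the package's PHP lexer. The lexer token kinds AT_NAME and AT_AT, the function name normalize, and every number in TOKEN_NUMBERS are hypothetical; only WITH_ROLLUP_SYM and END_OF_INPUT come from the description above, and "$end" is assumed here as the name of Bison's end marker.

```python
from typing import Dict, List, Tuple

# Hypothetical excerpt of a generated name -> Bison token number map.
# The numbers are made up; the real map is produced by the build.
TOKEN_NUMBERS: Dict[str, int] = {
    "$end": 0, "@": 64, "IDENT": 300, "SELECT_SYM": 301, "GROUP_SYM": 302,
    "BY": 303, "WITH": 304, "WITH_ROLLUP_SYM": 305, "END_OF_INPUT": 306,
}


def normalize(tokens: List[Tuple[str, str]]) -> List[int]:
    """Map (kind, text) lexer tokens to grammar token numbers in one pass."""
    names: List[str] = []
    i = 0
    while i < len(tokens):
        kind = tokens[i][0]
        next_kind = tokens[i + 1][0] if i + 1 < len(tokens) else None
        if kind == "AT_NAME":          # "@name" becomes '@' IDENT
            names += ["@", "IDENT"]
        elif kind == "AT_AT":          # "@@" becomes '@' '@'
            names += ["@", "@"]
        elif kind == "WITH" and next_kind == "ROLLUP":
            names.append("WITH_ROLLUP_SYM")  # contract the two-token pair
            i += 1                           # skip the ROLLUP token as well
        else:
            names.append(kind)
        i += 1
    # Terminate the stream: END_OF_INPUT, then the end marker.
    names += ["END_OF_INPUT", "$end"]
    # All numbers come from the map; nothing is hard-coded in the pass itself.
    return [TOKEN_NUMBERS[name] for name in names]
```

Doing the contraction in the token stream keeps the parser a plain one-token-lookahead loop: it never has to peek past WITH to decide which rule applies.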
A table-driven shift-reduce parser that executes the generated comb-vector ACTION/GOTO tables and builds a WP_Parser_Node AST. Because the MySQL grammar is unambiguous for LALR(1) (Bison resolves shift/reduce conflicts by precedence and reports zero reduce/reduce conflicts), the runtime is a single deterministic loop with no conflict handling or backtracking. The table data is hoisted into locals and the ACTION lookup inlined, since the loop touches them every step. Also add src/load.php to wire the package's classes in dependency order.
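The shape of such a deterministic loop can be shown on a toy grammar. The sketch below is Python, not the package's PHP runtime, and uses plain dictionaries instead of comb-vector tables. The grammar (E → E '+' n | n), the hand-built ACTION/GOTO tables, and the tuple-based tree nodes are all assumptions for illustration.

```python
from typing import Dict, List, Tuple, Union

Node = Union[str, Tuple[str, list]]  # a terminal name or (rule lhs, children)

# Hand-built LR tables for the toy grammar:
#   rule 1: E -> E '+' n      rule 2: E -> n
# "$" stands in for the end marker that terminates every token stream.
RULES: Dict[int, Tuple[str, int]] = {1: ("E", 3), 2: ("E", 1)}
ACTION: Dict[Tuple[int, str], Tuple[str, int]] = {
    (0, "n"): ("shift", 2),
    (1, "+"): ("shift", 3), (1, "$"): ("accept", 0),
    (2, "+"): ("reduce", 2), (2, "$"): ("reduce", 2),
    (3, "n"): ("shift", 4),
    (4, "+"): ("reduce", 1), (4, "$"): ("reduce", 1),
}
GOTO: Dict[Tuple[int, str], int] = {(0, "E"): 1}


def parse(tokens: List[str]) -> Node:
    """Run the shift-reduce loop; tokens must end with the "$" end marker."""
    states: List[int] = [0]
    nodes: List[Node] = []
    pos = 0
    while True:
        action = ACTION.get((states[-1], tokens[pos]))
        if action is None:
            raise SyntaxError(f"unexpected {tokens[pos]!r} at position {pos}")
        kind, arg = action
        if kind == "shift":
            nodes.append(tokens[pos])
            states.append(arg)
            pos += 1
        elif kind == "reduce":
            lhs, length = RULES[arg]
            # Pop one state and one node per right-hand-side symbol.
            children = nodes[len(nodes) - length:]
            del nodes[len(nodes) - length:]
            del states[len(states) - length:]
            states.append(GOTO[(states[-1], lhs)])
            nodes.append((lhs, children))
        else:  # accept
            return nodes[0]
```

For example, parse(["n", "+", "n", "$"]) returns a left-nested tree for E '+' n. Each table cell holds at most one action, so the loop never has to choose between alternatives or undo a step.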
Add tests/benchmark.php, which measures the corpus parse rate and end-to-end (lex + parse) throughput with warmup + timed passes (best/median, JIT detection), mirroring the mysql-on-sqlite parser benchmark so the two compare directly. Document building, usage, and the benchmark in the README. On the MySQL server corpus the parser accepts 99.86% of queries at ~52k QPS without the JIT and ~117k QPS with the tracing JIT, about 4.8x the throughput of the multi-version LL parser on the same machine.
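The warmup-plus-timed-passes measurement can be sketched as follows. This is a Python illustration, not tests/benchmark.php itself; the function name benchmark, the default pass counts, and the stand-in workload are assumptions, and JIT detection is omitted.

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark(run_pass: Callable[[], int], warmup: int = 2,
              passes: int = 5) -> Dict[str, float]:
    """Time `passes` runs after `warmup` untimed runs; report queries per second.

    `run_pass` processes the whole corpus once and returns the query count.
    """
    for _ in range(warmup):
        run_pass()  # untimed: lets caches (and a JIT, if any) settle
    rates: List[float] = []
    for _ in range(passes):
        start = time.perf_counter()
        count = run_pass()
        elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
        rates.append(count / elapsed)
    return {"best": max(rates), "median": statistics.median(rates)}


# Stand-in workload: splitting strings instead of lexing and parsing SQL.
CORPUS = ["SELECT a, b FROM t WHERE a = 1"] * 1000


def one_pass() -> int:
    for query in CORPUS:
        query.split()
    return len(CORPUS)


RESULT = benchmark(one_pass)
```

Reporting both the best and the median pass gives a rough sense of how noisy the machine was during the run.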
A new experimental package, packages/mysql-parser, that builds a MySQL parser directly from MySQL's own grammar. It compiles MySQL 8.4 LTS's sql_yacc.yy and lex.h (unchanged) with the Bison version MySQL uses into a compact parse table, run by a small deterministic LALR(1) parser. The accepted language tracks a real MySQL release exactly, with no hand-maintained grammar to drift.
What it does
- composer run build-grammar: fetch pinned sources → Bison in Docker → generate the parse table and token map. Re-running reproduces the committed artifacts byte-for-byte.
What it doesn't do
- WP_Parser_Node tree, not a typed AST.
Numbers
Same machine, ~69.5k-query corpus, end-to-end (lex + parse). Winner in bold.
~4.5–4.8× faster steady-state parsing (4.8× without the JIT, 4.5× with a warm JIT). Boot time is a wash when cold and cheaper for the LL parser when warm. The trade-off for the speed is single-version scope and a raw AST.