Experiment: LALR(1) parser from official MySQL grammar #429

Draft
JanJakes wants to merge 7 commits into trunk from lalr-parser

JanJakes (Member) commented Jun 10, 2026

A new experimental package, packages/mysql-parser, that builds a MySQL parser directly from MySQL's own grammar. It compiles MySQL 8.4 LTS's sql_yacc.yy and lex.h (unchanged) with the Bison version MySQL uses into a compact parse table, run by a small deterministic LALR(1) parser. The accepted language tracks a real MySQL release exactly, with no hand-maintained grammar to drift.

What it does

  • Reproducible build (composer run build-grammar): fetch pinned sources → Bison in Docker → generate the parse table and token map. Re-running reproduces the committed artifacts byte-for-byte.
  • Copies the existing lexer and adapts it (in a separate commit) to emit MySQL's token vocabulary.
  • Deterministic runtime: Bison reports zero reduce/reduce conflicts for the 8.4 grammar and resolves every shift/reduce conflict by precedence, so the runtime is a plain shift-reduce loop with no GLR, backtracking, or conflict tables.
  • A corpus benchmark mirroring the existing one.

What it doesn't do

  • Doesn't replace the current parser — it's standalone and nothing depends on it.
  • Single-version: tracks 8.4 exactly, so it rejects ~0.13% of the corpus (pre-8.4 / removed syntax).
  • Builds a raw WP_Parser_Node tree, not a typed AST.
  • The build needs Docker (for pinned Bison); the runtime stays pure PHP.

Numbers

Same machine, ~69.5k-query corpus, end-to-end (lex + parse). Winner in bold.

| Metric | LL (trunk) | LALR (this) |
|---|---|---|
| Throughput, no JIT | 10,910 QPS | **52,521 QPS** |
| Throughput, warm JIT | 25,686 QPS | **116,060 QPS** |
| Cold boot, no opcache | ~2.8 ms | **~2.5 ms** |
| Warm boot, opcache+JIT | **~0.34 ms** | ~1.13 ms |
| Memory, no opcache | **~3.4 MB** | ~3.5 MB |
| Memory, opcache worker | **~1.5 MB** | ~3.0 MB |
| Data file on disk | **65 KB** | 121 KB |
| Parse rate | **99.99%** | 99.86% |

Steady-state parsing is ~4.5–4.8× faster (4.8× without JIT, 4.5× with warm JIT). Cold boot is roughly equal for both parsers; warm boot is cheaper for the LL parser (~0.34 ms vs ~1.13 ms). The trade-off for the speed is single-version scope and a raw parse tree instead of a typed AST.
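The speedup figures follow directly from the throughput rows of the table. A minimal check, written in Python purely for illustration (the package itself is PHP); the dictionary and function names are made up for this sketch:

```python
# Throughput figures (queries per second) copied from the Numbers table.
ll_qps = {"no JIT": 10_910, "warm JIT": 25_686}
lalr_qps = {"no JIT": 52_521, "warm JIT": 116_060}


def speedup(config: str) -> float:
    """Ratio of LALR throughput to LL throughput for one configuration."""
    return lalr_qps[config] / ll_qps[config]


for config in ll_qps:
    print(f"{config}: {speedup(config):.1f}x")
```

Rounded to one decimal place, the ratios come out to 4.8 without JIT and 4.5 with warm JIT.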

github-actions (Bot, Contributor) commented Jun 10, 2026

🤖 Lexer benchmark

Changes to lexer-related files were detected and triggered a benchmark:

| Config | Base (QPS) | This PR (QPS) | Speedup |
|---|---|---|---|
| no JIT | 44,418 | 72,445 | 1.63× |
| tracing JIT | 119,762 | 165,676 | 1.38× |

Note: Hosted runners are noisy, and absolute numbers vary. Treat the results with caution and verify them locally.

To reproduce locally:

cd packages/mysql-on-sqlite && composer run bench-lexer

JanJakes changed the title from "Add an experimental MySQL parser built from the official 8.4 grammar" to "Experiment: LALR(1) parser from official MySQL grammar" on Jun 10, 2026
JanJakes added 2 commits June 10, 2026 14:51
Add a new monorepo package that builds a MySQL parser from the official
MySQL 8.4 LTS Bison grammar. This commit sets up the package metadata, the
build-workspace gitignore, and a README describing the compile pipeline and
layout. Source and tooling follow in subsequent commits.

Bring the MySQL lexer and the generic parse-tree primitives (token and
node classes) over from mysql-on-sqlite unchanged, as the starting point
for this package. The lexer is adapted to the official grammar's token
vocabulary in a later commit; this commit is a pure copy so that
adaptation is reviewable as a focused diff.
JanJakes added 5 commits June 10, 2026 16:18
Compile the MySQL parser's grammar from the official sources:

  - fetch-mysql-grammar.sh pulls sql_yacc.yy and lex.h from the pinned
    mysql-server tag (MYSQL_TAG, default mysql-8.4.3).
  - run-bison.sh runs Bison 3.8.2 (the version MySQL 8.4 builds with) in a
    pinned Docker image so the automaton is reproducible on any host.
  - generate-parse-table.php converts Bison's --xml automaton into a
    displacement-packed (comb-vector) ACTION/GOTO table.
  - generate-token-map.php derives the lexer-id to Bison-token-number map by
    resolving each terminal by name against the automaton, from the lexer's
    keyword table crossed with lex.h. It also emits the few terminals the
    scanner injects directly (end markers and the WITH ROLLUP contraction).
  - bin/build-grammar runs the whole pipeline end to end.
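To illustrate what a displacement-packed (comb-vector) table is, here is a minimal sketch in Python (used only for illustration; the package's generator is PHP). It uses a simple first-fit strategy with non-negative offsets, which is an assumption of this sketch and not necessarily how generate-parse-table.php or Bison lay out their tables. The names `pack_rows` and `lookup` are hypothetical.

```python
def pack_rows(rows):
    """Pack sparse rows ({column: value}) into shared arrays.

    Each row gets a displacement (base offset) chosen so that its filled
    columns land in free slots; `check` records which row owns each slot.
    """
    base, table, check = [], [], []
    for row_id, row in enumerate(rows):
        offset = 0
        # First fit: slide the row right until none of its columns collide.
        while any(
            offset + col < len(check) and check[offset + col] is not None
            for col in row
        ):
            offset += 1
        needed = offset + max(row, default=-1) + 1
        if needed > len(table):
            grow = needed - len(table)
            table.extend([None] * grow)
            check.extend([None] * grow)
        for col, value in row.items():
            table[offset + col] = value
            check[offset + col] = row_id
        base.append(offset)
    return base, table, check


def lookup(base, table, check, row_id, col, default=None):
    """Return the packed value, or `default` if the slot belongs to another row."""
    idx = base[row_id] + col
    if 0 <= idx < len(check) and check[idx] == row_id:
        return table[idx]
    return default


# Three sparse rows; a dense layout would need 3 x 4 = 12 cells.
rows = [{0: "s1", 3: "s2"}, {1: "r1", 2: "r2"}, {0: "s4"}]
base, table, check = pack_rows(rows)
```

In this toy input the first two rows interleave at offset 0 and the third is displaced to offset 4, so five slots hold what a dense 3×4 matrix would spread over twelve. The `check` array is what lets a lookup tell a real entry from a slot owned by another row.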

Commit the artifacts produced by bin/build-grammar from the MySQL 8.4.3
grammar: a 5584-state LALR(1) parse table (zero reduce/reduce conflicts, all
59 shift/reduce conflicts resolved by Bison's precedence) and the 849-entry
lexer-token to Bison-token map plus the scanner-injected terminals.
Regenerate with 'composer run build-grammar'.

Rework remaining_tokens() to produce tokens in the vocabulary the generated
parse table expects (Bison token numbers), reconciling the three places the
lexer's token model differs from MySQL's grammar in a single pass:

  - split "@name" into '@' IDENT and "@@" into '@' '@', since the grammar
    treats "@" as its own terminal;
  - contract "WITH ROLLUP" into a single WITH_ROLLUP_SYM terminal;
  - terminate the stream with END_OF_INPUT followed by Bison's end marker.

Token numbers come entirely from the generated map, so nothing is hard-coded.
On invalid input the partial, unterminated stream is returned, preserving the
lexer's native behaviour.
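The three reconciliations described in this commit can be sketched as a single pass over the token stream. This is an illustrative Python sketch, not the PHP remaining_tokens() implementation: tokens are modeled as (kind, text) pairs with names instead of Bison token numbers, and the kinds `USER_VAR`, `SYS_VAR_PREFIX`, and `$end` are hypothetical labels chosen for the example.

```python
END_OF_INPUT = ("END_OF_INPUT", "")
BISON_END = ("$end", "")  # hypothetical label for Bison's end marker


def adapt(tokens):
    """Rewrite lexer tokens into the grammar's vocabulary in one pass."""
    out = []
    i = 0
    while i < len(tokens):
        kind, text = tokens[i]
        if kind == "USER_VAR":
            # "@name" becomes '@' followed by IDENT.
            out.append(("@", "@"))
            out.append(("IDENT", text[1:]))
        elif kind == "SYS_VAR_PREFIX":
            # "@@" becomes two '@' terminals.
            out.extend([("@", "@"), ("@", "@")])
        elif kind == "WITH" and i + 1 < len(tokens) and tokens[i + 1][0] == "ROLLUP":
            # "WITH ROLLUP" contracts into a single terminal.
            out.append(("WITH_ROLLUP_SYM", "WITH ROLLUP"))
            i += 1  # skip the ROLLUP token that was just consumed
        else:
            out.append((kind, text))
        i += 1
    # Terminate the stream: END_OF_INPUT, then the end marker.
    out.append(END_OF_INPUT)
    out.append(BISON_END)
    return out


adapted = adapt([
    ("SELECT", "SELECT"),
    ("USER_VAR", "@a"),
    ("WITH", "WITH"),
    ("ROLLUP", "ROLLUP"),
])
```

The sketch always terminates the stream; the commit notes that on invalid input the real lexer returns the partial, unterminated stream instead.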

A table-driven shift-reduce parser that executes the generated comb-vector
ACTION/GOTO tables and builds a WP_Parser_Node AST. Because the MySQL grammar
is unambiguous for LALR(1) (Bison resolves shift/reduce conflicts by precedence
and reports zero reduce/reduce conflicts), the runtime is a single deterministic
loop with no conflict handling or backtracking. The table data is hoisted into
locals and the ACTION lookup inlined, since the loop touches them every step.

Also add src/load.php to wire the package's classes in dependency order.
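For readers unfamiliar with table-driven LR parsing, the following Python sketch shows the shape of such a deterministic loop on a toy grammar (E → E '+' n | n). It is illustrative only: the tables are hand-written dictionaries, not the generated comb vectors, and the tree is built from plain tuples in place of WP_Parser_Node. All names here are assumptions of the sketch.

```python
# Hand-built LR tables for the toy grammar:
#   rule 1: E -> E '+' n
#   rule 2: E -> n
ACTION = {
    (0, "n"): ("shift", 2),
    (1, "+"): ("shift", 3),
    (1, "$"): ("accept",),
    (2, "+"): ("reduce", 2),
    (2, "$"): ("reduce", 2),
    (3, "n"): ("shift", 4),
    (4, "+"): ("reduce", 1),
    (4, "$"): ("reduce", 1),
}
GOTO = {(0, "E"): 1}
RULES = {1: ("E", 3), 2: ("E", 1)}  # rule number -> (left-hand side, length)


def parse(tokens):
    """Run the shift-reduce loop; every step is a single table lookup."""
    stream = list(tokens) + ["$"]
    states = [0]  # state stack
    nodes = []    # parse-tree stack, parallel to states[1:]
    pos = 0
    while True:
        token = stream[pos]
        action = ACTION.get((states[-1], token))
        if action is None:
            raise SyntaxError(f"unexpected {token!r} at position {pos}")
        if action[0] == "shift":
            states.append(action[1])
            nodes.append(token)
            pos += 1
        elif action[0] == "reduce":
            lhs, length = RULES[action[1]]
            split = len(nodes) - length
            children = nodes[split:]
            del nodes[split:]
            del states[len(states) - length:]
            nodes.append((lhs, children))
            states.append(GOTO[(states[-1], lhs)])
        else:  # accept
            return nodes[0]


tree = parse(["n", "+", "n"])
```

There is exactly one action per (state, token) pair, so the loop never needs to try alternatives; a missing entry is a syntax error.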

Add tests/benchmark.php, which measures the corpus parse rate and end-to-end
(lex + parse) throughput with warmup + timed passes (best/median, JIT
detection), mirroring the mysql-on-sqlite parser benchmark so the two compare
directly. Document building, usage, and the benchmark in the README.

On the MySQL server corpus the parser accepts 99.86% of queries at ~52k QPS
without the JIT and ~117k QPS with the tracing JIT, about 4.8x the throughput
of the multi-version LL parser on the same machine.
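The "warmup + timed passes (best/median)" method can be sketched as follows. This is a generic Python illustration of the measurement approach, not a port of tests/benchmark.php; the function name `bench`, its parameters, and the stand-in workload are assumptions.

```python
import statistics
import time


def bench(fn, items, warmup=1, passes=5):
    """Run `fn` over `items`: untimed warmup passes, then timed passes.

    Returns the best and median throughput in items per second.
    """
    for _ in range(warmup):
        for item in items:
            fn(item)
    rates = []
    for _ in range(passes):
        start = time.perf_counter()
        for item in items:
            fn(item)
        # Guard against a zero reading on very coarse clocks.
        elapsed = max(time.perf_counter() - start, 1e-9)
        rates.append(len(items) / elapsed)
    return {"best": max(rates), "median": statistics.median(rates)}


# Stand-in workload; a real run would lex and parse each corpus query.
result = bench(str.upper, ["select 1"] * 1000)
```

Reporting both best and median helps separate the parser's achievable speed from run-to-run noise.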