An informal, cumulative, and competitive eval of frontier models using a JavaScript chess engine.
Let A be the current leading engine (initially 0000_original). A model/CLI is selected to improve it by creating a new engine B via prompt.md at max effort. If B passes an SPRT against A, B becomes the new leader; otherwise A stays the leader and the next attempt starts from A again. For example, 0002_sonnet_4_6 was derived from 0000_original, not 0001_haiku_4_5, because 0001_haiku_4_5 failed its SPRT.

         /---> 0001          /---> 0004
    0000 ---> 0002 ---> 0003 ---> 0005 ---> 0006 etc.

See tools/sprt.
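
For readers unfamiliar with SPRT, here is a minimal sketch of the decision rule, using a common normal approximation to the log-likelihood ratio. It is not the code in tools/sprt: the function names, the `elo0`/`elo1` hypotheses, and the `alpha`/`beta` error rates are all assumptions chosen for illustration.

```javascript
// Expected score for a given Elo advantage under the logistic Elo model.
function eloToScore(elo) {
  return 1 / (1 + Math.pow(10, -elo / 400));
}

// Approximate log-likelihood ratio of H1 (B is elo1 better) vs H0 (B is elo0 better),
// computed from B's wins, draws and losses against A.
function llr(wins, draws, losses, elo0, elo1) {
  const n = wins + draws + losses;
  if (n === 0) return 0;
  const mean = (wins + 0.5 * draws) / n;
  const variance = (wins + 0.25 * draws) / n - mean * mean;
  if (variance <= 0) return 0; // not enough variety in results to estimate
  const s0 = eloToScore(elo0);
  const s1 = eloToScore(elo1);
  return ((s1 - s0) * (2 * mean - s0 - s1)) / ((2 * variance) / n);
}

// Hypothetical defaults: elo0 = 0, elo1 = 10, alpha = beta = 0.05.
function sprt(wins, draws, losses, { elo0 = 0, elo1 = 10, alpha = 0.05, beta = 0.05 } = {}) {
  const lower = Math.log(beta / (1 - alpha));
  const upper = Math.log((1 - beta) / alpha);
  const value = llr(wins, draws, losses, elo0, elo1);
  if (value >= upper) return "accept H1"; // B is stronger: B becomes the leader
  if (value <= lower) return "accept H0"; // no improvement shown: A stays the leader
  return "continue";                      // keep playing games
}
```

The test stops as soon as the ratio crosses either bound, so clear improvements and clear regressions finish in fewer games than a fixed-length match.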
| Engine | Diff | Model | CLI | SPRT |
|---|---|---|---|---|
| 0010_fable_5 | Δ | Anthropic Claude Fable 5 | Claude Code | ✓ |
| 0009_opus_4_8 | Δ | Anthropic Claude Opus 4.8 | Claude Code | ✓ |
| 0008_opus_4_8 | Δ | Anthropic Claude Opus 4.8 | Claude Code | ✓ |
| 0007_opus_4_7 | Δ | Anthropic Claude Opus 4.7 | Claude Code | ✓ |
| 0006_gpt_5_5 | Δ | OpenAI GPT 5.5 | Codex | ✓ |
| 0005_opus_4_7 | Δ | Anthropic Claude Opus 4.7 | Claude Code | ✓ |
| 0004_gpt_5_5 | Δ | OpenAI GPT 5.5 | Codex | ✗ |
| 0003_opus_4_7 | Δ | Anthropic Claude Opus 4.7 | Claude Code | ✓ |
| 0002_sonnet_4_6 | Δ | Anthropic Claude Sonnet 4.6 | Claude Code | ✓ |
| 0001_haiku_4_5 | Δ | Anthropic Claude Haiku 4.5 | Claude Code | ✗ |
| 0000_original | | | | |
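
The SPRT column determines which engine each attempt was derived from. A small sketch of that bookkeeping, assuming attempts are processed in numeric order (`deriveLineage` is a hypothetical helper, not part of the repo):

```javascript
// Attempts in order; `passed` mirrors the SPRT column of the table above.
const attempts = [
  { name: "0001_haiku_4_5", passed: false },
  { name: "0002_sonnet_4_6", passed: true },
  { name: "0003_opus_4_7", passed: true },
  { name: "0004_gpt_5_5", passed: false },
  { name: "0005_opus_4_7", passed: true },
  { name: "0006_gpt_5_5", passed: true },
  { name: "0007_opus_4_7", passed: true },
  { name: "0008_opus_4_8", passed: true },
  { name: "0009_opus_4_8", passed: true },
  { name: "0010_fable_5", passed: true },
];

// Each attempt is derived from the leader at the time it was made;
// it only replaces the leader if it passed its SPRT.
function deriveLineage(baseline, attemptList) {
  let leader = baseline;
  const parents = {};
  for (const { name, passed } of attemptList) {
    parents[name] = leader;
    if (passed) leader = name;
  }
  return { leader, parents };
}

const lineage = deriveLineage("0000_original", attempts);
console.log(lineage.parents["0005_opus_4_7"]); // 0003_opus_4_7
console.log(lineage.leader);                   // 0010_fable_5
```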
Elo ratings are computed over the entire accumulated game corpus in ./gauntlet_pgn, anchored so that 0000_original is 1800 Elo.
| Rank | Engine | Elo | Games | Score | Draws |
|---|---|---|---|---|---|
| 1 | 0010_fable_5 | 2255 ±21.9 | 2000 | 77.3% | 23.6% |
| 2 | 0009_opus_4_8 | 2189 ±22.1 | 2000 | 69.8% | 24.6% |
| 3 | 0008_opus_4_8 | 2167 ±21.4 | 2000 | 67.2% | 24.8% |
| 4 | 0007_opus_4_7 | 2148 ±20.9 | 2000 | 64.7% | 24.4% |
| 5 | 0006_gpt_5_5 | 2057 ±21.1 | 2000 | 52.6% | 25.6% |
| 6 | 0005_opus_4_7 | 2028 ±20.6 | 2000 | 48.7% | 25.9% |
| 7 | 0003_opus_4_7 | 2018 ±20.0 | 2000 | 47.4% | 25.9% |
| 8 | 0004_gpt_5_5 | 2015 ±20.1 | 2000 | 47.0% | 26.8% |
| 9 | 0002_sonnet_4_6 | 1921 ±20.5 | 2000 | 34.8% | 20.2% |
| 10 | 0000_original | 1800 | 2000 | 21.0% | 12.4% |
| 11 | 0001_haiku_4_5 | 1785 ±21.2 | 2000 | 19.6% | 12.8% |
See tools/gauntlet (add a new engine's games) and tools/rate (rebuild this table).
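
As a rough aid for reading the table, the sketch below shows the standard logistic Elo relationship and what anchoring means. It is not how tools/rate or Ordo compute ratings (Ordo fits the whole pool at once, and its exact scale may differ); `expectedScore` and `anchorRatings` are hypothetical helpers.

```javascript
// Expected head-to-head score of A against B under the logistic Elo model.
function expectedScore(ratingA, ratingB) {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

// Shift every rating by the same offset so the anchor engine lands on anchorElo.
// Rating differences, and therefore expected scores, are unchanged.
function anchorRatings(ratings, anchorName, anchorElo) {
  const offset = anchorElo - ratings[anchorName];
  const anchored = {};
  for (const [name, rating] of Object.entries(ratings)) {
    anchored[name] = rating + offset;
  }
  return anchored;
}

// Example with ratings taken from the table above.
const headToHead = expectedScore(2255, 1800); // 0010_fable_5 vs 0000_original
console.log(headToHead > 0.5); // true: the higher-rated engine is favoured
```

Note that the Score column is each engine's result against the whole pool, so it will not match a single head-to-head expectation like the one above.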
- https://github.com/Disservin/fastchess - SPRT and tournament manager
- https://github.com/michiguel/Ordo - Elo rating calculation