op12no2/patchwork

Patchwork

An informal, cumulative and competitive frontier model eval using a JavaScript chess engine.

Procedure

Let A be the current leading engine (initially 0000_original). A model/CLI pair is selected to improve it, producing a new engine B from prompt.md at maximum effort. If B passes an SPRT against A, B becomes the new leader; otherwise A stays the leader and the next candidate is again derived from A. For example, 0002_sonnet_4_6 was derived from 0000_original, not from 0001_haiku_4_5.

    /---> 0001          /---> 0004
0000 ---> 0002 ---> 0003 ---> 0005 ---> 0006 etc.

See tools/sprt.
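
For readers unfamiliar with SPRT (sequential probability ratio test), the sketch below shows one common way such a test is evaluated from a win/draw/loss tally, using a normal approximation to the log-likelihood ratio. It is not taken from tools/sprt: the function names `eloToScore` and `sprt`, the hypotheses `elo0` and `elo1`, and the error rates `alpha` and `beta` are illustrative assumptions, and the real tool may use different bounds or a different model.

```javascript
// Hypothetical sketch, not the code in tools/sprt.
// Convert an Elo difference into an expected score under the logistic model.
function eloToScore(elo) {
  return 1 / (1 + Math.pow(10, -elo / 400));
}

// Evaluate an SPRT for "B is elo1 stronger than A" (H1) against
// "B is elo0 stronger than A" (H0), given B's results versus A.
// The default parameter values are assumptions chosen for illustration.
function sprt(wins, draws, losses, { elo0 = 0, elo1 = 10, alpha = 0.05, beta = 0.05 } = {}) {
  // Wald's stopping bounds for the log-likelihood ratio.
  const lower = Math.log(beta / (1 - alpha));
  const upper = Math.log((1 - beta) / alpha);

  const n = wins + draws + losses;
  let llr = 0;
  if (n > 0) {
    // Mean score per game and its per-game variance.
    const score = (wins + 0.5 * draws) / n;
    const variance =
      (wins * (1 - score) ** 2 + draws * (0.5 - score) ** 2 + losses * score ** 2) / n;
    if (variance > 0) {
      const s0 = eloToScore(elo0);
      const s1 = eloToScore(elo1);
      // Normal approximation to the log-likelihood ratio of H1 versus H0.
      llr = (n * (s1 - s0) * (2 * score - s0 - s1)) / (2 * variance);
    }
  }

  let decision = "continue";
  if (llr >= upper) decision = "accept H1";
  else if (llr <= lower) decision = "accept H0";
  return { llr, lower, upper, decision };
}

// Example: B scored 400 wins, 300 draws and 300 losses against A.
console.log(sprt(400, 300, 300).decision);
```

Under these assumptions, "passing" corresponds to the `accept H1` outcome, and games keep being played while the decision is `continue`.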

Progress

| Engine | Diff | Model | CLI | SPRT |
| --- | --- | --- | --- | --- |
| 0010_fable_5 | Δ | Anthropic Claude Fable 5 | Claude Code | |
| 0009_opus_4_8 | Δ | Anthropic Claude Opus 4.8 | Claude Code | |
| 0008_opus_4_8 | Δ | Anthropic Claude Opus 4.8 | Claude Code | |
| 0007_opus_4_7 | Δ | Anthropic Claude Opus 4.7 | Claude Code | |
| 0006_gpt_5_5 | Δ | OpenAI GPT 5.5 | Codex | |
| 0005_opus_4_7 | Δ | Anthropic Claude Opus 4.7 | Claude Code | |
| 0004_gpt_5_5 | Δ | OpenAI GPT 5.5 | Codex | |
| 0003_opus_4_7 | Δ | Anthropic Claude Opus 4.7 | Claude Code | |
| 0002_sonnet_4_6 | Δ | Anthropic Claude Sonnet 4.6 | Claude Code | |
| 0001_haiku_4_5 | Δ | Anthropic Claude Haiku 4.5 | Claude Code | |
| 0000_original | | | | |

Ratings

Elo ratings computed over the whole accumulated game corpus in ./gauntlet_pgn, anchored so that 0000_original is fixed at 1800 Elo.

| Rank | Engine | Elo | Games | Score | Draws |
| --- | --- | --- | --- | --- | --- |
| 1 | 0010_fable_5 | 2255 ±21.9 | 2000 | 77.3% | 23.6% |
| 2 | 0009_opus_4_8 | 2189 ±22.1 | 2000 | 69.8% | 24.6% |
| 3 | 0008_opus_4_8 | 2167 ±21.4 | 2000 | 67.2% | 24.8% |
| 4 | 0007_opus_4_7 | 2148 ±20.9 | 2000 | 64.7% | 24.4% |
| 5 | 0006_gpt_5_5 | 2057 ±21.1 | 2000 | 52.6% | 25.6% |
| 6 | 0005_opus_4_7 | 2028 ±20.6 | 2000 | 48.7% | 25.9% |
| 7 | 0003_opus_4_7 | 2018 ±20.0 | 2000 | 47.4% | 25.9% |
| 8 | 0004_gpt_5_5 | 2015 ±20.1 | 2000 | 47.0% | 26.8% |
| 9 | 0002_sonnet_4_6 | 1921 ±20.5 | 2000 | 34.8% | 20.2% |
| 10 | 0000_original | 1800 | 2000 | 21.0% | 12.4% |
| 11 | 0001_haiku_4_5 | 1785 ±21.2 | 2000 | 19.6% | 12.8% |

See tools/gauntlet (add a new engine's games) and tools/rate (rebuild this table).
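
To make the anchoring concrete, here is a minimal sketch of how a set of relative ratings could be shifted so that one engine sits at a fixed value, together with the standard logistic expected-score formula. It is an illustration only, not the code in tools/rate: the function names `expectedScore` and `anchorRatings` and the raw rating values in the example are hypothetical.

```javascript
// Hypothetical sketch, not the code in tools/rate.
// Expected score of A against B under the standard logistic Elo model.
function expectedScore(ratingA, ratingB) {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

// Shift every rating by the same constant so that the anchor engine
// ends up at anchorElo. Rating differences are left unchanged.
function anchorRatings(ratings, anchorName, anchorElo) {
  if (!(anchorName in ratings)) {
    throw new Error("unknown anchor engine: " + anchorName);
  }
  const shift = anchorElo - ratings[anchorName];
  const anchored = {};
  for (const [name, elo] of Object.entries(ratings)) {
    anchored[name] = elo + shift;
  }
  return anchored;
}

// Example with made-up raw ratings centred on zero.
const raw = { engine_a: -150, engine_b: 0, engine_c: 150 };
console.log(anchorRatings(raw, "engine_a", 1800));

// Head-to-head expectation implied by a 455-point gap (2255 vs 1800).
console.log(expectedScore(2255, 1800).toFixed(3));
```

Because only rating differences carry meaning in the Elo model, the choice of anchor affects the absolute numbers in the table but not the gaps between engines. The Score column in the table is measured against the whole pool, so it will not match a single head-to-head expectation like the one printed above.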

Acknowledgements
