Add teaching mode to ds4-agent, with teach-bench benchmark by rowantrollope · Pull Request #391 · antirez/ds4

rowantrollope · 2026-06-11T16:12:24Z

Summary

ds4-agent can now act as a programming mentor. A static teaching contract in the system prompt makes it emit <teach> asides (rendered specially) calibrated against a persistent per-user learner profile it maintains via a new learn tool. Levels off/low/medium/high via --teach / /teach; profile inspection via /profile.

tests/teach-bench/ benchmarks aside quality over an 8-task corpus with an LLM judge; experiments.md logs a 6-run prompt-tuning study (baseline retained — composite is currently bounded by model quality and benchmark noise, not prompt wording).

Details

ds4_agent.c: teaching contract (agent_teach_prompt), teaching levels with parse/name helpers, <teach> aside rendering, learn tool with persistent learner profile (read/append/consolidate/update-on-exit), per-session level + profile injected as a system note so the rendered sysprompt (and its KV checkpoint) stays byte-identical across levels.
tests/teach-bench/: stdlib-only Python CLI (bench/run/eval/rate/report/history/selftest), trace-based aside extraction, OpenAI judge scoring five dimensions against the mentor contract, per-run teaching-prompt hashing so scores tie to prompt versions.
tests/teach-bench/experiments.md: tuning log — four hypotheses tested across six benchmark runs, all rejected; the shipped prompt is the measured-best baseline.

Testing

make ds4-agent clean build; ./ds4_agent_test passes on top of latest main (rebased on d881f2a).
Full teach-bench run: 8/8 task checks pass, 8/8 runs emit asides, judge composite 74.75.

🤖 Generated with Claude Code

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ns rejected) Runs: 20260610-212422 (baseline 74.75), iter1-no-reteach 74.0, iter1-confirm 69.8, iter2-lead-hook 75.2. No change beat the noise band; baseline prompt 8eb831fb0183 retained. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ine kept) iter3-prediction-example 71.4 (example transferred style but produced speculation-before-evidence), iter4-no-progress-reports 74.7 (killed narration, failure moved to overlong). Baseline 8eb831fb0183 @ 74.75 remains best after 6 runs / 4 hypotheses. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Adjust teachbench.py repo-root lookup (now two levels up) and doc paths. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

rowantrollope · 2026-06-12T20:00:59Z

There were a couple design choices I’d love feedback on from folks. It was not clear what the best approach was;

to truly be useful a good teacher remembers what lessons they’ve taught before and what they know about a student.

So I created a tool which allows the LLM to save what it learns about the developer with appropriate instructions to remember things only useful for teaching. No judgements and so on. It saves this in a profile file.

Some people might not like that. But I found without some form of memory it is hard to make a good experience

It might be good to generalize this, since it's basically the beginnings of memory for ds4-agent and memory can be a very good addition to agent over time. Of course users can always add their own memory skills and approaches, but I do wonder if this is something that should be provided (initially in limited scope, and only using local MD files) just to improve the overall ds4-agent experience.

I tried different approaches to the teaching tokens. Like: interspersed amongst the thinking tokens, after each tool call, etc.

The teaching tokens are output only at the end of a turn so it has all the data about what to teach to the user

When /teach is on I tried turning off the visibility of thinking tokens, but a) I found it makes ds4-agent "seem" too slow (other SOTA coding agents have this problem when they spend long periods thinking since there is no output, and b) I found the thinking tokens are pretty useful for teaching (if you want to read them and follow along).
I think we could do a better job aggregating tool calls to make the overall outputs more readable. For example:

reading x.c
reading main.c
reading other file.c

Could be :

reading x.c, main.c, otherfile.c...

Anyway I hope folks find this valuable and would welcome all inputs and contributions to improve it.

rowantrollope and others added 4 commits June 11, 2026 09:05

Add teaching mode, teach-bench benchmark, and docs

e5cd391

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Move teach-bench under tests/

6085394

Adjust teachbench.py repo-root lookup (now two levels up) and doc paths. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add teaching mode to ds4-agent, with teach-bench benchmark#391

Add teaching mode to ds4-agent, with teach-bench benchmark#391
rowantrollope wants to merge 4 commits into
antirez:mainfrom
rowantrollope:teaching-mode

rowantrollope commented Jun 11, 2026

Uh oh!

rowantrollope commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rowantrollope commented Jun 11, 2026

Summary

Details

Testing

Uh oh!

rowantrollope commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant