Skip to content

Add teaching mode to ds4-agent, with teach-bench benchmark#391

Open
rowantrollope wants to merge 4 commits into
antirez:mainfrom
rowantrollope:teaching-mode
Open

Add teaching mode to ds4-agent, with teach-bench benchmark#391
rowantrollope wants to merge 4 commits into
antirez:mainfrom
rowantrollope:teaching-mode

Conversation

@rowantrollope

Copy link
Copy Markdown

Summary

ds4-agent can now act as a programming mentor. A static teaching contract in the system prompt makes it emit <teach> asides (rendered specially) calibrated against a persistent per-user learner profile it maintains via a new learn tool. Levels off/low/medium/high via --teach / /teach; profile inspection via /profile.

tests/teach-bench/ benchmarks aside quality over an 8-task corpus with an LLM judge; experiments.md logs a 6-run prompt-tuning study (baseline retained — composite is currently bounded by model quality and benchmark noise, not prompt wording).

Details

  • ds4_agent.c: teaching contract (agent_teach_prompt), teaching levels with parse/name helpers, <teach> aside rendering, learn tool with persistent learner profile (read/append/consolidate/update-on-exit), per-session level + profile injected as a system note so the rendered sysprompt (and its KV checkpoint) stays byte-identical across levels.
  • tests/teach-bench/: stdlib-only Python CLI (bench/run/eval/rate/report/history/selftest), trace-based aside extraction, OpenAI judge scoring five dimensions against the mentor contract, per-run teaching-prompt hashing so scores tie to prompt versions.
  • tests/teach-bench/experiments.md: tuning log — four hypotheses tested across six benchmark runs, all rejected; the shipped prompt is the measured-best baseline.

Testing

  • make ds4-agent clean build; ./ds4_agent_test passes on top of latest main (rebased on d881f2a).
  • Full teach-bench run: 8/8 task checks pass, 8/8 runs emit asides, judge composite 74.75.

🤖 Generated with Claude Code

rowantrollope and others added 4 commits June 11, 2026 09:05
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ns rejected)

Runs: 20260610-212422 (baseline 74.75), iter1-no-reteach 74.0,
iter1-confirm 69.8, iter2-lead-hook 75.2. No change beat the noise band;
baseline prompt 8eb831fb0183 retained.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ine kept)

iter3-prediction-example 71.4 (example transferred style but produced
speculation-before-evidence), iter4-no-progress-reports 74.7 (killed
narration, failure moved to overlong). Baseline 8eb831fb0183 @ 74.75
remains best after 6 runs / 4 hypotheses.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Adjust teachbench.py repo-root lookup (now two levels up) and doc paths.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@rowantrollope

Copy link
Copy Markdown
Author

There were a couple design choices I’d love feedback on from folks. It was not clear what the best approach was;

  1. to truly be useful a good teacher remembers what lessons they’ve taught before and what they know about a student.

So I created a tool which allows the LLM to save what it learns about the developer with appropriate instructions to remember things only useful for teaching. No judgements and so on. It saves this in a profile file.

Some people might not like that. But I found without some form of memory it is hard to make a good experience

It might be good to generalize this, since it's basically the beginnings of memory for ds4-agent and memory can be a very good addition to agent over time. Of course users can always add their own memory skills and approaches, but I do wonder if this is something that should be provided (initially in limited scope, and only using local MD files) just to improve the overall ds4-agent experience.

  1. I tried different approaches to the teaching tokens. Like: interspersed amongst the thinking tokens, after each tool call, etc.

The teaching tokens are output only at the end of a turn so it has all the data about what to teach to the user

  1. When /teach is on I tried turning off the visibility of thinking tokens, but a) I found it makes ds4-agent "seem" too slow (other SOTA coding agents have this problem when they spend long periods thinking since there is no output, and b) I found the thinking tokens are pretty useful for teaching (if you want to read them and follow along).

  2. I think we could do a better job aggregating tool calls to make the overall outputs more readable. For example:

reading x.c
reading main.c
reading other file.c

Could be :

reading x.c, main.c, otherfile.c...


Anyway I hope folks find this valuable and would welcome all inputs and contributions to improve it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant