Add teaching mode to ds4-agent, with teach-bench benchmark #391
rowantrollope wants to merge 4 commits
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ns rejected) Runs: 20260610-212422 (baseline 74.75), iter1-no-reteach 74.0, iter1-confirm 69.8, iter2-lead-hook 75.2. No change beat the noise band; baseline prompt 8eb831fb0183 retained. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ine kept) iter3-prediction-example 71.4 (example transferred style but produced speculation-before-evidence), iter4-no-progress-reports 74.7 (killed narration, failure moved to overlong). Baseline 8eb831fb0183 @ 74.75 remains best after 6 runs / 4 hypotheses. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Adjust teachbench.py repo-root lookup (now two levels up) and doc paths. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There were a couple of design choices I'd love feedback on from folks, since it was not clear what the best approach was:

I created a tool that lets the LLM save what it learns about the developer, with instructions to remember only things that are useful for teaching (no judgements and so on). It saves this in a profile file. Some people might not like that, but I found that without some form of memory it is hard to make a good experience. It might be good to generalize this, since it's basically the beginnings of memory for ds4-agent, and memory can be a very good addition to an agent over time. Of course users can always add their own memory skills and approaches, but I do wonder if this is something that should be provided (initially in limited scope, and only using local MD files) just to improve the overall ds4-agent experience.
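To make the "local MD file" idea concrete, here is a minimal sketch of an append-only learner profile. It is not the code in this PR: the function names, the `# Learner profile` header, and the dated-bullet layout are all assumptions chosen for illustration.

```python
from datetime import date
from pathlib import Path


def append_observation(profile_path: Path, observation: str, today: date) -> None:
    """Append one dated bullet to a markdown learner profile.

    Hypothetical layout: a single header followed by one bullet per observation.
    """
    profile_path.parent.mkdir(parents=True, exist_ok=True)
    if not profile_path.exists():
        profile_path.write_text("# Learner profile\n\n", encoding="utf-8")
    line = f"- {today.isoformat()}: {observation.strip()}\n"
    with profile_path.open("a", encoding="utf-8") as f:
        f.write(line)


def read_profile(profile_path: Path) -> list[str]:
    """Return the stored observations (bullet lines only), oldest first."""
    if not profile_path.exists():
        return []
    text = profile_path.read_text(encoding="utf-8")
    return [ln[2:] for ln in text.splitlines() if ln.startswith("- ")]
```

Keeping the file as plain markdown means a user can open it, edit it, or delete it with any editor, which may help with the concern that some people will not like the agent remembering things about them.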
The teaching tokens are output only at the end of a turn, so the model has all the data about what to teach the user.
reading x.c could be: reading x.c, main.c, otherfile.c...

Anyway, I hope folks find this valuable, and I would welcome all input and contributions to improve it.
Summary
ds4-agent can now act as a programming mentor. A static teaching contract in the system prompt makes it emit `<teach>` asides (rendered specially) calibrated against a persistent per-user learner profile it maintains via a new `learn` tool. Levels off/low/medium/high are set via `--teach` / `/teach`; profile inspection is via `/profile`. `tests/teach-bench/` benchmarks aside quality over an 8-task corpus with an LLM judge; `experiments.md` logs a 6-run prompt-tuning study (baseline retained; composite is currently bounded by model quality and benchmark noise, not prompt wording).

Details
- …`agent_teach_prompt`), teaching levels with parse/name helpers, `<teach>` aside rendering, `learn` tool with persistent learner profile (read/append/consolidate/update-on-exit), per-session level + profile injected as a system note so the rendered sysprompt (and its KV checkpoint) stays byte-identical across levels.
- …bench/run/eval/rate/report/history/selftest), trace-based aside extraction, OpenAI judge scoring five dimensions against the mentor contract, per-run teaching-prompt hashing so scores tie to prompt versions.

Testing
`make ds4-agent` clean build; `./ds4_agent_test` passes on top of latest main (rebased on d881f2a).

🤖 Generated with Claude Code
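For reviewers who want a feel for the two benchmark bookkeeping steps mentioned in the description (trace-based aside extraction and per-run teaching-prompt hashing), here is a rough sketch. It is not taken from `teachbench.py`. The closing `</teach>` tag, the choice of SHA-256, and the 12-character truncation are assumptions; the truncation length was picked only because it matches the shape of IDs such as `8eb831fb0183`.

```python
import hashlib
import re

# Assumption: asides are delimited as <teach>...</teach> in the trace text.
# Non-greedy matching keeps several asides in one trace separate;
# DOTALL lets a single aside span multiple lines.
TEACH_RE = re.compile(r"<teach>(.*?)</teach>", re.DOTALL)


def extract_asides(trace_text: str) -> list[str]:
    """Return every teaching aside found in a trace, in order of appearance."""
    return [body.strip() for body in TEACH_RE.findall(trace_text)]


def prompt_hash(prompt: str, length: int = 12) -> str:
    """Short, stable ID for a prompt version (hypothetical scheme: truncated SHA-256)."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:length]
```

Recording an ID like this next to each run's scores is what lets a later report say which prompt text produced which composite, even after the prompt has been edited several times.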