deep-spec is an autoresearch-inspired evolutionary hill-climbing project for producing a faithful /deep-spec interview skill.
The target is frozen in targets/symphony/. The thing being improved is the mutable skill surface in skill/deep-spec/SKILL.md. The point of the repo is not a single spec-generation run. The point is the live loop:
- generate a benchmark episode from frozen Symphony assets
- mutate the current /deep-spec skill
- run the mutated skill through a realistic interview session
- score the visible transcript and final SPEC.md
- keep or revert the mutation
- repeat until the working skill climbs
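The loop above can be sketched as a plain hill climb. This is only an illustration under stated assumptions: `hill_climb`, `propose`, and `score` are hypothetical names that stand in for the mutation proposer and for the interview-plus-evaluation step, and the `accepted_target` / `max_attempts` parameters borrow their names from the CLI flags without claiming to match the harness semantics.

```python
import itertools
from typing import Callable


def hill_climb(
    working_skill: str,
    propose: Callable[[str], str],    # hypothetical stand-in for mutation_proposer
    score: Callable[[str], float],    # hypothetical stand-in for interview + evaluators
    accepted_target: int,
    max_attempts: int,
) -> tuple[str, int]:
    """Keep a candidate only when it beats the current working skill."""
    control_score = score(working_skill)
    accepted = 0
    for _ in range(max_attempts):
        if accepted >= accepted_target:
            break
        candidate = propose(working_skill)
        candidate_score = score(candidate)
        if candidate_score > control_score:
            # Keep: the candidate becomes the new working skill and control.
            working_skill, control_score = candidate, candidate_score
            accepted += 1
        # Otherwise revert: working_skill is simply left unchanged.
    return working_skill, accepted


# Toy run: the "skill" is a string, a mutation appends one character,
# and the score counts how many "q" characters the skill contains.
mutations = itertools.cycle(["x", "q"])
final_skill, accepted = hill_climb(
    working_skill="",
    propose=lambda skill: skill + next(mutations),
    score=lambda skill: skill.count("q"),
    accepted_target=2,
    max_attempts=10,
)
```

In the toy run, mutations that append `x` do not raise the score and are reverted, while mutations that append `q` are kept, so the run stops after two accepted mutations.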
The project is trying to make the mutable /deep-spec skill behave like a strong human-facing deep-spec interviewer:
- ask useful clarification questions
- reduce real implementation uncertainty
- avoid leading or repetitive questioning
- stop at the right time
- author a faithful, implementation-ready final SPEC.md
Symphony is the frozen target spec and evaluator source. It gives the loop a stable benchmark so the hill climb is about improving the skill, not moving the target.
flowchart LR
W["Current working SKILL.md"] --> M["mutation_proposer LLM"]
S["seed_prompt_generator LLM"] --> E["Train Episode"]
E --> M
M --> C["Candidate SKILL.md"]
C --> I["deep_spec_interviewer LLM session"]
E --> I
F["Hidden Symphony truth"] --> U["simulator_strict or simulator_terse LLM session"]
I <--> U
I --> SPEC["Final SPEC.md"]
SPEC --> EV["Deterministic transcript + spec evaluators"]
CC["Cached control score for current working skill"] --> EV
EV --> K{"Keep mutation?"}
K -->|yes| NW["Advance working SKILL.md lineage"]
K -->|no| RW["Revert to current working skill"]
NW --> M
RW --> M
The hot path is candidate-only. A fresh control execution only happens when the control cache is cold for the current working skill and evaluation bundle.
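One way to picture the cold-cache rule is a small cache keyed by the working skill's content and the evaluation bundle. The sketch below is an assumption about the shape of such a cache, not the repository's implementation: `ControlCache`, `get_or_run`, and the bundle id strings are hypothetical.

```python
import hashlib
from typing import Callable


class ControlCache:
    """Hypothetical cache: (skill content hash, bundle id) -> control score."""

    def __init__(self) -> None:
        self._scores: dict[tuple[str, str], float] = {}
        self.control_runs = 0  # counts fresh control executions

    @staticmethod
    def _key(skill_text: str, bundle_id: str) -> tuple[str, str]:
        digest = hashlib.sha256(skill_text.encode("utf-8")).hexdigest()
        return digest, bundle_id

    def get_or_run(
        self,
        skill_text: str,
        bundle_id: str,
        run_control: Callable[[str, str], float],
    ) -> float:
        key = self._key(skill_text, bundle_id)
        if key not in self._scores:
            # Cold cache: this is the only case that triggers a control run.
            self.control_runs += 1
            self._scores[key] = run_control(skill_text, bundle_id)
        return self._scores[key]


cache = ControlCache()
fake_control = lambda skill_text, bundle_id: 0.5  # placeholder scorer

first = cache.get_or_run("skill v1", "bundle-a", fake_control)   # cold: runs control
second = cache.get_or_run("skill v1", "bundle-a", fake_control)  # warm: no run
third = cache.get_or_run("skill v2", "bundle-a", fake_control)   # new skill: cold again
```

Keying on a content hash means the cache goes cold whenever an accepted mutation changes the working skill, which matches the behavior described above.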
All configured model roles currently point at gpt-5.4, but each role does a different job and can use different reasoning settings. See harness/config/models.yaml.
| Role | Main path? | What it does |
|---|---|---|
| seed_prompt_generator | yes | Turns frozen episode assets into one train seed prompt. |
| mutation_proposer | yes | Rewrites SKILL.md to propose the next candidate behavior. |
| deep_spec_interviewer | yes | Runs the interview, asks batches, decides when to stop, and authors the final SPEC.md. |
| simulator_strict / simulator_terse | yes | Plays the hidden user side of the interview using frozen Symphony truth plus persona style. |
| audit_judge | optional | Postmortem / debugging helper, not part of the core keep-revert loop. |
Important non-LLM components:
- transcript evaluator: deterministic, scores visible interview quality
- final-spec evaluator: deterministic, scores the authored SPEC.md against visible commitments
- control cache: stores the current working skill's score on a bundle so every local iteration does not need a fresh baseline replay
sequenceDiagram
participant Research as "research_loop"
participant Seed as "seed_prompt_generator"
participant Mut as "mutation_proposer"
participant Int as "deep_spec_interviewer"
participant Sim as "simulator_strict / simulator_terse"
participant Eval as "deterministic evaluators"
Research->>Seed: frozen brief + episode inputs
Seed-->>Research: seed prompt
Research->>Mut: current SKILL.md + proposer packet
Mut-->>Research: candidate SKILL.md
Research->>Int: bootstrap with seed prompt + candidate skill
loop Interview turns
Int->>Sim: ask-user-questions batch
Sim-->>Int: plain reply text
end
Int-->>Research: final SPEC.md
Research->>Eval: visible transcript + final SPEC.md + cached control score
Eval-->>Research: keep or revert
- deep_spec_interviewer sees:
  - the mutable skill
  - the visible transcript
  - the seed prompt
  - runtime budget / contract information
- deep_spec_interviewer does not see:
  - hidden evaluator targets
  - hidden Symphony truth
- simulator sees:
  - hidden Symphony truth relevant to the session
  - persona style
  - the visible batch it is answering
- deterministic evaluators see:
  - the visible transcript
  - the final SPEC.md
  - cached control scores
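These visibility rules can be expressed as a per-role allowlist over the session state. The sketch below is illustrative only: `SessionState`, `VISIBLE_FIELDS`, `view_for`, and every field name are hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionState:
    """Hypothetical container for everything that exists during one episode."""
    skill_text: str
    seed_prompt: str
    visible_transcript: tuple[str, ...]
    runtime_budget: str
    current_batch: str
    hidden_truth: str
    persona_style: str
    evaluator_targets: str  # listed above only as hidden from the interviewer
    final_spec: str
    cached_control_score: float


# Allowlist per role, mirroring the lists above.
VISIBLE_FIELDS: dict[str, tuple[str, ...]] = {
    "deep_spec_interviewer": (
        "skill_text", "visible_transcript", "seed_prompt", "runtime_budget",
    ),
    "simulator": ("hidden_truth", "persona_style", "current_batch"),
    "evaluators": ("visible_transcript", "final_spec", "cached_control_score"),
}


def view_for(role: str, state: SessionState) -> dict[str, object]:
    """Return only the fields the given role is allowed to see."""
    return {name: getattr(state, name) for name in VISIBLE_FIELDS[role]}


state = SessionState(
    skill_text="(skill body)",
    seed_prompt="(seed prompt)",
    visible_transcript=("Q1", "A1"),
    runtime_budget="(budget)",
    current_batch="Q2",
    hidden_truth="(hidden)",
    persona_style="terse",
    evaluator_targets="(hidden)",
    final_spec="(spec)",
    cached_control_score=0.5,
)
interviewer_view = view_for("deep_spec_interviewer", state)
evaluator_view = view_for("evaluators", state)
```

An allowlist fails closed: a field added to the state later stays invisible to every role until it is explicitly granted.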
This repository is about repeated search, not a one-shot assistant demo.
The hill climb works because:
- the target stays frozen
- the mutable search surface stays narrow
- every accepted mutation has to clear deterministic quality floors
- the working skill only advances when it beats the current control on the same bundle
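The last two conditions combine into a single acceptance rule. The following is a minimal sketch, assuming two score components, a simple sum as the aggregate, and floor values of 0.5; none of these specifics come from the repository, and `EpisodeScore` and `keep_mutation` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeScore:
    """Hypothetical score pair from the two deterministic evaluators."""
    transcript: float
    final_spec: float

    @property
    def total(self) -> float:
        # Assumed aggregate: a plain sum of the two components.
        return self.transcript + self.final_spec


def keep_mutation(
    candidate: EpisodeScore,
    control: EpisodeScore,
    transcript_floor: float = 0.5,  # assumed floor value
    spec_floor: float = 0.5,        # assumed floor value
) -> bool:
    """Accept only if the candidate clears both floors and beats the control."""
    clears_floors = (
        candidate.transcript >= transcript_floor
        and candidate.final_spec >= spec_floor
    )
    return clears_floors and candidate.total > control.total


control = EpisodeScore(transcript=0.6, final_spec=0.6)

# Higher total than control, but the spec score is under its floor: rejected.
lopsided = keep_mutation(EpisodeScore(transcript=0.9, final_spec=0.4), control)
# Clears both floors and beats the control total: accepted.
improved = keep_mutation(EpisodeScore(transcript=0.7, final_spec=0.6), control)
# Ties the control: rejected, because the comparison is strict.
tied = keep_mutation(EpisodeScore(transcript=0.6, final_spec=0.6), control)
```

Checking floors separately from the comparison keeps a candidate from winning by trading away one quality dimension for another.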
That gives you a live research loop for training prompt-and-policy behavior toward a faithful deep-spec interview process.
python -m venv .venv
source .venv/bin/activate
pip install -e '.[dev]'
./.venv/bin/python -m pytest -q
pnpm install
pnpm --dir ui build
./.venv/bin/python harness/serve_ui.py
./scripts/run_simulated_session.sh --accepted-target 3 --skip-promotion-checks
./scripts/run_simulated_session.sh --serve-ui --accepted-target 3 --skip-promotion-checks
./scripts/run_simulated_session.sh --real-llm --accepted-target 3 --skip-promotion-checks
./scripts/run_simulated_session.sh --real-llm --accepted-target 3
./scripts/run_simulated_session.sh --real-llm --accepted-target 1 --max-attempts 1 --skip-promotion-checks
That live path expects a working local Codex CLI/auth setup.
./.venv/bin/python harness/serve_ui.py
./.venv/bin/python harness/lint_skill.py --repo-root .
./.venv/bin/python harness/freeze_gate.py
These paths are local caches and should not be committed:
- runs/: per-run visible and audit artifacts
- history/: SQLite history, temp model-call traces, and session state
- results.tsv: optional export generated on demand, not part of the live loop
If you want a clean slate between experiments, clear both runs/ and history/.
- deep_spec/: runtime, evaluators, history, and UI backend
- harness/: CLI entrypoints and config
- skill/deep-spec/: mutable skill surface
- targets/symphony/: frozen benchmark assets
- tests/: unit and integration coverage
- docs/: architecture and workflow docs
- ui/: React monitor frontend
Recommended reading order:
- docs/01-system-overview.md
- docs/03-episodes-and-seed-prompts.md
- docs/04-interview-runtime-and-session-messages.md
- docs/07-evaluation-scoring-and-ablations.md
- docs/08-evolution-loop-and-mutation-search.md
- docs/09-artifacts-history-and-visibility.md
- docs/10-monitor-ui-and-live-observability.md
- docs/11-cli-and-developer-workflows.md
Please read:
If you change runtime behavior, evaluator logic, artifact shape, or operator workflows, update the matching docs in docs/ in the same change.
This project is licensed under the MIT License. See LICENSE.