deep-spec

deep-spec is an autoresearch-inspired evolutionary hill-climbing project for producing a faithful /deep-spec interview skill.

The target is frozen in targets/symphony/. The thing being improved is the mutable skill surface in skill/deep-spec/SKILL.md. The point of the repo is not a single spec-generation run. The point is the live loop:

- generate a benchmark episode from frozen Symphony assets
- mutate the current /deep-spec skill
- run the mutated skill through a realistic interview session
- score the visible transcript and final SPEC.md
- keep or revert the mutation
- repeat until the working skill climbs
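
The steps above amount to a keep-or-revert hill climb. The following is a minimal sketch of that shape, not the repository's actual `research_loop`: the function name `hill_climb` and the `mutate` and `score` callables are hypothetical stand-ins for the mutation proposer and for the interview-plus-evaluation step.

```python
from typing import Callable


def hill_climb(
    skill: str,
    mutate: Callable[[str], str],
    score: Callable[[str], float],
    attempts: int,
) -> tuple[str, float]:
    """Keep a mutation only when it scores strictly higher than the working skill.

    `mutate` and `score` are hypothetical stand-ins, not the project's API.
    """
    best_score = score(skill)  # control score for the current working skill
    for _ in range(attempts):
        candidate = mutate(skill)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            # Keep: the candidate becomes the new working skill.
            skill, best_score = candidate, candidate_score
        # Otherwise revert: the working skill is left unchanged.
    return skill, best_score


if __name__ == "__main__":
    # Toy stand-ins: a "skill" is a string, and a length of 5 scores best.
    print(hill_climb("a", lambda s: s + "a", lambda s: -abs(len(s) - 5), attempts=20))
```

In the toy run the skill grows while each mutation improves the score and stops changing once further mutations score worse, which is the behaviour the keep-or-revert step is meant to guarantee.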

What This Project Is Training

The project is trying to make the mutable /deep-spec skill behave like a strong human-facing deep-spec interviewer:

- ask useful clarification questions
- reduce real implementation uncertainty
- avoid leading or repetitive questioning
- stop at the right time
- author a faithful, implementation-ready final SPEC.md

Symphony is the frozen target spec and evaluator source. It gives the loop a stable benchmark so the hill climb is about improving the skill, not moving the target.

Loop Overview

flowchart LR
    W["Current working SKILL.md"] --> M["mutation_proposer LLM"]
    S["seed_prompt_generator LLM"] --> E["Train Episode"]
    E --> M
    M --> C["Candidate SKILL.md"]
    C --> I["deep_spec_interviewer LLM session"]
    E --> I
    F["Hidden Symphony truth"] --> U["simulator_strict or simulator_terse LLM session"]
    I <--> U
    I --> SPEC["Final SPEC.md"]
    SPEC --> EV["Deterministic transcript + spec evaluators"]
    CC["Cached control score for current working skill"] --> EV
    EV --> K{"Keep mutation?"}
    K -->|yes| NW["Advance working SKILL.md lineage"]
    K -->|no| RW["Revert to current working skill"]
    NW --> M
    RW --> M

The hot path is candidate-only. A fresh control execution only happens when the control cache is cold for the current working skill and evaluation bundle.
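
One way to picture the control cache is a memo keyed on the working skill and the evaluation bundle. The sketch below is an assumption about the shape, not the project's implementation: `ControlCache`, `bundle_id`, and `run_control` are hypothetical names, and hashing the skill text with SHA-256 is just one plausible way to identify a skill version.

```python
import hashlib
from typing import Callable


class ControlCache:
    """Memoize the working skill's control score per (skill, bundle) pair."""

    def __init__(self) -> None:
        self._scores: dict[tuple[str, str], float] = {}
        self.fresh_runs = 0  # number of control executions actually performed

    @staticmethod
    def _digest(text: str) -> str:
        # Assumption: a skill version is identified by a hash of its text.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def control_score(
        self,
        skill_text: str,
        bundle_id: str,
        run_control: Callable[[str, str], float],
    ) -> float:
        key = (self._digest(skill_text), bundle_id)
        if key not in self._scores:
            # Cold cache: run the control once for this skill and bundle.
            self._scores[key] = run_control(skill_text, bundle_id)
            self.fresh_runs += 1
        return self._scores[key]
```

With this shape, repeated candidate attempts against the same working skill and bundle reuse one stored score, and the cache goes cold again whenever either the skill text or the bundle changes.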

LLM Roles

All configured model roles currently point at gpt-5.4, but they do different jobs and can use different reasoning settings. See harness/config/models.yaml.

| Role | Main path? | What it does |
| --- | --- | --- |
| `seed_prompt_generator` | yes | Turns frozen episode assets into one train seed prompt. |
| `mutation_proposer` | yes | Rewrites `SKILL.md` to propose the next candidate behavior. |
| `deep_spec_interviewer` | yes | Runs the interview, asks batches, decides when to stop, and authors the final `SPEC.md`. |
| `simulator_strict` / `simulator_terse` | yes | Plays the hidden user side of the interview using frozen Symphony truth plus persona style. |
| `audit_judge` | optional | Postmortem / debugging helper, not part of the core keep-revert loop. |

Important non-LLM components:

- transcript evaluator: deterministic, scores visible interview quality
- final-spec evaluator: deterministic, scores the authored SPEC.md against visible commitments
- control cache: stores the current working skill's score on a bundle so every local iteration does not need a fresh baseline replay

LLM Interaction In One Attempt

sequenceDiagram
    participant Research as "research_loop"
    participant Seed as "seed_prompt_generator"
    participant Mut as "mutation_proposer"
    participant Int as "deep_spec_interviewer"
    participant Sim as "simulator_strict / simulator_terse"
    participant Eval as "deterministic evaluators"

    Research->>Seed: frozen brief + episode inputs
    Seed-->>Research: seed prompt
    Research->>Mut: current SKILL.md + proposer packet
    Mut-->>Research: candidate SKILL.md
    Research->>Int: bootstrap with seed prompt + candidate skill
    loop Interview turns
        Int->>Sim: ask-user-questions batch
        Sim-->>Int: plain reply text
    end
    Int-->>Research: final SPEC.md
    Research->>Eval: visible transcript + final SPEC.md + cached control score
    Eval-->>Research: keep or revert

Who Sees What

deep_spec_interviewer sees:
- the mutable skill
- the visible transcript
- the seed prompt
- runtime budget / contract information

deep_spec_interviewer does not see:
- hidden evaluator targets
- hidden Symphony truth

simulator sees:
- hidden Symphony truth relevant to the session
- persona style
- the visible batch it is answering

deterministic evaluators see:
- the visible transcript
- the final SPEC.md
- cached control scores
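
The visibility rules above can be read as three projections of one session state. The sketch below is illustrative only: `SessionState`, its field names, and the three view functions are hypothetical, and treating the last transcript entry as "the visible batch it is answering" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionState:
    # All field names here are hypothetical, chosen to mirror the lists above.
    skill: str
    seed_prompt: str
    transcript: tuple[str, ...]
    budget: str
    hidden_truth: str
    persona: str
    final_spec: str
    control_score: float


def interviewer_view(s: SessionState) -> dict[str, object]:
    """What the interviewer may read: no hidden truth, no evaluator targets."""
    return {
        "skill": s.skill,
        "transcript": s.transcript,
        "seed_prompt": s.seed_prompt,
        "budget": s.budget,
    }


def simulator_view(s: SessionState) -> dict[str, object]:
    """What the simulated user may read, including the batch being answered."""
    current_batch = s.transcript[-1] if s.transcript else ""
    return {
        "hidden_truth": s.hidden_truth,
        "persona": s.persona,
        "batch": current_batch,
    }


def evaluator_view(s: SessionState) -> dict[str, object]:
    """What the deterministic evaluators may read."""
    return {
        "transcript": s.transcript,
        "final_spec": s.final_spec,
        "control_score": s.control_score,
    }
```

Building each role's input from an explicit projection like this makes leaks easy to test for: a check that `"hidden_truth"` never appears in the interviewer's view is a one-line assertion.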

Why The Loop Matters

This repository is about repeated search, not a one-shot assistant demo.

The hill climb works because:

- the target stays frozen
- the mutable search surface stays narrow
- every accepted mutation has to clear deterministic quality floors
- the working skill only advances when it beats the current control on the same bundle

That gives you a live research loop for training prompt-and-policy behavior toward a faithful deep-spec interview process.
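
The acceptance rule described above, clearing quality floors and beating the control on the same bundle, can be sketched as a single predicate. This is a hedged illustration, not the project's evaluator: the `Scores` type, the two floor constants and their values, and the choice to sum the two scores are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    transcript: float  # visible interview quality
    final_spec: float  # quality of the authored SPEC.md

    @property
    def total(self) -> float:
        # Assumption: the two scores are combined by simple addition.
        return self.transcript + self.final_spec


# Hypothetical floor values, chosen only for illustration.
TRANSCRIPT_FLOOR = 0.5
FINAL_SPEC_FLOOR = 0.5


def keep_mutation(candidate: Scores, control: Scores) -> bool:
    """Keep only if the candidate clears both floors and strictly beats the control."""
    clears_floors = (
        candidate.transcript >= TRANSCRIPT_FLOOR
        and candidate.final_spec >= FINAL_SPEC_FLOOR
    )
    return clears_floors and candidate.total > control.total
```

Requiring a strict improvement means ties revert, so the working skill's lineage only changes when a mutation demonstrably helps on the same bundle.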

Quickstart

Python

python -m venv .venv
source .venv/bin/activate
pip install -e '.[dev]'
./.venv/bin/python -m pytest -q

UI

pnpm install
pnpm --dir ui build
./.venv/bin/python harness/serve_ui.py

Common Commands

Fake local hill-climbing loop

./scripts/run_simulated_session.sh --accepted-target 3 --skip-promotion-checks

Fake loop with UI

./scripts/run_simulated_session.sh --serve-ui --accepted-target 3 --skip-promotion-checks

Real-LLM local hill-climbing loop

./scripts/run_simulated_session.sh --real-llm --accepted-target 3 --skip-promotion-checks

Full loop with promotion checks

./scripts/run_simulated_session.sh --real-llm --accepted-target 3

Single-iteration debug run

./scripts/run_simulated_session.sh --real-llm --accepted-target 1 --max-attempts 1 --skip-promotion-checks

Every --real-llm run expects a working local Codex CLI/auth setup.

Run the UI only

./.venv/bin/python harness/serve_ui.py

Skill lint and benchmark preflight

./.venv/bin/python harness/lint_skill.py --repo-root .
./.venv/bin/python harness/freeze_gate.py

Generated State

These paths are local caches and should not be committed:

- runs/: per-run visible and audit artifacts
- history/: SQLite history, temp model-call traces, and session state
- results.tsv: optional export generated on demand, not part of the live loop

If you want a clean slate between experiments, clear both runs/ and history/.

Repository Layout

- deep_spec/: runtime, evaluators, history, and UI backend
- harness/: CLI entrypoints and config
- skill/deep-spec/: mutable skill surface
- targets/symphony/: frozen benchmark assets
- tests/: unit and integration coverage
- docs/: architecture and workflow docs
- ui/: React monitor frontend

Documentation Guide

Contributing

Please read:

If you change runtime behavior, evaluator logic, artifact shape, or operator workflows, update the matching docs in docs/ in the same change.

License

This project is licensed under the MIT License. See LICENSE.
