simke9445/deep-spec

deep-spec

deep-spec is an autoresearch-inspired evolutionary hill-climbing project for producing a faithful /deep-spec interview skill.

The target is frozen in targets/symphony/. The thing being improved is the mutable skill surface in skill/deep-spec/SKILL.md. The point of the repo is not a single spec-generation run. The point is the live loop:

  • generate a benchmark episode from frozen Symphony assets
  • mutate the current /deep-spec skill
  • run the mutated skill through a realistic interview session
  • score the visible transcript and final SPEC.md
  • keep or revert the mutation
  • repeat until the working skill climbs
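The steps above amount to a keep-or-revert hill climb. The following is a minimal sketch of that control flow only, not the repository's actual research loop: `hill_climb`, `mutate`, and `score` are hypothetical names, and `score` stands in for the whole interview-plus-evaluation step. The `accepted_target` and `max_attempts` parameters are named after the CLI flags shown later in this README, but their exact semantics here are an assumption.

```python
from typing import Callable


def hill_climb(
    skill: str,
    mutate: Callable[[str], str],
    score: Callable[[str], float],
    accepted_target: int,
    max_attempts: int,
) -> tuple[str, int]:
    """Keep-or-revert loop: the working skill advances only on improvement.

    All names are illustrative assumptions, not the repo's real API.
    """
    control_score = score(skill)  # baseline for the current working skill
    accepted = 0
    for _ in range(max_attempts):
        if accepted >= accepted_target:
            break
        candidate = mutate(skill)  # propose a mutated skill
        candidate_score = score(candidate)  # stands in for interview + scoring
        if candidate_score > control_score:
            # Keep: the candidate becomes the new working skill and control.
            skill, control_score = candidate, candidate_score
            accepted += 1
        # Otherwise revert: the working skill is left unchanged.
    return skill, accepted
```

With a toy mutation that appends a character and a score equal to the text length, `hill_climb("a", lambda s: s + "x", len, accepted_target=3, max_attempts=10)` stops after three accepted mutations.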

What This Project Is Training

The project is trying to make the mutable /deep-spec skill behave like a strong human-facing deep-spec interviewer:

  • ask useful clarification questions
  • reduce real implementation uncertainty
  • avoid leading or repetitive questioning
  • stop at the right time
  • author a faithful, implementation-ready final SPEC.md

Symphony is the frozen target spec and evaluator source. It gives the loop a stable benchmark so the hill climb is about improving the skill, not moving the target.

Loop Overview

flowchart LR
    W["Current working SKILL.md"] --> M["mutation_proposer LLM"]
    S["seed_prompt_generator LLM"] --> E["Train Episode"]
    E --> M
    M --> C["Candidate SKILL.md"]
    C --> I["deep_spec_interviewer LLM session"]
    E --> I
    F["Hidden Symphony truth"] --> U["simulator_strict or simulator_terse LLM session"]
    I <--> U
    I --> SPEC["Final SPEC.md"]
    SPEC --> EV["Deterministic transcript + spec evaluators"]
    CC["Cached control score for current working skill"] --> EV
    EV --> K{"Keep mutation?"}
    K -->|yes| NW["Advance working SKILL.md lineage"]
    K -->|no| RW["Revert to current working skill"]
    NW --> M
    RW --> M

The hot path is candidate-only. A fresh control execution only happens when the control cache is cold for the current working skill and evaluation bundle.
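One way to picture the cold-cache rule is a score cache keyed by the working skill's content and the evaluation bundle. This is a sketch under stated assumptions: `ControlCache`, `get_or_run`, `bundle_id`, and the SHA-256 keying are all hypothetical and are not taken from the harness code.

```python
import hashlib
from typing import Callable


class ControlCache:
    """Illustrative in-memory cache of control scores (hypothetical design)."""

    def __init__(self) -> None:
        self._scores: dict[tuple[str, str], float] = {}
        self.replays = 0  # counts how many fresh control executions ran

    @staticmethod
    def _key(skill_text: str, bundle_id: str) -> tuple[str, str]:
        # Assumption: the skill is identified by a hash of its full text.
        digest = hashlib.sha256(skill_text.encode("utf-8")).hexdigest()
        return digest, bundle_id

    def get_or_run(
        self,
        skill_text: str,
        bundle_id: str,
        run_control: Callable[[str, str], float],
    ) -> float:
        key = self._key(skill_text, bundle_id)
        if key not in self._scores:
            # Cold cache: run the control once for this skill and bundle.
            self.replays += 1
            self._scores[key] = run_control(skill_text, bundle_id)
        return self._scores[key]
```

Under this design, repeated attempts against the same working skill and bundle reuse the stored score, and a fresh control run happens only after the working skill advances or the bundle changes.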

LLM Roles

All configured model roles currently point at gpt-5.4, but each role does a different job and can use different reasoning settings. See harness/config/models.yaml.

Role | Main path? | What it does
seed_prompt_generator | yes | Turns frozen episode assets into one train seed prompt.
mutation_proposer | yes | Rewrites SKILL.md to propose the next candidate behavior.
deep_spec_interviewer | yes | Runs the interview, asks batches, decides when to stop, and authors the final SPEC.md.
simulator_strict / simulator_terse | yes | Plays the hidden user side of the interview using frozen Symphony truth plus persona style.
audit_judge | optional | Postmortem / debugging helper, not part of the core keep-revert loop.

Important non-LLM components:

  • transcript evaluator: deterministic, scores visible interview quality
  • final-spec evaluator: deterministic, scores the authored SPEC.md against visible commitments
  • control cache: stores the current working skill's score on a bundle so every local iteration does not need a fresh baseline replay

LLM Interaction In One Attempt

sequenceDiagram
    participant Research as "research_loop"
    participant Seed as "seed_prompt_generator"
    participant Mut as "mutation_proposer"
    participant Int as "deep_spec_interviewer"
    participant Sim as "simulator_strict / simulator_terse"
    participant Eval as "deterministic evaluators"

    Research->>Seed: frozen brief + episode inputs
    Seed-->>Research: seed prompt
    Research->>Mut: current SKILL.md + proposer packet
    Mut-->>Research: candidate SKILL.md
    Research->>Int: bootstrap with seed prompt + candidate skill
    loop Interview turns
        Int->>Sim: ask-user-questions batch
        Sim-->>Int: plain reply text
    end
    Int-->>Research: final SPEC.md
    Research->>Eval: visible transcript + final SPEC.md + cached control score
    Eval-->>Research: keep or revert

Who Sees What

  • deep_spec_interviewer sees:
    • the mutable skill
    • the visible transcript
    • the seed prompt
    • runtime budget / contract information
  • deep_spec_interviewer does not see:
    • hidden evaluator targets
    • hidden Symphony truth
  • simulator sees:
    • hidden Symphony truth relevant to the session
    • persona style
    • the visible batch it is answering
  • deterministic evaluators see:
    • the visible transcript
    • the final SPEC.md
    • cached control scores
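The visibility rules above can be modeled as a per-role field allowlist. This is a hedged sketch only: `SessionState`, `VISIBLE_FIELDS`, and `view_for` are hypothetical names, and the field names are simplified labels for the items in the list, not the harness's real data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionState:
    """Everything that exists in one session (illustrative fields only)."""

    skill: str
    transcript: list[str]
    seed_prompt: str
    runtime_budget: str
    hidden_truth: str
    persona_style: str
    current_batch: str
    final_spec: str
    cached_control_score: float


# Allowlist per role, mirroring the "Who Sees What" list.
VISIBLE_FIELDS: dict[str, tuple[str, ...]] = {
    "deep_spec_interviewer": ("skill", "transcript", "seed_prompt", "runtime_budget"),
    "simulator": ("hidden_truth", "persona_style", "current_batch"),
    "evaluators": ("transcript", "final_spec", "cached_control_score"),
}


def view_for(role: str, state: SessionState) -> dict[str, object]:
    """Return only the fields the given role is allowed to see."""
    return {name: getattr(state, name) for name in VISIBLE_FIELDS[role]}
```

An allowlist keeps the default closed: a field added to the session state stays hidden from every role until it is listed explicitly.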

Why The Loop Matters

This repository is about repeated search, not a one-shot assistant demo.

The hill climb works because:

  • the target stays frozen
  • the mutable search surface stays narrow
  • every accepted mutation has to clear deterministic quality floors
  • the working skill only advances when it beats the current control on the same bundle

That gives you a live research loop for training prompt-and-policy behavior toward a faithful deep-spec interview process.
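The last two conditions in the list above can be read as a two-part acceptance check: clear the quality floors, then beat the control. The sketch below assumes a particular shape for that check. `Scores`, `keep_mutation`, the floor parameters, and the choice to compare summed scores are all hypothetical; the real scoring rules are described in docs/07-evaluation-scoring-and-ablations.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """Illustrative pair of evaluator scores for one run on one bundle."""

    transcript: float
    final_spec: float

    @property
    def total(self) -> float:
        # Assumption: the two scores are combined by simple addition.
        return self.transcript + self.final_spec


def keep_mutation(
    candidate: Scores,
    control: Scores,
    transcript_floor: float,
    spec_floor: float,
) -> bool:
    """Keep only if the candidate clears both floors and beats the control."""
    clears_floors = (
        candidate.transcript >= transcript_floor
        and candidate.final_spec >= spec_floor
    )
    return clears_floors and candidate.total > control.total
```

Checking floors separately from the comparison means a candidate cannot buy a higher total by sacrificing one dimension below its minimum.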

Quickstart

Python

python -m venv .venv
source .venv/bin/activate
pip install -e '.[dev]'
./.venv/bin/python -m pytest -q

UI

pnpm install
pnpm --dir ui build
./.venv/bin/python harness/serve_ui.py

Common Commands

Fake local hill-climbing loop

./scripts/run_simulated_session.sh --accepted-target 3 --skip-promotion-checks

Fake loop with UI

./scripts/run_simulated_session.sh --serve-ui --accepted-target 3 --skip-promotion-checks

Real-LLM local hill-climbing loop

./scripts/run_simulated_session.sh --real-llm --accepted-target 3 --skip-promotion-checks

Full loop with promotion checks

./scripts/run_simulated_session.sh --real-llm --accepted-target 3

Single-iteration debug run

./scripts/run_simulated_session.sh --real-llm --accepted-target 1 --max-attempts 1 --skip-promotion-checks

Runs that pass --real-llm expect a working local Codex CLI/auth setup.

Run the UI only

./.venv/bin/python harness/serve_ui.py

Skill lint and benchmark preflight

./.venv/bin/python harness/lint_skill.py --repo-root .
./.venv/bin/python harness/freeze_gate.py

Generated State

These paths are local caches and should not be committed:

  • runs/: per-run visible and audit artifacts
  • history/: SQLite history, temp model-call traces, and session state
  • results.tsv: optional export generated on demand, not part of the live loop

If you want a clean slate between experiments, clear both runs/ and history/.
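A small helper can do the cleanup. This is a sketch, not a script shipped with the repository: `clear_generated_state` is a hypothetical name, and it assumes runs/ and history/ sit directly under the repo root. It permanently deletes both directories, so export anything you need first.

```python
import shutil
from pathlib import Path


def clear_generated_state(repo_root: Path) -> list[str]:
    """Delete the local runs/ and history/ caches under repo_root.

    Returns the names of the directories that were actually removed.
    """
    removed = []
    for name in ("runs", "history"):
        target = repo_root / name
        if target.is_dir():
            shutil.rmtree(target)  # irreversible: removes the whole tree
            removed.append(name)
    return removed
```

Calling `clear_generated_state(Path("."))` from the repo root returns the list of directories it removed, or an empty list if neither existed.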

Repository Layout

Documentation Guide

Recommended reading order:

  1. docs/01-system-overview.md
  2. docs/03-episodes-and-seed-prompts.md
  3. docs/04-interview-runtime-and-session-messages.md
  4. docs/07-evaluation-scoring-and-ablations.md
  5. docs/08-evolution-loop-and-mutation-search.md
  6. docs/09-artifacts-history-and-visibility.md
  7. docs/10-monitor-ui-and-live-observability.md
  8. docs/11-cli-and-developer-workflows.md

Contributing

Please read:

If you change runtime behavior, evaluator logic, artifact shape, or operator workflows, update the matching docs in docs/ in the same change.

License

This project is licensed under the MIT License. See LICENSE.
