An experimental framework for autonomous coding agent loops with persistent memory.
This repository demonstrates how to design systems where AI agents discover work, execute changes, verify results, and learn from failures — without manual prompting at each step.
Traditional AI coding assistance follows a simple pattern: you prompt, the agent responds, you prompt again. Loop engineering replaces you as the person who prompts the agent. Instead, you design the system that does the prompting.
This repo implements three autonomous loops:
- Audit Loop — An expensive model reviews the codebase and generates improvement plans
- Execution Loop — A cheaper model executes plans in isolated worktrees with TDD
- Documentation Loop — Keeps a living wiki synchronized with code changes
The loops are coordinated through:
- Skills — Reusable procedures (TDD, diagnosis, architecture improvement)
- Sub-agents — Separate maker/checker/goal-checker roles to prevent self-grading
- Safety layer — Andon (line-stop) + Kaizen (continuous improvement) from lean manufacturing
- Persistent memory — An LLM-maintained wiki that compounds knowledge across sessions
A loop is a recursive goal where the agent iterates until complete. The pieces (from Addy Osmani):
- Automations — Discovery and triage on a schedule
- Worktrees — Isolated directories so parallel agents don't collide
- Skills — Project knowledge the agent would otherwise guess
- Plugins/Connectors — Integration with external tools (MCP)
- Sub-agents — One agent has the idea, another checks it
- State — Markdown files that remember what's done and what's next
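As a rough illustration of "a recursive goal where the agent iterates until complete", the sketch below repeats a step function until a goal predicate holds or an iteration cap is reached. `runLoop`, `LoopState`, and the cap of 10 are hypothetical names and values chosen for this example; they are not part of this repo.

```typescript
// Hypothetical sketch: names and shapes are assumptions, not this repo's API.
interface LoopState {
  done: string[];    // what has been completed so far
  pending: string[]; // what is next
}

type LoopResult = { status: "complete" | "aborted"; iterations: number; state: LoopState };

function runLoop(
  initial: LoopState,
  step: (state: LoopState) => LoopState,
  goalMet: (state: LoopState) => boolean,
  maxIterations = 10,
): LoopResult {
  let state = initial;
  for (let i = 0; i < maxIterations; i++) {
    if (goalMet(state)) return { status: "complete", iterations: i, state };
    state = step(state);
  }
  // Cap reached: report the outcome instead of looping forever.
  return goalMet(state)
    ? { status: "complete", iterations: maxIterations, state }
    : { status: "aborted", iterations: maxIterations, state };
}

// Usage: each step moves one pending item to done.
const loopResult = runLoop(
  { done: [], pending: ["validate-price", "validate-amount"] },
  (s) => ({ done: [...s.done, ...s.pending.slice(0, 1)], pending: s.pending.slice(1) }),
  (s) => s.pending.length === 0,
);
console.log(loopResult.status, loopResult.iterations); // complete 2
```

In a real loop the state would live in markdown files on disk rather than in memory, so that a later session can pick up where the previous one stopped.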
Instead of re-deriving knowledge from raw documents on every query (RAG), the LLM incrementally builds and maintains a persistent wiki — a structured, interlinked collection of markdown files that sits between you and the raw sources.
The wiki is a persistent, compounding artifact. Cross-references are already there. Contradictions have already been flagged. The synthesis already reflects everything that's been read.
Three layers:
- Raw sources (immutable) — Code, tests, PRs, articles
- Wiki (LLM-maintained) — Compiled pages, linked, synthesized
- Schema (this repo) — Rules and conventions
Applied from Toyota Production System:
Jidoka (Autonomation) — When something fails, stop the line immediately:
- Block forward-progress (push, deploy, merge)
- Classify the failure (7 categories with confidence scores)
- Generate Five Whys analysis artifacts
- Do not continue until root cause is resolved
Kaizen (Continuous Improvement) — Every failure is standardized learning:
- Incident → Analysis → Prevention → Standard
- Prevention levels: L1 (poka-yoke) > L2 (auto-detect) > L3 (document) > L4 (alert)
- Meta-Andon: 3+ consecutive failures = mandatory Plan Mode
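The Kaizen rules above can be sketched as data plus two small functions: one picks the strongest feasible prevention level, the other counts the current failure streak that feeds Meta-Andon. The `KaizenIncident` shape and every field name are assumptions for illustration, and the 7 failure categories are not enumerated here because this document does not list them.

```typescript
// Hypothetical sketch: the incident shape and field names are assumptions.
type PreventionLevel = "L1" | "L2" | "L3" | "L4";

interface KaizenIncident {
  summary: string;
  category: string;   // one of the 7 categories (not enumerated in this sketch)
  confidence: number; // classifier confidence, 0..1
  fiveWhys: string[]; // successive answers to "why?"
  feasiblePreventions: PreventionLevel[];
}

// Earlier in the list = stronger: L1 (poka-yoke) > L2 > L3 > L4.
const PREVENTION_ORDER: PreventionLevel[] = ["L1", "L2", "L3", "L4"];

function strongestPrevention(incident: KaizenIncident): PreventionLevel | undefined {
  return PREVENTION_ORDER.find((level) => incident.feasiblePreventions.indexOf(level) !== -1);
}

// Counts failures at the end of the history; a pass resets the streak.
function consecutiveFailures(outcomes: Array<"pass" | "fail">): number {
  let streak = 0;
  for (let i = outcomes.length - 1; i >= 0 && outcomes[i] === "fail"; i--) streak++;
  return streak;
}

const incident: KaizenIncident = {
  summary: "placeholder incident",
  category: "placeholder-category",
  confidence: 0.8,
  fiveWhys: ["why 1", "why 2"],
  feasiblePreventions: ["L3", "L2"],
};
console.log(strongestPrevention(incident)); // L2
console.log(consecutiveFailures(["pass", "fail", "fail", "fail"]) >= 3); // true
```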
A model that wrote the code is too lenient when grading its own homework. This repo enforces three separate roles:
- Maker — Implements changes following skills and plans
- Checker — Verifies work against spec, tests, and standards (independent model)
- Goal-Checker — Evaluates whether the stopping condition is met (another independent model)
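One way to picture the separation is to model each role as an injected function, so the code that produces a change never produces its own verdict. This is a synchronous sketch with hypothetical names (`Roles`, `runChunk`); real agents would be asynchronous calls to separate models.

```typescript
// Hypothetical sketch: roles are plain functions here; names are assumptions.
interface Change { files: string[]; summary: string }
type CheckerVerdict = { decision: "APPROVE" } | { decision: "REJECT"; reasons: string[] };
type GoalStatus = "MET" | "NOT MET";

interface Roles {
  maker: (chunk: string) => Change;
  checker: (change: Change) => CheckerVerdict;
  goalChecker: () => GoalStatus;
}

function runChunk(chunk: string, roles: Roles): { verdict: CheckerVerdict; goal: GoalStatus } {
  const change = roles.maker(chunk);
  const verdict = roles.checker(change);
  // The goal is evaluated only after an independent checker has approved.
  const goal: GoalStatus = verdict.decision === "APPROVE" ? roles.goalChecker() : "NOT MET";
  return { verdict, goal };
}

// Usage with stub roles.
let goalChecks = 0;
const roles: Roles = {
  maker: (chunk) => ({ files: ["src/pricing.ts"], summary: `implemented ${chunk}` }),
  checker: (change) =>
    change.files.length > 0
      ? { decision: "APPROVE" }
      : { decision: "REJECT", reasons: ["no files changed"] },
  goalChecker: () => {
    goalChecks++;
    return "MET";
  },
};
const outcome = runChunk("validate price >= 0", roles);
console.log(outcome.verdict.decision, outcome.goal); // APPROVE MET
```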
Requirements are not silently relaxed:
- Tests are not modified to "pass"
- Spec changes require explicit approval
- Violation = line stop + rollback
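A conservative guard for the first rule could flag any change that modifies or deletes an existing test file. The sketch below assumes the change list has already been collected (for example from `git diff --name-status`) and that test files end in `.test.ts`; both are assumptions. A flagged file is a prompt for review rather than proof of drift, since legitimately moving tests between files would be flagged too.

```typescript
// Hypothetical sketch: input shape and the `.test.ts` convention are assumptions.
interface FileChange { path: string; status: "added" | "modified" | "deleted" }

function isTestFile(path: string): boolean {
  return /\.test\.ts$/.test(path);
}

function detectSpecDrift(changes: FileChange[]): string[] {
  // Adding new tests is fine; touching existing ones is flagged.
  return changes
    .filter((c) => isTestFile(c.path) && c.status !== "added")
    .map((c) => `${c.status}: ${c.path}`);
}

const drift = detectSpecDrift([
  { path: "src/pricing.ts", status: "modified" },
  { path: "src/pricing.test.ts", status: "modified" },
  { path: "src/currency.test.ts", status: "added" },
]);
console.log(drift); // [ 'modified: src/pricing.test.ts' ]
```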
The agent executes with purpose, not to pass checks:
- Understand the PURPOSE of the phase
- Define deliverables and expected quality
- Create deliverables comprehensively
- Self-assess quality
- Submit to gate (without having looked at conditions)
┌─────────────────────────────────────────────────────────────┐
│                    AUDIT LOOP (improve)                     │
│     Expensive model reviews codebase → generates plans      │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                       EXECUTION LOOP                        │
│  ┌────────────┐    ┌──────────┐    ┌────────────────┐       │
│  │   MAKER    │───▶│ CHECKER  │───▶│  GOAL-CHECKER  │       │
│  │(implements)│    │(verifies)│    │(evaluates goal)│       │
│  └────────────┘    └──────────┘    └────────────────┘       │
│         │                │                  │               │
│         └────────────────┴──────────────────┘               │
│                          │                                  │
│                          ▼                                  │
│                 ┌──────────────────┐                        │
│                 │   ANDON LAYER    │                        │
│                 │  - Line stop     │                        │
│                 │  - Five Whys     │                        │
│                 │  - Kaizen        │                        │
│                 │  - Meta-Andon    │                        │
│                 └──────────────────┘                        │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                     DOCUMENTATION LOOP                      │
│            Keeps wiki aligned with code changes             │
└─────────────────────────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                      LLM WIKI (Memory)                      │
│  ┌────────────┐  ┌──────────────┐  ┌────────────┐           │
│  │ 10_DOMAIN  │  │  20_SYSTEM   │  │   90_LOG   │           │
│  │ (concepts) │  │(architecture)│  │ (timeline) │           │
│  └────────────┘  └──────────────┘  └────────────┘           │
└─────────────────────────────────────────────────────────────┘
- Node.js 18+
- pnpm (or npm/yarn)
- Git
- An AI coding agent (Claude Code, OpenAI Codex, OpenCode, etc.)
# Clone the repository
git clone <repo-url>
cd loop-lab
# Install dependencies
pnpm install
# Run tests to verify setup
pnpm test
# Run type checking
pnpm typecheck

- Configure your agent in `.agents/` (Claude Code, Codex, etc.)
- Run an audit using the `improve-architecture` skill:
  # Tell your agent: "Run an audit using skills/improve-architecture.md and generate plans in plans/"
- Review generated plans in the `plans/` directory
- Execute a plan using the TDD skill:
  # Tell your agent: "Execute plans/2026-06-pricing-input-validation.md using skills/tdd.md"
- Review the diff and merge to main
- Check the wiki; it should be updated with new concepts and logs
loop-lab/
├── AGENTS.md                      # Operating principles and rules
├── README.md                      # This file
├── package.json                   # Dependencies and scripts
├── tsconfig.json                  # TypeScript configuration
│
├── src/                           # Source code (example)
│   ├── pricing.ts                 # Discount calculation module
│   ├── pricing.test.ts            # Tests for pricing
│   ├── currency.ts                # Currency formatting module
│   └── currency.test.ts           # Tests for currency
│
├── agents/                        # Agent configuration
│   ├── CONTEXT.md                 # Current state and rules
│   ├── LOOPS.md                   # Loop definitions
│   ├── GOALS.md                   # Success conditions
│   └── LANGUAGE.md                # Controlled vocabulary
│
├── skills/                        # Reusable procedures
│   ├── tdd.md                     # Test-driven development
│   ├── diagnose.md                # Bug diagnosis
│   ├── improve-architecture.md    # Codebase audit
│   └── andon.md                   # Safety layer (Jidoka + Kaizen)
│
├── plans/                         # Generated improvement plans
│   ├── TEMPLATE.md                # Plan template
│   ├── 2026-06-pricing-input-validation.md
│   └── 2026-06-separate-currency-module.md
│
├── wiki/                          # LLM-maintained knowledge base
│   ├── 00_SCHEMA.md               # Wiki conventions
│   ├── index.md                   # Page catalog
│   ├── log.md                     # Chronological timeline
│   ├── 10_DOMAIN/                 # Business domain concepts
│   │   ├── pricing-module.md
│   │   └── currency-module.md
│   ├── 20_SYSTEM/                 # Architecture and decisions
│   │   └── input-validation.md
│   └── 90_LOG/                    # Loop execution logs
│       ├── 2026-06-10-first-loop.md
│       └── 2026-06-10-second-loop.md
│
└── .agents/                       # Per-tool configuration
    └── claude/agents/             # Claude Code sub-agents
        ├── maker.md               # Implements changes
        ├── checker.md             # Verifies work
        └── goal-checker.md        # Evaluates stopping condition
Objective: Review codebase and generate improvement plans.
Model: Expensive (Claude Sonnet/Opus, GPT-4o, o3)
Process:
- Traverse `src/`, `tests/`, `wiki/`, `agents/CONTEXT.md`
- Identify bugs, technical debt, documentation gaps, refactor opportunities
- Generate plans in `plans/YYYY-MM-<name>.md`
- Update wiki if new concepts emerge
Cadence: Manual at start, then cron/weekly
Circuit breaker: Max 5 plans per execution. Plans > 10 chunks = reject and request split.
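The audit circuit breaker can be read as a triage step over the generated plans. In the sketch below, `AuditPlan` and `applyAuditBreaker` are hypothetical names, and the `deferred` bucket is an assumption: the rule above says at most 5 plans per execution but does not say what happens to the rest.

```typescript
// Hypothetical sketch: plan shape and the "deferred" bucket are assumptions.
interface AuditPlan { name: string; chunks: string[] }

const MAX_PLANS_PER_RUN = 5;
const MAX_CHUNKS_PER_PLAN = 10;

function applyAuditBreaker(plans: AuditPlan[]): {
  accepted: AuditPlan[];
  needsSplit: AuditPlan[];
  deferred: AuditPlan[];
} {
  // Plans with more than 10 chunks are rejected with a request to split.
  const needsSplit = plans.filter((p) => p.chunks.length > MAX_CHUNKS_PER_PLAN);
  const withinSize = plans.filter((p) => p.chunks.length <= MAX_CHUNKS_PER_PLAN);
  return {
    accepted: withinSize.slice(0, MAX_PLANS_PER_RUN),
    needsSplit,
    deferred: withinSize.slice(MAX_PLANS_PER_RUN),
  };
}

const makePlan = (name: string, chunkCount: number): AuditPlan => ({
  name,
  chunks: Array.from({ length: chunkCount }, (_, i) => `chunk-${i + 1}`),
});

const triage = applyAuditBreaker([
  makePlan("a", 3), makePlan("b", 11), makePlan("c", 2), makePlan("d", 1),
  makePlan("e", 4), makePlan("f", 10), makePlan("g", 5),
]);
console.log(triage.accepted.length, triage.needsSplit.length, triage.deferred.length); // 5 1 1
```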
Objective: Take a plan and execute it safely.
Model: Cheap (Claude Haiku, GPT-4o-mini, Codex mini)
Process:
- Select plan from `plans/`
- Create worktree: `git worktree add ../loop-<plan> -b loop/<plan>`
- For each chunk:
  - Apply corresponding skill (tdd, diagnose, etc.)
  - Maker implements → Checker reviews
  - Run tests + linters
  - If failure: Andon activates
- Verify goal
- If goal met: generate diff, update wiki, log in 90_LOG
- If goal NOT met after 3 attempts: Meta-Andon (Plan Mode)
Cadence: On demand or when pending plans exist
Circuit breaker:
- 3 consecutive failures = Meta-Andon
- 10 min without progress = abort
- Files modified outside scope = rollback
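Two pieces of the execution loop are mechanical enough to sketch: building the worktree command from the Process list above, and finding files modified outside a plan's scope. The helper names are hypothetical, and modeling scope as a list of allowed path prefixes is an assumption; a real plan might express scope differently.

```typescript
// Hypothetical sketch: helper names and prefix-based scope are assumptions.

// Mirrors the command from the process: git worktree add ../loop-<plan> -b loop/<plan>
function worktreeCommand(planName: string): string[] {
  return ["git", "worktree", "add", `../loop-${planName}`, "-b", `loop/${planName}`];
}

// Returns every modified file that matches none of the allowed prefixes.
function filesOutsideScope(modified: string[], allowedPrefixes: string[]): string[] {
  return modified.filter(
    (file) => !allowedPrefixes.some((prefix) => file.indexOf(prefix) === 0),
  );
}

console.log(worktreeCommand("pricing-input-validation").join(" "));
// git worktree add ../loop-pricing-input-validation -b loop/pricing-input-validation

const strays = filesOutsideScope(
  ["src/pricing.ts", "src/pricing.test.ts", "package.json"],
  ["src/pricing"],
);
console.log(strays); // [ 'package.json' ]
```

Returning the command as an argument array rather than a single string keeps plan names out of shell interpolation if it is later passed to a process spawner.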
Objective: Keep wiki and CONTEXT.md aligned with code.
Model: Cheap
Process:
- Read recent changes (`git log --since="1 week"`)
- Update relevant pages in `wiki/20_SYSTEM/`
- Update `agents/CONTEXT.md` if architecture changes
- Run wiki lint (contradictions, orphans, gaps, stale claims)
- Generate report in `wiki/90_LOG/`
Cadence: Post-merge or weekly
Circuit breaker: If lint detects > 10 issues, no auto-fix. Report and wait for human.
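Of the lint checks listed above, orphan detection is the easiest to show in code. The sketch assumes pages have already been parsed into a path plus outgoing links, and that `index.md` counts as a root that needs no inbound link; `WikiPage`, `findOrphans`, and `docLoopAction` are hypothetical names. Treating 10 or fewer issues as safe to auto-fix is inferred from the circuit breaker, not stated by it.

```typescript
// Hypothetical sketch: page shape, root handling, and names are assumptions.
interface WikiPage { path: string; links: string[] } // links = paths this page points to

function findOrphans(pages: WikiPage[], roots: string[] = ["index.md"]): string[] {
  const linked = new Set<string>();
  for (const page of pages) {
    for (const target of page.links) {
      if (target !== page.path) linked.add(target); // self-links do not count
    }
  }
  return pages
    .map((p) => p.path)
    .filter((path) => !linked.has(path) && roots.indexOf(path) === -1);
}

// More than 10 issues: no auto-fix, report and wait for a human.
function docLoopAction(issueCount: number, maxAutoFix = 10): "auto-fix" | "report-and-wait" {
  return issueCount > maxAutoFix ? "report-and-wait" : "auto-fix";
}

const orphans = findOrphans([
  { path: "index.md", links: ["10_DOMAIN/pricing-module.md"] },
  { path: "10_DOMAIN/pricing-module.md", links: ["20_SYSTEM/input-validation.md"] },
  { path: "20_SYSTEM/input-validation.md", links: [] },
  { path: "10_DOMAIN/currency-module.md", links: ["10_DOMAIN/currency-module.md"] },
]);
console.log(orphans); // [ '10_DOMAIN/currency-module.md' ]
console.log(docLoopAction(11)); // report-and-wait
```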
Disciplined test-driven development:
- Red — Write ONE failing test
- Green — Minimal implementation to pass
- Refactor — Improve while keeping tests green
Anti-spec-drift rules:
- NEVER modify existing tests to make them pass
- NEVER delete failing tests
- If existing test fails → your implementation is wrong
Structured bug diagnosis:
- Observe — Read full error, identify when it started, reproduce in isolation
- Hypothesize — Formulate 2-3 root cause hypotheses
- Verify — Add instrumentation, confirm or refute
- Fix — Minimal fix for confirmed root cause
- Post-mortem — Five Whys, classify failure, standardize prevention
Codebase audit procedure:
- Explore — Read context, vocabulary, structure, existing wiki
- Analyze — Module depth, coupling, test coverage, tech debt, docs, API surface
- Prioritize — Impact/effort matrix, group into coherent plans
- Write plans — Context, chunks, scope, risks, checks, goal
- Validate — Independent chunks, realistic scope, objective goal
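The Prioritize step can be approximated by ordering findings on an impact/effort grid. The 1 to 3 scale, the `Finding` shape, and the scores in the usage example are all invented for illustration; the skill itself may weigh findings differently.

```typescript
// Hypothetical sketch: the scale, shape, and example scores are assumptions.
interface Finding { title: string; impact: 1 | 2 | 3; effort: 1 | 2 | 3 }

function prioritize(findings: Finding[]): Finding[] {
  // Higher impact first; ties broken by lower effort. Copy so the input is not mutated.
  return [...findings].sort((a, b) => b.impact - a.impact || a.effort - b.effort);
}

const ranked = prioritize([
  { title: "No JSDoc on public functions", impact: 1, effort: 1 },
  { title: "Missing input validation", impact: 3, effort: 1 },
  { title: "Currency formatting mixed with pricing", impact: 3, effort: 2 },
]);
console.log(ranked.map((f) => f.title));
```

Here the two high-impact findings come first, with the cheaper one ahead, and the documentation gap comes last.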
Safety layer from Toyota Production System:
- Jidoka — Stop the line on failure, classify, analyze root cause
- Kaizen — Standardize learning (L1-L4 prevention levels)
- Meta-Andon — Detect repeated failure patterns (3+ consecutive)
- Spec-Drift Guard — Prevent silent requirement relaxation
- Gate-Gaming Prevention — Execute with purpose, not to pass checks
Implements changes following skills and plans.
Responsibilities:
- Read assigned plan chunk
- Follow corresponding skill
- Implement code
- Run tests
- Follow Andon flow on failures
Does NOT:
- Verify overall goal (Checker's job)
- Decide if work is "done" (Goal-Checker's job)
- Modify existing tests to pass
- Change plan scope
Verifies Maker's work independently.
Verification checklist:
- Scope validation (only planned files modified?)
- Tests (all pass? meaningful? existing tests unchanged?)
- Code quality (linter, typecheck, no commented code?)
- Spec compliance (satisfies purpose? no gate gaming?)
- Documentation (wiki updated? decisions documented?)
Decisions: APPROVE or REJECT with clear reasons
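The checklist maps naturally onto a record of booleans that collapses into a single decision. This is a sketch under assumptions: the field names and reason strings are hypothetical, and how each boolean is computed (running tests, diffing files) is left out.

```typescript
// Hypothetical sketch: field names and reason strings are assumptions.
interface ChecklistResult {
  scopeValid: boolean;
  testsPass: boolean;
  existingTestsUnchanged: boolean;
  lintAndTypecheckClean: boolean;
  specCompliant: boolean;
  docsUpdated: boolean;
}

type ReviewDecision = { decision: "APPROVE" } | { decision: "REJECT"; reasons: string[] };

const FAILURE_REASONS: Record<keyof ChecklistResult, string> = {
  scopeValid: "files modified outside the plan's scope",
  testsPass: "tests failing",
  existingTestsUnchanged: "existing tests were modified",
  lintAndTypecheckClean: "linter or typecheck errors",
  specCompliant: "does not satisfy the purpose of the chunk",
  docsUpdated: "wiki or decisions not documented",
};

function decide(result: ChecklistResult): ReviewDecision {
  const reasons = (Object.keys(FAILURE_REASONS) as Array<keyof ChecklistResult>)
    .filter((key) => !result[key])
    .map((key) => FAILURE_REASONS[key]);
  // Any single failed item is enough to reject, and every failure is reported.
  return reasons.length === 0 ? { decision: "APPROVE" } : { decision: "REJECT", reasons };
}

const allGood: ChecklistResult = {
  scopeValid: true, testsPass: true, existingTestsUnchanged: true,
  lintAndTypecheckClean: true, specCompliant: true, docsUpdated: true,
};
console.log(decide(allGood).decision); // APPROVE
console.log(decide({ ...allGood, docsUpdated: false })); // REJECT with one reason
```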
Evaluates whether stopping condition is met.
Process:
- Read goal definition
- Execute objective verifications (exit codes, file checks)
- Report MET or NOT MET with evidence
Does NOT:
- Opine on code quality
- Suggest improvements
- Modify anything
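A Goal-Checker restricted to objective evidence could look like the sketch below: each check returns an exit code, and the result lists that evidence without commentary. `GoalCheck` and `evaluateGoal` are hypothetical names, the checks are injected as functions so the example stays self-contained, and treating an empty check list as NOT MET is an assumption.

```typescript
// Hypothetical sketch: names are assumptions; real checks would run commands
// such as the test suite and return their exit codes.
interface GoalCheck { name: string; run: () => number } // exit code, 0 = success

interface GoalEvaluation { status: "MET" | "NOT MET"; evidence: string[] }

function evaluateGoal(checks: GoalCheck[]): GoalEvaluation {
  const evidence: string[] = [];
  let allPassed = true;
  for (const check of checks) {
    const code = check.run();
    evidence.push(`${check.name}: exit ${code}`);
    if (code !== 0) allPassed = false;
  }
  // No checks means no evidence, so the goal cannot be reported as met.
  const met = checks.length > 0 && allPassed;
  return { status: met ? "MET" : "NOT MET", evidence };
}

const evaluation = evaluateGoal([
  { name: "pnpm test", run: () => 0 },
  { name: "pnpm typecheck", run: () => 1 },
]);
console.log(evaluation.status, evaluation.evidence);
// NOT MET [ 'pnpm test: exit 0', 'pnpm typecheck: exit 1' ]
```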
Automatic safety stops to prevent runaway loops:
| Trigger | Action |
|---|---|
| 3+ consecutive failures | Meta-Andon: Plan Mode |
| 2 failures involving user | Line stop, requires hypothesis |
| 10+ minutes without progress | Abort + document in 90_LOG |
| 5+ files modified outside scope | Rollback + review plan |
| Spec drift detected | Stop + rollback + alert |
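The table can be expressed as a single function from loop metrics to an action. The input shape and action names below are hypothetical, and the precedence between triggers is an assumption, since the table does not say which wins when several fire at once. The "2 failures involving user" row is omitted because its trigger is not defined precisely enough here to encode.

```typescript
// Hypothetical sketch: input shape, action names, and precedence are assumptions.
interface BreakerInput {
  consecutiveFailures: number;
  minutesWithoutProgress: number;
  filesOutsideScope: number;
  specDriftDetected: boolean;
}

type BreakerAction =
  | "continue"
  | "meta-andon-plan-mode"
  | "abort-and-log"
  | "rollback-and-review-plan"
  | "stop-rollback-alert";

function evaluateBreakers(input: BreakerInput): BreakerAction {
  // Assumed order: integrity violations first, then failure streaks, then stalls.
  if (input.specDriftDetected) return "stop-rollback-alert";
  if (input.filesOutsideScope >= 5) return "rollback-and-review-plan";
  if (input.consecutiveFailures >= 3) return "meta-andon-plan-mode";
  if (input.minutesWithoutProgress >= 10) return "abort-and-log";
  return "continue";
}

const healthy: BreakerInput = {
  consecutiveFailures: 0,
  minutesWithoutProgress: 2,
  filesOutsideScope: 0,
  specDriftDetected: false,
};
console.log(evaluateBreakers(healthy)); // continue
console.log(evaluateBreakers({ ...healthy, consecutiveFailures: 3 })); // meta-andon-plan-mode
```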
Here's what happened when the audit loop ran on this repo:
Findings:
- `calculateDiscount` did not validate price >= 0 (silent bug)
- `formatCurrency` did not validate amount >= 0
- `formatCurrency` mixed with pricing (distinct responsibilities)
- No JSDoc on public functions
- Empty wiki for pricing concepts
- Missing tests for edge cases
Plans Generated:
- `plans/2026-06-pricing-input-validation.md`: Input validation
- `plans/2026-06-separate-currency-module.md`: Separate currency module
Chunk 1: Validate price >= 0
- TDD: red test → implementation → green
- Added: `if (price < 0) throw new Error("Price must be non-negative")`
Chunk 2: Validate amount >= 0
- TDD: red test → implementation → green
- Added: `if (amount < 0) throw new Error("Amount must be non-negative")`
Chunk 3: Verify propagation in applyBulkDiscount
- Added a test that verifies propagation by composition (no new code needed)
Checker Verdict: APPROVED
- Tests: 12 passed, 0 failed
- Typecheck: clean
- Scope: valid
- Spec drift: none
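To make Chunk 1 concrete, here is a sketch of the red-then-green step. Only the validation line comes from the plan above; the signature of `calculateDiscount`, its percentage-based body, and the `expectThrows` helper are assumptions, and the repo's actual tests presumably use a test runner rather than a hand-rolled helper.

```typescript
// Hypothetical signature and body; only the validation line is taken from the plan.
function calculateDiscount(price: number, percent: number): number {
  if (price < 0) throw new Error("Price must be non-negative");
  return price * (1 - percent / 100);
}

// Minimal stand-in for a test runner's "toThrow" assertion.
function expectThrows(fn: () => unknown, message: string): void {
  try {
    fn();
  } catch (err) {
    if (err instanceof Error && err.message === message) return;
    throw err;
  }
  throw new Error(`Expected error: ${message}`);
}

// Red: this check fails while the validation line is missing.
// Green: it passes once the line is added.
expectThrows(() => calculateDiscount(-1, 10), "Price must be non-negative");

// Existing behavior stays intact.
console.log(calculateDiscount(200, 25)); // 150
```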
Chunk 1: Create currency.ts
- Extracted formatCurrency with validation
Chunk 2: Create currency.test.ts
- Moved 4 tests from pricing.test.ts
Chunk 3: Remove from pricing.ts
- Removed function and tests
- Updated imports
Checker Verdict: APPROVED
- Tests: 12 passed (8 pricing + 4 currency), 0 failed
- Typecheck: clean
- Scope: valid
- Spec drift: none
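After the split, the currency module might look roughly like this. The validation line matches the one added in the first plan; the signature, the defaults, and the use of `Intl.NumberFormat` are assumptions about what `src/currency.ts` contains.

```typescript
// Hypothetical sketch of src/currency.ts; the real signature may differ.
function formatCurrency(amount: number, currency = "USD", locale = "en-US"): string {
  if (amount < 0) throw new Error("Amount must be non-negative");
  return new Intl.NumberFormat(locale, { style: "currency", currency }).format(amount);
}

console.log(formatCurrency(1234.5)); // $1,234.50
```

Keeping formatting in its own module means pricing code can be tested on plain numbers, while locale-dependent output is tested separately.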
Wiki Updated:
- Created `10_DOMAIN/pricing-module.md`
- Created `10_DOMAIN/currency-module.md`
- Created `20_SYSTEM/input-validation.md`
- Created `90_LOG/2026-06-10-first-loop.md`
- Created `90_LOG/2026-06-10-second-loop.md`
- Updated `index.md` and `log.md`
- No changes to main without human review
- Every architecture decision → wiki + reference plan
- Loops can fail: document cause in 90_LOG
- Large plans → split before executing
- Agents must read `agents/GOALS.md` before starting work
- Agents must follow corresponding skill
- Agents must update wiki for new concepts
- Agents must respect circuit breakers
This project synthesizes ideas from:
- Addy Osmani: Loop Engineering — The five pieces of a loop, sub-agent separation, cognitive surrender
- Andrej Karpathy: LLM Wiki — Persistent, compounding knowledge base maintained by LLMs
- Matt Pocock: Skills — Reusable procedures for AI agents
- shadcn: Improve — Audit-driven codebase improvement
- Andon for LLM Agents — Toyota Production System applied to coding agents
- Peter Steinberger — "You shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents."
- Boris Cherny (Claude Code) — "I don't prompt Claude anymore. I have loops running that prompt Claude and figuring out what to do. My job is to write loops."
Current phase: 2 loops completed
Completed plans:
- ✅ `pricing-input-validation`: Added input validation with TDD
- ✅ `separate-currency-module`: Extracted currency formatting to own module
Test coverage: 12 tests passing (8 pricing + 4 currency)
Wiki: 5 pages (2 domain, 1 system, 2 logs)
Next steps:
- Add more example code to audit
- Run another audit cycle
- Integrate with CI/CD for automated loop execution
- Add MCP connectors for external tools (Linear, Slack, etc.)
MIT
This is an experimental repository. Contributions welcome in:
- Additional skills
- More example code
- Integration with other AI coding tools
- Documentation improvements
- Case studies of loops applied to real projects