Tilth

Prepare the ground, let the agent grow the work.

A minimal long-running agent harness against an OpenAI-compatible LLM endpoint. Tested today against OpenRouter; the OpenAI SDK underneath means other OpenAI-flavour gateways should work, but support for them is on the roadmap rather than validated. Built to learn (and demonstrate) the Brain/Hands/Session split, the Ralph loop, and the file-backed memory channels described in Addy Osmani's long-running agents and agent harness engineering posts.

Audience: Tilth is an active research project for my work at Altered Craft. I use it for real work, so I'd suggest it to solo developers and small teams who want to understand what a long-running agent harness actually does. That's the state of things today (June 2026); the future, we shall see.

Target run: 10–60 minutes of autonomous work against an open model (default deepseek/deepseek-v4-flash on OpenRouter for the worker; the evaluator defaults to deepseek/deepseek-v4-pro), completing a short task list against a small project on a per-session git worktree.

Status — prompt-driven core. Tilth is deliberately small and currently being driven down to its essentials: a worker and an independent evaluator, the base file/search/bash tools, and full observability. There is no codified test/lint gate — the evaluator is the only gate — and no interview step: you author the work as markdown and run it. Capabilities get added back only as testing shows they're needed.

How Tilth differs

Many minimal coding agents are interactive — a developer watches the output and course-corrects, kills a bad run, or re-prompts. Tilth runs autonomously for the length of a run, with no one watching mid-task. That single difference is why it carries machinery a pair-programming agent can skip: an evaluator — a second model that judges whether a change is a proper solution against the task's acceptance criteria, not just whether the code runs; between-task caps that stand in for the budget ceiling a human would otherwise impose; a per-task evaluator ledger so a retried task sees the reviewer's prior verdicts; state kept out of the model's context; and offline-first observability (detailed just below). None of this is a knock on interactive agents; it's a different shape for a different job.

Hyper-observability

If no one is watching a run mid-flight, the recording is the supervision. Tilth's standing goal is hyper-observability — every prompt the harness sends is accessible, and every run is fully inspectable after the fact. Every assembled prompt, memory load, model call, and evaluator verdict lands in an append-only events.jsonl, and tilth visualize serves the whole thing as a local chat-style web app — tail an active run in near-realtime or replay a finished one end-to-end, with no state hidden out of reach.

A finished run, rendered by tilth visualize.

It's an early example of the goal, not a finished product. For the full product story — the Brain/Hands/Session split in detail, the memory channels, the two loops, and the worker↔evaluator dialogue — see the docs site. (The docs are mid-revision for the prompt-driven core; the README is the current source of truth for the run flow.)
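Because the event log is append-only JSON Lines, a run can also be inspected without the viewer. The sketch below counts events by kind. It assumes each line is a JSON object carrying a `type` field; that field name is a guess for illustration, not the documented schema, so check a real events.jsonl before relying on it.

```python
import json
from collections import Counter


def summarize_events(lines, key="type"):
    """Count events by the value of `key`, skipping blank lines.

    `key="type"` is an assumed field name, not the documented schema.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Events without the assumed field are grouped under "<unknown>".
        counts[event.get(key, "<unknown>")] += 1
    return counts
```

Any iterable of lines works, so an open file handle for a session's events.jsonl can be passed in directly.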

Quickstart

git clone git@github.com:AlteredCraft/tilth.git
cd tilth
uv sync
cp .env.example .env
# edit .env — TILTH_BASE_URL, TILTH_API_KEY, TILTH_WORKER_MODEL are all required
# (Tilth refuses to start without them so a misconfigured run can't silently
# fall back to a provider/model your account doesn't have)
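The fail-fast behaviour described in the comments above can be approximated with a small pre-flight check. This is a minimal sketch, not Tilth's own startup code: `missing_settings` is a hypothetical helper, and it inspects a plain mapping such as `os.environ`, so values kept only in .env must already be loaded into the environment.

```python
# The three settings the README lists as required.
REQUIRED = ("TILTH_BASE_URL", "TILTH_API_KEY", "TILTH_WORKER_MODEL")


def missing_settings(env):
    """Return the required names that are unset or blank in `env`.

    `env` is any mapping of names to string values, e.g. os.environ.
    """
    return [name for name in REQUIRED if not env.get(name, "").strip()]
```

Calling `missing_settings(os.environ)` before a run and aborting on a non-empty result gives the same guarantee in a wrapper script: no run starts with a silently defaulted provider or model.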

You author the feature as markdown in the target repo, then run it — there's no interview step. The work lives under <repo>/.tilth/tasks/:

.tilth/tasks/
├── overview.md            # the feature's goal + scope boundaries (required)
├── T-001-<slug>.md        # one file per task, ordered by id
├── T-002-<slug>.md
└── ...

Each task file is small frontmatter plus two sections:

---
id: T-001
title: Add the `add` subcommand
---

## Description
What to build, in the worker's voice. Real paths/symbols
(todo_cli/__main__.py:main()), not "the entrypoint".

## Acceptance criteria
- An externally checkable behaviour
- Another one
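To check a task file before a run, the format above is simple enough to parse by hand. The following is a rough sketch under stated assumptions: `parse_task` and `acceptance_criteria` are hypothetical helpers, the frontmatter is treated as flat `key: value` lines rather than full YAML, and Tilth's own loader may behave differently.

```python
def parse_task(text):
    """Split a task file into its frontmatter fields and '## ' sections."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("task file must start with a '---' frontmatter fence")
    try:
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        raise ValueError("frontmatter is never closed") from None

    # Frontmatter: flat "key: value" pairs, split on the first colon only.
    meta = {}
    for line in lines[1:end]:
        if line.strip():
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()

    # Body: collect the lines under each "## " heading.
    sections, current = {}, None
    for line in lines[end + 1:]:
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    bodies = {name: "\n".join(body).strip() for name, body in sections.items()}
    return meta, bodies


def acceptance_criteria(bodies):
    """Return the '- ' bullets under 'Acceptance criteria' as a list."""
    text = bodies.get("Acceptance criteria", "")
    return [ln[2:].strip() for ln in text.splitlines() if ln.startswith("- ")]
```

A file with an empty list from `acceptance_criteria` is worth fixing before the run, since the evaluator judges each change against those criteria.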

Then point Tilth at the repo:

git clone git@github.com:AlteredCraft/tilth-demo-todo-cli.git tilth-demo
# author tilth-demo/.tilth/tasks/  (tilth run prints ready-to-fill templates if the directory is missing)
uv run tilth run ./tilth-demo

For each pending task, Tilth resets context from disk and lets the worker use the file/search/bash tools until it calls submit_case. It then hands the case and diff to the evaluator in a fresh context. On accept, Tilth commits the change to the session/<id> branch, one commit per task (humans review and merge; Tilth never auto-merges). A run stops when all tasks are done or a cap is hit (iterations, wall-clock, tokens, or evaluator calls). Interrupt with Ctrl-C; resume with tilth resume.

uv run tilth resume                 # continue the latest session
uv run tilth reset                  # tear down a session's worktree + branch + dir
uv run tilth visualize              # serve the live session viewer (127.0.0.1:8765)
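The per-task flow described above can be pictured as a small loop. This is a simplified sketch, not Tilth's implementation: `run_session` and its `worker`, `evaluator`, and `commit` callables are hypothetical stand-ins, only two of the four caps are modelled (wall-clock and token caps are left out), and the returned status strings are invented for illustration.

```python
def run_session(tasks, worker, evaluator, commit,
                max_iterations=10, max_evaluator_calls=10):
    """Work through pending tasks until all are done or a cap is hit.

    `tasks` is an ordered list of task ids. `worker`, `evaluator`, and
    `commit` are hypothetical callables standing in for the real pieces.
    """
    pending = list(tasks)
    ledger = {task: [] for task in pending}   # per-task evaluator verdicts
    iterations = evaluator_calls = 0

    while pending:
        # Caps are checked between tasks, never mid-task.
        if iterations >= max_iterations:
            return "cap:iterations", ledger
        if evaluator_calls >= max_evaluator_calls:
            return "cap:evaluator_calls", ledger

        task = pending[0]
        iterations += 1
        # The worker starts fresh each time; the prior verdicts are the
        # only thing carried over to a retried task.
        case = worker(task, list(ledger[task]))

        evaluator_calls += 1
        accepted, verdict = evaluator(task, case)
        ledger[task].append(verdict)
        if accepted:
            commit(task)      # one task = one commit
            pending.pop(0)
    return "all-tasks-done", ledger
```

A rejected task stays at the head of the queue and is retried with its ledger, which is why a cap is needed to keep a task the evaluator never accepts from looping forever.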

The TILTH_* env-var table (caps, evaluator routing, context-file selection) is documented in .env.example.

Working with the codebase

# Lint
.venv/bin/python -m ruff check tilth/

# Tests
.venv/bin/python -m pytest

# Docs — live preview at http://127.0.0.1:8000
uv run --extra docs mkdocs serve

# Docs — strict build (the CI gate; catches broken nav refs, missing files, dead links)
uv run --extra docs mkdocs build --strict --site-dir /tmp/tilth-site

See CLAUDE.md for repo conventions and the architecture invariants worth preserving when editing the harness itself.
