19 changes: 19 additions & 0 deletions README.md
@@ -165,6 +165,25 @@ python assets/task_showcase/app.py \

---

## 🧠 Skill Library (reuse solved tasks on new tasks)

[`webwright.skills`](src/webwright/skills/) turns solved tasks into **reusable, executable code
skills**, retrieves and judges them at solve time, gates what enters the library, and grows the
library incrementally: a self-evolving *store → retrieve → use/adapt → gate → evolve* loop on top
of Webwright's code-as-action solves. It plugs in with **no change to the agent loop**:

- **Reuse** — the agent calls `python -m webwright.tools.skill_use --task "..." --library ...`
(like `self_reflection`/`image_qa`); it returns `{verdict: use|adapt|skip, source_path}`.
- **Grow** — `python -m webwright.skills.update --manifest batch.json --library ./library`
distills a batch of gate-passed solves into a parameterized, primitive-decomposed skill.

Validated end-to-end on a real public website (read-only GitHub): solve two repos from scratch →
`update` builds a parameterized skill → a held-out repo is solved by reusing it (agent calls
`skill_use`, verdict `use`, answer correct); a wrong solve is kept out by the gate; a second batch
improves the existing skill in place. See [`src/webwright/skills/README.md`](src/webwright/skills/README.md).

---

## 🚀 Quick Start

### Prerequisites
13 changes: 13 additions & 0 deletions src/webwright/config/skill_mode.yaml
@@ -0,0 +1,13 @@
# Skill-library mode (optional overlay): raises the step limit to leave headroom for skill-reuse runs.
# Enable by stacking: webwright run ... -c base.yaml -c model_openai.yaml -c skill_mode.yaml
#
# How the agent is told to reuse skills: the SKILL-LIBRARY block is prepended to the TASK prompt
# by the caller (see webwright.skills.prompt.with_skill_hint), NOT injected via system_template —
# webwright merges system_template by replacement, so a prompt-level hint is the clean, non-invasive
# way and keeps webwright's default behavior unchanged when this mode is off.
#
# Requires env:
# SKILL_LIBRARY_ROOT path to the skill library (read by the skill_use tool)
# SKILL_MODEL_NAME / SKILL_MODEL_ENDPOINT (optional) backend for skill_use; defaults to OPENAI_*
agent:
  step_limit: 100
257 changes: 257 additions & 0 deletions src/webwright/skills/README.md
@@ -0,0 +1,257 @@
# `webwright.skills` — a memory / skill-library module for Webwright

Turn solved tasks into **reusable, executable code skills**, retrieve and judge them at solve
time, gate what enters the library, and grow the library incrementally. A self-evolving loop:

```
solve --(gate: gold | self_verify)--> admit --(evolve: refine + parameterize + primitives)--> library
  ^                                                                                             |
  |   skill_use tool: retrieve + decide (use / adapt / skip) <----------------------------------+
```

This is the missing **reuse + accumulation** layer: it consumes the `final_script.py` every
Webwright solve already produces (plain or crafted mode — both work), accumulates skills across
tasks, judges when a prior skill applies, and improves skills as more solves arrive — with a gate
so wrong solves don't pollute the library. It complements `crafted_cli`: where `crafted_cli`
parameterizes a single task's script by anticipating what might vary, `update.refine`
parameterizes **across multiple verified solves** — the differences actually observed between
instances become the parameters.

## How it plugs into Webwright

Two touch points, **no change to the agent loop or default config**:

1. **Reuse at solve time: the `skill_use` tool.** The agent invokes it from bash, exactly like
   `self_reflection` / `image_qa`:
   ```bash
   python -m webwright.tools.skill_use --task "<the task>" --library "$SKILL_LIBRARY_ROOT"
   ```
   It returns JSON `{verdict: use|adapt|skip, skill_id, source_path, how_to_reuse}`. The agent
   reads `source_path` and reuses the skill (use = as-is, adapt = reuse core + change last step,
   skip = solve from scratch). `webwright.skills.with_skill_hint(prompt, ...)` prepends a one-line
   usage hint to the task prompt so the agent remembers to query the library first.

2. **Growth after solving: the `update` CLI.** Distill a batch of gate-passed solves into a
   library skill (offline, not in the solve loop):
   ```bash
   python -m webwright.skills.update --manifest batch.json --library ./library
   ```
   Manifest schema and a full walkthrough: see **How to use** below.
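
As an illustration of the reuse touch point, the sketch below turns the tool's JSON output into a one-line plan. It is not part of the module: `plan_reuse` and the sample `skill_id` / `source_path` values are hypothetical, and only the field names and the three verdicts come from the description above.

```python
import json


def plan_reuse(tool_output: str) -> str:
    """Map skill_use's JSON output to a short plan (hypothetical helper)."""
    result = json.loads(tool_output)
    verdict = result.get("verdict", "skip")
    source = result.get("source_path")
    # Anything unexpected, or a verdict without a path, falls back to solving from scratch.
    if verdict not in ("use", "adapt") or not source:
        return "skip: solve from scratch"
    if verdict == "use":
        return f"use: run {source} as-is with new parameter values"
    return f"adapt: reuse the core of {source} and change the last step"


# Made-up output in the documented shape; the id and path are assumptions.
example = json.dumps({
    "verdict": "adapt",
    "skill_id": "commits_by_user",
    "source_path": "library/commits_by_user/skill.py",
    "how_to_reuse": "keep login and navigation, change the final count",
})
print(plan_reuse(example))
```

Treating a malformed or incomplete answer as `skip` keeps the default behavior (solve from scratch) whenever the library has nothing usable.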

## How to use (end-to-end)

The library grows **offline** from batches of solved tasks, and is consumed **at solve time** by
the agent. Tasks are provided **manually** today — you pick which tasks to solve and batch. The
current focus is **same-template generalization**, so feed several instances of the SAME template
(3+ instances with different parameter values works well): `refine` aligns them, and exactly what
differs between instances becomes the skill's parameters — more instances, wider generalization.
(Planned: bootstrap — automatically expand one seed task into multiple instances.)
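
The idea that "what differs between instances becomes the parameters" can be sketched on plain parameter dictionaries. This is a deliberate simplification: the real `refine` step aligns solve scripts, whereas the hypothetical `varying_params` helper below only compares values that are already labeled.

```python
def varying_params(instances: list[dict]) -> dict[str, list]:
    """Return the keys whose values differ across instances, with the observed values."""
    keys = sorted({key for inst in instances for key in inst})
    observed = {key: [inst.get(key) for inst in instances] for key in keys}
    # A key is a parameter candidate only if at least two instances disagree on it.
    return {key: vals for key, vals in observed.items() if len(set(map(repr, vals))) > 1}


instances = [
    {"user": "kilian", "repo": "a11yproject", "date": "3/1/2023", "site": "gitlab"},
    {"user": "gao", "repo": "2019", "date": "4/6/2023", "site": "gitlab"},
]
print(sorted(varying_params(instances)))  # ['date', 'repo', 'user']
```

`site` is identical in both instances, so it stays a constant. A third instance on a different site would promote it to a parameter, which is why more instances widen generalization.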

### 1. Solve a few instances of a template (normal Webwright runs)

```bash
python -m webwright.run.cli main \
-t "How many commits did kilian make to a11yproject on 3/1/2023?" \
--task-id t132_a --start-url http://gitlab.example.com -o outputs \
-c base.yaml -c model_openai.yaml
```

Each run leaves a directory containing `final_script.py` (the executable solve) and
`agent_response.json` (the answer). Repeat for 2–3 more instances of the same template with
different values (another user / repo / date).

### 2. Gate the solves, write the manifest

Judge each run (`gate(result, method="gold")` against a known answer, or `method="self_verify"`
without one) and write one manifest per batch:

```jsonc
// batch.json
{
  "template": "How many commits did {{user}} make to {{repo}} on {{date}}?",
  "runs": [
    {
      "dir": "outputs/t132_a_20260703_120000",   // run dir; final_script.py is read from it
      "admit": true,                             // gate verdict; false rows NEVER enter the library
      "params": {"user": "kilian", "repo": "a11yproject", "date": "3/1/2023"},
      "verdict": "skip",                         // how the run used the library: skip = solved from
                                                 // scratch; use / adapt = reused a skill
                                                 // (adapt triggers refine-back into the skill)
      "site": "gitlab",
      "output_schema": {"type": "number"}        // required shape of retrieved_data
      // "answer": optional; read from the run dir's agent_response.json when omitted
    },
    {
      "dir": "outputs/t132_b_20260703_121500", "admit": true,
      "params": {"user": "gao", "repo": "2019", "date": "4/6/2023"},
      "verdict": "skip", "site": "gitlab", "output_schema": {"type": "number"}
    }
  ]
}
```

Field by field:

| field | required | meaning |
|---|---|---|
| `template` | yes | the template sentence with `{{param}}` placeholders. **Skills are keyed by it**: a manifest whose template already has a skill refines that skill in place; a new template adds a new skill. Use the same string across batches of the same template. |
| `runs[].dir` | yes | a Webwright run directory; `final_script.py` is read from it |
| `runs[].admit` | yes | the gate verdict; `false` rows are dropped and never enter the library |
| `runs[].params` | yes | this instance's concrete values — `refine` aligns the runs and exposes exactly these differing values as the skill's arguments (this is what powers generalization) |
| `runs[].verdict` | no (default `skip`) | how this run used the library: `skip` = solved from scratch; `use` = reused a skill as-is; `adapt` = reused + fixed the last step (**`adapt` is what triggers refining the fix back into the skill**) |
| `runs[].site` | no | site tag stored in the skill's meta (helps retrieval) |
| `runs[].output_schema` | no | required shape of `retrieved_data`, e.g. `{"type": "number"}` |
| `runs[].answer` | no | this run's answer; read from the run dir's `agent_response.json` when omitted |
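
A quick pre-flight check built from the table above can catch a malformed manifest before `update` runs. This is a sketch, not the module's own validation (which is not shown here): `check_manifest` is a hypothetical helper, and it only checks the required fields and the documented `verdict` values.

```python
REQUIRED_RUN_FIELDS = ("dir", "admit", "params")
VERDICTS = ("use", "adapt", "skip")


def check_manifest(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the manifest matches the table."""
    problems = []
    if not isinstance(manifest.get("template"), str):
        problems.append("template: required string")
    for i, run in enumerate(manifest.get("runs", [])):
        for field in REQUIRED_RUN_FIELDS:
            if field not in run:
                problems.append(f"runs[{i}].{field}: required")
        # verdict is optional and defaults to "skip"
        if run.get("verdict", "skip") not in VERDICTS:
            problems.append(f"runs[{i}].verdict: must be one of {VERDICTS}")
    return problems


manifest = {
    "template": "How many commits did {{user}} make to {{repo}} on {{date}}?",
    "runs": [{"dir": "outputs/t132_a_20260703_120000",
              "params": {"user": "kilian", "repo": "a11yproject", "date": "3/1/2023"}}],
}
print(check_manifest(manifest))  # ['runs[0].admit: required']
```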

### 3. Build / evolve the library

```bash
export OPENAI_API_KEY=... # backend key (never stored by the module)
# optional — defaults to OPENAI_MODEL / OPENAI_ENDPOINT:
export SKILL_MODEL_NAME=gpt-5.4 SKILL_MODEL_ENDPOINT=https://api.openai.com/v1/responses
python -m webwright.skills.update --manifest batch.json --library ./library
```

Prints a changelog: `{"added": [...], "adapt_refined": [...], "use": [...], "dropped_wrong": n}`.
Re-run with later batches any time — a new template **adds** a skill, new solves for an existing
template **refine it in place** (keeps its working functions), templates with no new traces are
left untouched. Batches may mix templates.
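
Because skills are keyed by template, the add / refine / untouched split follows from set operations on template strings. The sketch below is illustrative only: `classify_templates` and its output keys are hypothetical and are not the changelog format printed by `update`.

```python
def classify_templates(library_templates: set[str], batch_templates: set[str]) -> dict[str, list[str]]:
    """Predict what a batch does to each template, given the templates already in the library."""
    return {
        "added": sorted(batch_templates - library_templates),      # new template: new skill
        "refined": sorted(batch_templates & library_templates),    # known template: refine in place
        "untouched": sorted(library_templates - batch_templates),  # no new traces: left as-is
    }


library = {"How many commits did {{user}} make to {{repo}} on {{date}}?"}
batch = {
    "How many commits did {{user}} make to {{repo}} on {{date}}?",
    "List the open issues of {{repo}}",  # made-up second template
}
print(classify_templates(library, batch))
```

This also shows why the template string should be identical across batches: a reworded template would land in `added` and create a second skill instead of refining the first.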

### 4. Reuse at solve time

```python
from webwright.skills import with_skill_hint
prompt = with_skill_hint(prompt, task=task_text, library="./library")
```

```bash
SKILL_LIBRARY_ROOT=./library python -m webwright.run.cli main -t "$prompt" ...
```

The hint tells the agent to query the library first; the agent runs the `skill_use` tool, gets
`{verdict, skill_id, source_path, how_to_reuse}`, reads the skill source, and reuses it
(use = as-is with new parameter values, adapt = reuse the core + change the last step,
skip = solve from scratch).

### 5. Run a skill directly (optional)

Every skill is also a standalone script: it reads a `taskspec.json` (parameters at run time) and
writes `agent_response.json`:

```bash
cat > taskspec.json <<'EOF'
{"params": {"user": "byte", "repo": "empathy-prompts", "date": "4/2/2023"},
"start_url": "http://gitlab.example.com", "credentials": null,
"output_schema": {"type": "number"}}
EOF
python library/how_many_commits_did_user_make_to_repo_on_date/skill.py taskspec.json
cat agent_response.json
```
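
For orientation, a standalone skill presumably has roughly the following shape. This is a minimal sketch under stated assumptions: `solve` is a stub standing in for the browsing logic, the `main` signature is hypothetical, and only the `taskspec.json` fields and the `retrieved_data` key in `agent_response.json` come from this README.

```python
import json
import sys
from pathlib import Path


def solve(params: dict, start_url: str) -> int:
    """Stub for the browsing logic; a real skill would navigate start_url and count commits."""
    return 0


def main(spec_path: str, out_path: str = "agent_response.json") -> dict:
    """Read the task spec, run the solve, and write the answer file."""
    spec = json.loads(Path(spec_path).read_text())
    answer = solve(spec["params"], spec["start_url"])
    response = {"retrieved_data": answer}
    Path(out_path).write_text(json.dumps(response))
    return response


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Keeping every instance-specific value in `taskspec.json` is what lets the same script serve any instance of the template.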

### 6. The whole pipeline in one go (a batch of tasks)

Steps 1–3 driven by a single task file. `tasks.json` — one entry per instance of the template
(`gold` is optional; with it the gate compares answers, without it it falls back to `self_verify`):

```json
[
{"id": "t132_a", "task": "How many commits did kilian make to a11yproject on 3/1/2023?",
"params": {"user": "kilian", "repo": "a11yproject", "date": "3/1/2023"}, "gold": 1},
{"id": "t132_b", "task": "How many commits did gao make to 2019 on 4/6/2023?",
"params": {"user": "gao", "repo": "2019", "date": "4/6/2023"}, "gold": 0}
]
```

```bash
START_URL=http://gitlab.example.com

# 1) solve every instance (sequential; add xargs -P N or & to parallelize)
jq -c '.[]' tasks.json | while read -r row; do
  python -m webwright.run.cli main -t "$(jq -r .task <<<"$row")" \
    --task-id "$(jq -r .id <<<"$row")" --start-url "$START_URL" -o outputs \
    -c base.yaml -c model_openai.yaml
done

# 2) gate each run + assemble the manifest
python - <<'PY'
import json, glob
from webwright.skills import gate

TEMPLATE = "How many commits did {{user}} make to {{repo}} on {{date}}?"
SCHEMA = {"type": "number"}
runs = []
for t in json.load(open("tasks.json")):
    d = sorted(glob.glob(f"outputs/{t['id']}_*"))[-1]  # newest run dir of this task
    answer = json.load(open(f"{d}/agent_response.json"))["retrieved_data"]
    g = gate(answer, gold=t.get("gold"), output_schema=SCHEMA)  # gold if present, else self_verify
    runs.append({"dir": d, "admit": g.admit, "params": t["params"], "verdict": "skip",
                 "site": "gitlab", "output_schema": SCHEMA})
json.dump({"template": TEMPLATE, "runs": runs}, open("batch.json", "w"), indent=2)
print(sum(r["admit"] for r in runs), "of", len(runs), "admitted")
PY

# 3) evolve the library
python -m webwright.skills.update --manifest batch.json --library ./library

# 4) solve NEW instances of the template WITH the library: prepend the skill hint to the
# prompt (SKILL_LIBRARY_ROOT alone is not enough — the hint is what tells the agent to query)
TASK="How many commits did byte make to empathy-prompts on 4/2/2023?"
PROMPT=$(python -c 'import sys; from webwright.skills import with_skill_hint
print(with_skill_hint(sys.argv[1], task=sys.argv[1], library="./library"))' "$TASK")
SKILL_LIBRARY_ROOT=./library python -m webwright.run.cli main -t "$PROMPT" \
--task-id t132_new --start-url "$START_URL" -o outputs -c base.yaml -c model_openai.yaml
```

Repeat 1–3 whenever a new batch of solves lands — the library evolves in place (new templates are
added, existing skills are refined, untouched skills stay as they are). This is exactly the loop
our WebArena evaluation runs (train → gate → update → held-out reuse).

## Components

| file | role |
|---|---|
| `library.py` | `Skill` + `Library(root)`: on-disk skills (`<id>/skill.py` + `meta.json`) |
| `retrieve.py` | `retrieve(task, library)` → ranked `Candidate`s (relevance) |
| `decide.py` | `decide(task, candidates)` → `Decision(verdict, skill_id, reason)` (utility: use/adapt/skip) |
| `gate.py` | `gate(result, method=gold\|self_verify\|none)` → admit? (keeps wrong solves out) |
| `update.py` | `evolve(traces, library)`: grow on the existing library — add / adapt-refine / keep; `_refine` parameterizes + decomposes into primitives, incrementally improving an existing skill |
| `llm.py` | `configure_llm(model)` + `llm()`: **backend-agnostic** via Webwright's `Model` abstraction; a bare CLI builds the model from `SKILL_MODEL_NAME`/`SKILL_MODEL_ENDPOINT` (or `OPENAI_*`) env — no hardcoded endpoint/key |
| `prompt.py` | `with_skill_hint(prompt, task, library)`: non-invasive task-prompt hint |

## Gate

`gate(method=...)` is the **admission** check (independent of the solving agent — not the same as
`self_reflection`, which is the agent's own completion condition):
- `gold` — compare against a known answer (benchmarks); strongest.
- `self_verify` — invariant only (non-empty + shape); weak placeholder when no gold exists. It does
not check *correctness*. Note: `self_reflection` cannot serve as the gate — `require_self_reflection_success`
makes it always `predicted_label==1`, so it would admit everything (agent grading itself).
- `none` — admit all (demos).
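
The difference between the three methods is easiest to see on a wrong answer. The sketch below is not the module's `gate`: `sketch_gate`, `SketchGateResult` and the `expected_type` argument are hypothetical stand-ins that mirror only the behavior described in the list above.

```python
from dataclasses import dataclass


@dataclass
class SketchGateResult:
    admit: bool
    reason: str


def sketch_gate(answer, method="gold", gold=None, expected_type=None) -> SketchGateResult:
    """Illustrative admission check; the real gate may differ in signature and detail."""
    if method == "none":
        return SketchGateResult(True, "admit all")
    if method == "gold":
        ok = answer == gold
        return SketchGateResult(ok, "matches gold" if ok else "differs from gold")
    if method == "self_verify":
        # Invariant only: non-empty and of the expected shape. Correctness is NOT checked.
        non_empty = answer is not None and answer != "" and answer != []
        shaped = expected_type is None or isinstance(answer, expected_type)
        ok = non_empty and shaped
        return SketchGateResult(ok, "invariants hold" if ok else "empty or wrong shape")
    raise ValueError(f"unknown method: {method}")


# A wrong answer (5 instead of 1) is rejected by gold but passes self_verify.
print(sketch_gate(5, "gold", gold=1).admit)                      # False
print(sketch_gate(5, "self_verify", expected_type=(int, float)).admit)  # True
```

Note that `0` counts as non-empty here, since zero is a legitimate answer to a counting task.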

## Backend

Backend-agnostic. Either `configure_llm(model_config_or_Model)` once in-process, or set
`SKILL_MODEL_NAME` / `SKILL_MODEL_ENDPOINT` (falling back to `OPENAI_*`) so a bare tool invocation
uses the same backend as the running agent. No gateway or key is hardcoded.
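
The environment fallback can be pictured as follows. This is a sketch of the documented lookup order only: `resolve_backend` is a hypothetical helper, and the fallback names `OPENAI_MODEL` / `OPENAI_ENDPOINT` are taken from the comment in the "Build / evolve the library" step.

```python
import os


def resolve_backend(env=None) -> dict:
    """Pick the skill backend from SKILL_* variables, falling back to OPENAI_*."""
    env = os.environ if env is None else env
    return {
        "model": env.get("SKILL_MODEL_NAME") or env.get("OPENAI_MODEL"),
        "endpoint": env.get("SKILL_MODEL_ENDPOINT") or env.get("OPENAI_ENDPOINT"),
    }


# Only the model is overridden; the endpoint falls back to the agent's.
print(resolve_backend({
    "SKILL_MODEL_NAME": "gpt-5.4",
    "OPENAI_MODEL": "agent-model",
    "OPENAI_ENDPOINT": "https://api.openai.com/v1/responses",
}))
```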

## Results (summary)

Validated with this module (full data + analysis live in the companion research repo, not here):

- **WebArena — 10 templates × 3 domains (shopping_admin / gitlab / map), gold gate.** Per template
3 train solves build the library, 2 held-out instances measure reuse (WITH library vs from
scratch): held-out **70% vs 55% accuracy (+15pp), 14.7 vs 17.1 steps**; train 86% vs 76%.
4 held-out tasks unsolvable from scratch are solved with the library; overall, 7 reuse wins
against 1 regression. Largest saving: 33 steps → 10.
- **Retrieval stays reliable as the library grows:** all 20 held-out solves picked the correct
skill from the shared library (grown to 10 skills), including telling apart two near-duplicate
commit-counting skills.
- **Mixed-template batches evolve safely:** mixed batches add new templates, refine existing
skills in place, and leave skills with no new traces byte-identical — zero cross-contamination;
held-out reuse against a mixed-built library matches the per-template-built one.
- **Incremental growth:** a later batch improves the existing skill in place (keeps the working
functions, adds robustness) rather than rewriting it.
- **Gate prevents pollution:** wrong solves (7 of 30 train) are dropped and never enter the library.
- **Real website (public GitHub, read-only):** end-to-end loop works — two repos solved from
scratch → `update` distilled a parameterized skill → a held-out repo solved by reusing it
(agent called `skill_use`, verdict `use`, answer correct).
- **Reuse value is task-dependent:** step savings are modest on easy tasks (query overhead ≈ the
exploration it saves) and larger on harder tasks with more exploration to skip.
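
The headline numbers above can be restated as relative quantities with a few lines of arithmetic. Nothing here is new data: the inputs are the figures quoted in the list, and the derived percentages are plain calculations on them.

```python
held_out_with, held_out_scratch = 0.70, 0.55   # held-out accuracy, with library vs from scratch
steps_with, steps_scratch = 14.7, 17.1         # mean steps on held-out tasks
train_total, train_dropped = 30, 7             # train solves and those rejected by the gate

gain_pp = round((held_out_with - held_out_scratch) * 100)
step_saving = (steps_scratch - steps_with) / steps_scratch
largest_saving = (33 - 10) / 33
admitted = train_total - train_dropped

print(f"accuracy gain: +{gain_pp}pp")
print(f"mean step saving: {step_saving:.1%}")
print(f"largest single saving: {largest_saving:.1%}")
print(f"admitted train solves: {admitted} of {train_total}")
```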
22 changes: 22 additions & 0 deletions src/webwright/skills/__init__.py
@@ -0,0 +1,22 @@
"""webwright.skills — a memory/skill library module for webwright.

Store solved tasks as reusable, executable code skills; retrieve + judge (use/adapt/skip) at
solve time; admit via a gate; and grow the library incrementally (evolve). Plugs into webwright
as a built-in submodule:
- solve-time reuse : the `skill_use` tool (agent invokes it like self_reflection / image_qa)
- offline growth : `update.evolve` (run after solves to distill gate-passed solves into skills)

Backend-agnostic: configure_llm(model) wires it to any webwright Model.
"""
from .library import Library, Skill
from .retrieve import retrieve, Candidate
from .decide import decide, Decision
from .gate import gate, GateResult
from .update import evolve, Trace
from .llm import configure_llm
from .prompt import with_skill_hint

__all__ = [
"Library", "Skill", "retrieve", "Candidate", "decide", "Decision",
"gate", "GateResult", "evolve", "Trace", "configure_llm", "with_skill_hint",
]
50 changes: 50 additions & 0 deletions src/webwright/skills/decide.py
@@ -0,0 +1,50 @@
"""Decide whether to use: candidates + task -> use / adapt / skip (utility).

Stable interface (swappable implementation):
decide(task, candidates, *, method="llm") -> Decision
Relevant != useful: retrieve gives "how similar", decide gives "whether and how to use it".
"""
from __future__ import annotations
from dataclasses import dataclass

from .llm import llm_json


@dataclass
class Decision:
    verdict: str  # "use" | "adapt" | "skip"
    skill_id: str | None
    reason: str


def _decide_llm(task: str, candidates) -> Decision:
    # Nothing retrieved: skip without calling the LLM.
    if not candidates:
        return Decision("skip", None, "no candidate skills")
    # One catalogue line per candidate, most relevant first.
    cat = "\n".join(
        f"- skill_id: {c.skill.skill_id} | template: {c.skill.meta.get('template','')} | "
        f"summary: {c.skill.summary} | params: {c.skill.signature.get('params', [])}"
        for c in candidates
    )
    sys = (
        "Decide whether a library skill is worth using for THIS task. Output STRICT JSON: "
        '{"verdict":"use|adapt|skip","skill_id":"...","reason":"..."}.\n'
        "- use = the skill fits the task as-is (just different parameter values).\n"
        "- adapt = the skill's expensive core (login / navigation / extraction) is reusable, but the "
        "FINAL step differs; the agent should reuse the front and add/adapt only the last step.\n"
        "- skip = no candidate is worth it; solve from scratch (skill_id = null).\n"
        "Relevance is not enough — only 'use'/'adapt' if it genuinely saves work."
    )
    user = f"## Task\n{task}\n\n## Candidate skills (most relevant first)\n{cat}"
    out = llm_json(sys, user)
    # Any verdict outside the three known values degrades to "skip".
    verdict = out.get("verdict", "skip")
    if verdict not in ("use", "adapt", "skip"):
        verdict = "skip"
    skill_id = out.get("skill_id") if verdict != "skip" else None
    return Decision(verdict=verdict, skill_id=skill_id, reason=out.get("reason", ""))


_DECIDERS = {"llm": _decide_llm}


def decide(task: str, candidates, *, method: str = "llm") -> Decision:
    return _DECIDERS[method](task, candidates)