microsoft · DEM1TASSE · Jun 30, 2026
diff --git a/README.md b/README.md
@@ -165,6 +165,25 @@ python assets/task_showcase/app.py \
 
 ---
 
+## 🧠 Skill Library (reuse solved tasks across tasks)
+
+[`webwright.skills`](src/webwright/skills/) turns solved tasks into **reusable, executable code
+skills**, retrieves and judges them at solve time, gates what enters the library, and grows the
+library incrementally — a self-evolving *store → retrieve → use/adapt → gate → evolve* loop on top
+of Webwright's code-as-action solves. Plugs in with **no change to the agent loop**:
+
+- **Reuse** — the agent calls `python -m webwright.tools.skill_use --task "..." --library ...`
+  (like `self_reflection`/`image_qa`); it returns `{verdict: use|adapt|skip, source_path}`.
+- **Grow** — `python -m webwright.skills.update --manifest batch.json --library ./library`
+  distills a batch of gate-passed solves into a parameterized, primitive-decomposed skill.
+
+Validated end-to-end on a real public website (read-only GitHub): solve two repos from scratch →
+`update` builds a parameterized skill → a held-out repo is solved by reusing it (agent calls
+`skill_use`, verdict `use`, answer correct); a wrong solve is kept out by the gate; a second batch
+improves the existing skill in place. See [`src/webwright/skills/README.md`](src/webwright/skills/README.md).
+
+---
+
 ## 🚀 Quick Start
 
 ### Prerequisites

diff --git a/src/webwright/config/skill_mode.yaml b/src/webwright/config/skill_mode.yaml
@@ -0,0 +1,13 @@
+# Skill-library mode (optional overlay): raises the step limit to leave headroom for skill-reuse runs.
+# Enable by stacking:  webwright run ... -c base.yaml -c model_openai.yaml -c skill_mode.yaml
+#
+# How the agent is told to reuse skills: the SKILL-LIBRARY block is prepended to the TASK prompt
+# by the caller (see webwright.skills.prompt.with_skill_hint), NOT injected via system_template —
+# webwright merges system_template by replacement, so a prompt-level hint is the clean, non-invasive
+# way and keeps webwright's default behavior unchanged when this mode is off.
+#
+# Requires env:
+#   SKILL_LIBRARY_ROOT          path to the skill library (read by the skill_use tool)
+#   SKILL_MODEL_NAME / SKILL_MODEL_ENDPOINT  (optional) backend for skill_use; defaults to OPENAI_*
+agent:
+  step_limit: 100
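Review note on the environment contract described in `skill_mode.yaml`: the comments say `SKILL_LIBRARY_ROOT` is required and that `SKILL_MODEL_NAME` / `SKILL_MODEL_ENDPOINT` default to the `OPENAI_*` values. A minimal sketch of that precedence is below. The function name `resolve_skill_backend`, the returned dict keys, and the choice to raise `RuntimeError` are assumptions made for illustration; this is not the code in `webwright.skills.llm` or the `skill_use` tool.

```python
from __future__ import annotations

import os
from typing import Mapping, Optional


def resolve_skill_backend(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve the skill_use settings: SKILL_* wins, OPENAI_* is the fallback.

    Hypothetical helper for illustration only; the real module may differ.
    """
    env = os.environ if env is None else env

    library_root = env.get("SKILL_LIBRARY_ROOT")
    if not library_root:
        # The overlay lists this variable under "Requires env".
        raise RuntimeError("SKILL_LIBRARY_ROOT is not set")

    return {
        "library_root": library_root,
        # Empty strings are treated like unset values, so `or` is enough here.
        "model": env.get("SKILL_MODEL_NAME") or env.get("OPENAI_MODEL"),
        "endpoint": env.get("SKILL_MODEL_ENDPOINT") or env.get("OPENAI_ENDPOINT"),
    }
```

Passing a plain dict instead of reading `os.environ` directly keeps the precedence rule easy to unit-test without touching the process environment.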
diff --git a/src/webwright/skills/README.md b/src/webwright/skills/README.md
@@ -0,0 +1,257 @@
+# `webwright.skills` — a memory / skill-library module for Webwright
+
+Turn solved tasks into **reusable, executable code skills**, retrieve and judge them at solve
+time, gate what enters the library, and grow the library incrementally. A self-evolving loop:
+
+```
+solve  --(gate: gold | self_verify)-->  admit  --(evolve: refine + parameterize + primitives)-->  library
+  ^                                                                                                   |
+  |  skill_use tool: retrieve + decide (use / adapt / skip)  <--------------------------------------- +
+```
+
+This is the missing **reuse + accumulation** layer: it consumes the `final_script.py` every
+Webwright solve already produces (plain or crafted mode — both work), accumulates skills across
+tasks, judges when a prior skill applies, and improves skills as more solves arrive — with a gate
+so wrong solves don't pollute the library. It complements `crafted_cli`: where `crafted_cli`
+parameterizes a single task's script by anticipating what might vary, `update.refine`
+parameterizes **across multiple verified solves** — the differences actually observed between
+instances become the parameters.
+
+## How it plugs into Webwright
+
+Two touch points, **no change to the agent loop or default config**:
+
+1. **Reuse at solve time — the `skill_use` tool.** The agent invokes it from bash, exactly like
+   `self_reflection` / `image_qa`:
+   ```bash
+   python -m webwright.tools.skill_use --task "<the task>" --library "$SKILL_LIBRARY_ROOT"
+   ```
+   It returns JSON `{verdict: use|adapt|skip, skill_id, source_path, how_to_reuse}`. The agent
+   reads `source_path` and reuses the skill (use = as-is, adapt = reuse core + change last step,
+   skip = solve from scratch). `webwright.skills.with_skill_hint(prompt, ...)` prepends a one-line
+   usage hint to the task prompt so the agent remembers to query the library first.
+
+2. **Growth after solving — the `update` CLI.** Distill a batch of gate-passed solves into a
+   library skill (offline, not in the solve loop):
+   ```bash
+   python -m webwright.skills.update --manifest batch.json --library ./library
+   ```
+   Manifest schema and a full walkthrough: see **How to use** below.
+
+## How to use (end-to-end)
+
+The library grows **offline** from batches of solved tasks, and is consumed **at solve time** by
+the agent. Tasks are provided **manually** today — you pick which tasks to solve and batch. The
+current focus is **same-template generalization**, so feed several instances of the SAME template
+(3+ instances with different parameter values work well): `refine` aligns them, and exactly what
+differs between instances becomes the skill's parameters — more instances, wider generalization.
+(Planned: bootstrap — automatically expand one seed task into multiple instances.)
+
+### 1. Solve a few instances of a template (normal Webwright runs)
+
+```bash
+python -m webwright.run.cli main \
+  -t "How many commits did kilian make to a11yproject on 3/1/2023?" \
+  --task-id t132_a --start-url http://gitlab.example.com -o outputs \
+  -c base.yaml -c model_openai.yaml
+```
+
+Each run leaves a directory containing `final_script.py` (the executable solve) and
+`agent_response.json` (the answer). Repeat for 2–3 more instances of the same template with
+different values (another user / repo / date).
+
+### 2. Gate the solves, write the manifest
+
+Judge each run (`gate(result, method="gold")` against a known answer, or `method="self_verify"`
+without one) and write one manifest per batch:
+
+```jsonc
+// batch.json
+{
+  "template": "How many commits did {{user}} make to {{repo}} on {{date}}?",
+  "runs": [
+    {
+      "dir": "outputs/t132_a_20260703_120000",   // run dir; final_script.py is read from it
+      "admit": true,                             // gate verdict — false rows NEVER enter the library
+      "params": {"user": "kilian", "repo": "a11yproject", "date": "3/1/2023"},
+      "verdict": "skip",                         // how the run used the library: skip = solved from
+                                                 // scratch; use / adapt = reused a skill
+                                                 // (adapt triggers refine-back into the skill)
+      "site": "gitlab",
+      "output_schema": {"type": "number"}        // required shape of retrieved_data
+      // "answer": optional — read from the run dir's agent_response.json when omitted
+    },
+    { "dir": "outputs/t132_b_20260703_121500", "admit": true,
+      "params": {"user": "gao", "repo": "2019", "date": "4/6/2023"},
+      "verdict": "skip", "site": "gitlab", "output_schema": {"type": "number"} }
+  ]
+}
+```
+
+Field by field:
+
+| field | required | meaning |
+|---|---|---|
+| `template` | yes | the template sentence with `{{param}}` placeholders. **Skills are keyed by it**: a manifest whose template already has a skill refines that skill in place; a new template adds a new skill. Use the same string across batches of the same template. |
+| `runs[].dir` | yes | a Webwright run directory; `final_script.py` is read from it |
+| `runs[].admit` | yes | the gate verdict; `false` rows are dropped and never enter the library |
+| `runs[].params` | yes | this instance's concrete values — `refine` aligns the runs and exposes exactly these differing values as the skill's arguments (this is what powers generalization) |
+| `runs[].verdict` | no (default `skip`) | how this run used the library: `skip` = solved from scratch; `use` = reused a skill as-is; `adapt` = reused + fixed the last step (**`adapt` is what triggers refining the fix back into the skill**) |
+| `runs[].site` | no | site tag stored in the skill's meta (helps retrieval) |
+| `runs[].output_schema` | no | required shape of `retrieved_data`, e.g. `{"type": "number"}` |
+| `runs[].answer` | no | this run's answer; read from the run dir's `agent_response.json` when omitted |
+
+### 3. Build / evolve the library
+
+```bash
+export OPENAI_API_KEY=...                        # backend key (never stored by the module)
+# optional — defaults to OPENAI_MODEL / OPENAI_ENDPOINT:
+export SKILL_MODEL_NAME=gpt-5.4 SKILL_MODEL_ENDPOINT=https://api.openai.com/v1/responses
+python -m webwright.skills.update --manifest batch.json --library ./library
+```
+
+Prints a changelog: `{"added": [...], "adapt_refined": [...], "use": [...], "dropped_wrong": n}`.
+Re-run with later batches any time — a new template **adds** a skill, new solves for an existing
+template **refine it in place** (keeps its working functions), templates with no new traces are
+left untouched. Batches may mix templates.
+
+### 4. Reuse at solve time
+
+```python
+from webwright.skills import with_skill_hint
+prompt = with_skill_hint(prompt, task=task_text, library="./library")
+```
+
+```bash
+SKILL_LIBRARY_ROOT=./library python -m webwright.run.cli main -t "$prompt" ...
+```
+
+The hint tells the agent to query the library first; the agent runs the `skill_use` tool, gets
+`{verdict, skill_id, source_path, how_to_reuse}`, reads the skill source, and reuses it
+(use = as-is with new parameter values, adapt = reuse the core + change the last step,
+skip = solve from scratch).
+
+### 5. Run a skill directly (optional)
+
+Every skill is also a standalone script: it reads a `taskspec.json` (parameters at run time) and
+writes `agent_response.json`:
+
+```bash
+cat > taskspec.json <<'EOF'
+{"params": {"user": "byte", "repo": "empathy-prompts", "date": "4/2/2023"},
+ "start_url": "http://gitlab.example.com", "credentials": null,
+ "output_schema": {"type": "number"}}
+EOF
+python library/how_many_commits_did_user_make_to_repo_on_date/skill.py taskspec.json
+cat agent_response.json
+```
+
+### 6. The whole pipeline in one go (a batch of tasks)
+
+Steps 1–3 driven by a single task file. `tasks.json` — one entry per instance of the template
+(`gold` is optional: with it the gate compares answers; without it the gate falls back to `self_verify`):
+
+```json
+[
+  {"id": "t132_a", "task": "How many commits did kilian make to a11yproject on 3/1/2023?",
+   "params": {"user": "kilian", "repo": "a11yproject", "date": "3/1/2023"}, "gold": 1},
+  {"id": "t132_b", "task": "How many commits did gao make to 2019 on 4/6/2023?",
+   "params": {"user": "gao", "repo": "2019", "date": "4/6/2023"}, "gold": 0}
+]
+```
+
+```bash
+START_URL=http://gitlab.example.com
+
+# 1) solve every instance (sequential; add xargs -P N or & to parallelize)
+jq -c '.[]' tasks.json | while read -r row; do
+  python -m webwright.run.cli main -t "$(jq -r .task <<<"$row")" \
+    --task-id "$(jq -r .id <<<"$row")" --start-url "$START_URL" -o outputs \
+    -c base.yaml -c model_openai.yaml </dev/null   # keep the run from consuming the loop's stdin
+done
+
+# 2) gate each run + assemble the manifest
+python - <<'PY'
+import json, glob
+from webwright.skills import gate
+
+TEMPLATE = "How many commits did {{user}} make to {{repo}} on {{date}}?"
+SCHEMA = {"type": "number"}
+runs = []
+for t in json.load(open("tasks.json")):
+    d = sorted(glob.glob(f"outputs/{t['id']}_*"))[-1]        # newest run dir of this task
+    answer = json.load(open(f"{d}/agent_response.json"))["retrieved_data"]
+    g = gate(answer, gold=t.get("gold"), output_schema=SCHEMA)   # gold if present, else self_verify
+    runs.append({"dir": d, "admit": g.admit, "params": t["params"], "verdict": "skip",
+                 "site": "gitlab", "output_schema": SCHEMA})
+json.dump({"template": TEMPLATE, "runs": runs}, open("batch.json", "w"), indent=2)
+print(sum(r["admit"] for r in runs), "of", len(runs), "admitted")
+PY
+
+# 3) evolve the library
+python -m webwright.skills.update --manifest batch.json --library ./library
+
+# 4) solve NEW instances of the template WITH the library: prepend the skill hint to the
+#    prompt (SKILL_LIBRARY_ROOT alone is not enough — the hint is what tells the agent to query)
+TASK="How many commits did byte make to empathy-prompts on 4/2/2023?"
+PROMPT=$(python -c 'import sys; from webwright.skills import with_skill_hint
+print(with_skill_hint(sys.argv[1], task=sys.argv[1], library="./library"))' "$TASK")
+SKILL_LIBRARY_ROOT=./library python -m webwright.run.cli main -t "$PROMPT" \
+  --task-id t132_new --start-url "$START_URL" -o outputs -c base.yaml -c model_openai.yaml
+```
+
+Repeat steps 1–3 whenever a new batch of solves lands; the library evolves in place (new templates are
+added, existing skills are refined, untouched skills stay as they are). This is exactly the loop
+our WebArena evaluation runs (train → gate → update → held-out reuse).
+
+## Components
+
+| file | role |
+|---|---|
+| `library.py`  | `Skill` + `Library(root)`: on-disk skills (`<id>/skill.py` + `meta.json`) |
+| `retrieve.py` | `retrieve(task, library)` → ranked `Candidate`s (relevance) |
+| `decide.py`   | `decide(task, candidates)` → `Decision(verdict, skill_id, reason)` (utility: use/adapt/skip) |
+| `gate.py`     | `gate(result, method=gold\|self_verify\|none)` → admit? (keeps wrong solves out) |
+| `update.py`   | `evolve(traces, library)`: grow on the existing library — add / adapt-refine / keep; `_refine` parameterizes + decomposes into primitives, incrementally improving an existing skill |
+| `llm.py`      | `configure_llm(model)` + `llm()`: **backend-agnostic** via Webwright's `Model` abstraction; a bare CLI builds the model from `SKILL_MODEL_NAME`/`SKILL_MODEL_ENDPOINT` (or `OPENAI_*`) env — no hardcoded endpoint/key |
+| `prompt.py`   | `with_skill_hint(prompt, task, library)`: non-invasive task-prompt hint |
+
+## Gate
+
+`gate(method=...)` is the **admission** check (independent of the solving agent — not the same as
+`self_reflection`, which is the agent's own completion condition):
+- `gold` — compare against a known answer (benchmarks); strongest.
+- `self_verify` — invariant only (non-empty + shape); weak placeholder when no gold exists. It does
+  not check *correctness*. Note: `self_reflection` cannot serve as the gate — `require_self_reflection_success`
+  makes it always `predicted_label==1`, so it would admit everything (agent grading itself).
+- `none` — admit all (demos).
+
+## Backend
+
+Backend-agnostic. Either `configure_llm(model_config_or_Model)` once in-process, or set
+`SKILL_MODEL_NAME` / `SKILL_MODEL_ENDPOINT` (falling back to `OPENAI_*`) so a bare tool invocation
+uses the same backend as the running agent. No gateway or key is hardcoded.
+
+## Results (summary)
+
+Validated with this module (full data + analysis live in the companion research repo, not here):
+
+- **WebArena, 10 templates across 3 domains (shopping_admin / gitlab / map), gold gate.** Per template,
+  3 train solves build the library and 2 held-out instances measure reuse (WITH library vs from
+  scratch): held-out **70% vs 55% accuracy (+15pp), 14.7 vs 17.1 steps**; train 86% vs 76%.
+  4 held-out tasks unsolvable from scratch are solved with the library; reuse wins outnumber
+  regressions 7 : 1. Largest saving: 33 steps → 10.
+- **Retrieval stays reliable as the library grows:** all 20 held-out solves picked the correct
+  skill from the shared library (grown to 10 skills), including telling apart two near-duplicate
+  commit-counting skills.
+- **Mixed-template batches evolve safely:** mixed batches add new templates, refine existing
+  skills in place, and leave skills with no new traces byte-identical — zero cross-contamination;
+  held-out reuse against a mixed-built library matches the per-template-built one.
+- **Incremental growth:** a later batch improves the existing skill in place (keeps the working
+  functions, adds robustness) rather than rewriting it.
+- **Gate prevents pollution:** wrong solves (7 of 30 train) are dropped and never enter the library.
+- **Real website (public GitHub, read-only):** end-to-end loop works — two repos solved from
+  scratch → `update` distilled a parameterized skill → a held-out repo solved by reusing it
+  (agent called `skill_use`, verdict `use`, answer correct).
+- **Reuse value is task-dependent:** step savings are modest on easy tasks (query overhead ≈ the
+  exploration it saves) and larger on harder tasks with more exploration to skip.
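Review note on the manifest described in `src/webwright/skills/README.md`: the field table can be turned into a small pre-flight check that runs before `update`. The sketch below works on an already-parsed manifest dict. The name `check_manifest` is hypothetical, and the rule that each run's `params` keys must equal the template's `{{param}}` placeholders is an assumption inferred from the table, not something the README states as enforced.

```python
from __future__ import annotations

import re

# Matches {{name}} placeholders such as {{user}} or {{repo}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")
VERDICTS = {"use", "adapt", "skip"}
REQUIRED_RUN_FIELDS = ("dir", "admit", "params")


def check_manifest(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the manifest looks well-formed."""
    template = manifest.get("template")
    if not isinstance(template, str) or not template:
        return ["template: missing or empty"]
    runs = manifest.get("runs")
    if not isinstance(runs, list) or not runs:
        return ["runs: missing or empty"]

    expected = set(PLACEHOLDER.findall(template))
    problems: list[str] = []
    for i, run in enumerate(runs):
        if not isinstance(run, dict):
            problems.append(f"runs[{i}]: not an object")
            continue
        for key in REQUIRED_RUN_FIELDS:
            if key not in run:
                problems.append(f"runs[{i}].{key}: required field missing")
        if "admit" in run and not isinstance(run["admit"], bool):
            problems.append(f"runs[{i}].admit: must be true or false")
        # `verdict` is optional and defaults to "skip" per the field table.
        if run.get("verdict", "skip") not in VERDICTS:
            problems.append(f"runs[{i}].verdict: must be use, adapt or skip")
        if "params" in run:
            params = run["params"]
            if not isinstance(params, dict):
                problems.append(f"runs[{i}].params: must be an object")
            elif set(params) != expected:  # assumed rule, see the note above
                problems.append(f"runs[{i}].params: keys do not match the template placeholders")
    return problems
```

Note that `batch.json` as shown in the README uses `jsonc` comments, so a plain `json.load` would reject it; the check above assumes the comments have already been stripped or the manifest was generated programmatically.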
diff --git a/src/webwright/skills/__init__.py b/src/webwright/skills/__init__.py
@@ -0,0 +1,22 @@
+"""webwright.skills — a memory/skill library module for webwright.
+
+Store solved tasks as reusable, executable code skills; retrieve + judge (use/adapt/skip) at
+solve time; admit via a gate; and grow the library incrementally (evolve). Plugs into webwright
+as a built-in submodule:
+  - solve-time reuse  : the `skill_use` tool (agent invokes it like self_reflection / image_qa)
+  - offline growth    : `update.evolve` (run after solves to distill gate-passed solves into skills)
+
+Backend-agnostic: configure_llm(model) wires it to any webwright Model.
+"""
+from .library import Library, Skill
+from .retrieve import retrieve, Candidate
+from .decide import decide, Decision
+from .gate import gate, GateResult
+from .update import evolve, Trace
+from .llm import configure_llm
+from .prompt import with_skill_hint
+
+__all__ = [
+    "Library", "Skill", "retrieve", "Candidate", "decide", "Decision",
+    "gate", "GateResult", "evolve", "Trace", "configure_llm", "with_skill_hint",
+]
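Review note on the exported `gate` / `GateResult`: the README describes three admission methods (`gold`, `self_verify`, `none`). The sketch below illustrates those three behaviours on plain Python values. Everything in it (`sketch_gate`, `SketchGateResult`, the small type map) is hypothetical and written only from the README's description; it is not the implementation in `gate.py`, whose exact signature is not shown in this diff.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional

# Minimal JSON-schema-like type map; only the "type" keyword is considered.
_JSON_TYPES = {
    "number": (int, float),
    "string": (str,),
    "array": (list,),
    "object": (dict,),
    "boolean": (bool,),
}


@dataclass
class SketchGateResult:
    admit: bool
    reason: str


def _matches_shape(answer: Any, schema: Optional[dict]) -> bool:
    if not schema:
        return True
    expected = _JSON_TYPES.get(schema.get("type"))
    if expected is None:
        return True  # unknown or absent type: nothing to check
    if isinstance(answer, bool) and bool not in expected:
        return False  # bool is a subclass of int; do not accept it as a number
    return isinstance(answer, expected)


def sketch_gate(answer: Any, *, method: str = "self_verify",
                gold: Any = None, output_schema: Optional[dict] = None) -> SketchGateResult:
    if method == "none":
        return SketchGateResult(True, "admit all (demo mode)")
    if method == "gold":
        ok = gold is not None and answer == gold
        return SketchGateResult(ok, "matches gold" if ok else "differs from gold")
    if method == "self_verify":
        # Invariant only: non-empty and the right shape. 0 counts as a real answer.
        non_empty = answer is not None and answer != "" and answer != [] and answer != {}
        ok = non_empty and _matches_shape(answer, output_schema)
        return SketchGateResult(ok, "invariants hold" if ok else "empty or wrong shape")
    raise ValueError(f"unknown gate method: {method!r}")
```

The sketch makes the README's caveat concrete: `self_verify` admits any non-empty answer of the right shape, so a wrong number still passes, which is why `gold` is described as the strongest option.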
diff --git a/src/webwright/skills/decide.py b/src/webwright/skills/decide.py
@@ -0,0 +1,50 @@
+"""Decide whether to use: candidates + task -> use / adapt / skip (utility).
+
+Stable interface (swappable implementation):
+    decide(task, candidates, *, method="llm") -> Decision
+Relevant != useful: retrieve gives "how similar", decide gives "whether and how to use it".
+"""
+from __future__ import annotations
+from dataclasses import dataclass
+
+from .llm import llm_json
+
+
+@dataclass
+class Decision:
+    verdict: str            # "use" | "adapt" | "skip"
+    skill_id: str | None
+    reason: str
+
+
+def _decide_llm(task: str, candidates) -> Decision:
+    if not candidates:
+        return Decision("skip", None, "no candidate skills")
+    cat = "\n".join(
+        f"- skill_id: {c.skill.skill_id} | template: {c.skill.meta.get('template','')} | "
+        f"summary: {c.skill.summary} | params: {c.skill.signature.get('params', [])}"
+        for c in candidates
+    )
+    system_prompt = (
+        "Decide whether a library skill is worth using for THIS task. Output STRICT JSON: "
+        '{"verdict":"use|adapt|skip","skill_id":"...","reason":"..."}.\n'
+        "- use   = the skill fits the task as-is (just different parameter values).\n"
+        "- adapt = the skill's expensive core (login / navigation / extraction) is reusable, but the "
+        "FINAL step differs; the agent should reuse the front and add/adapt only the last step.\n"
+        "- skip  = no candidate is worth it; solve from scratch (skill_id = null).\n"
+        "Relevance is not enough — only 'use'/'adapt' if it genuinely saves work."
+    )
+    user = f"## Task\n{task}\n\n## Candidate skills (most relevant first)\n{cat}"
+    out = llm_json(system_prompt, user)
+    verdict = out.get("verdict", "skip")
+    if verdict not in ("use", "adapt", "skip"):
+        verdict = "skip"
+    skill_id = out.get("skill_id") if verdict != "skip" else None
+    return Decision(verdict=verdict, skill_id=skill_id, reason=out.get("reason", ""))
+
+
+_DECIDERS = {"llm": _decide_llm}
+
+
+def decide(task: str, candidates, *, method: str = "llm") -> Decision:
+    return _DECIDERS[method](task, candidates)
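Review note on `_decide_llm`: the function already coerces an unrecognised verdict to `skip`, but it accepts whatever `skill_id` the model returns. One possible hardening, sketched below as a standalone function, is to also fall back to `skip` when the returned `skill_id` is not among the candidates. `normalize_decision` is a hypothetical name and returns a plain tuple rather than the module's `Decision`, so the sketch runs without importing `webwright`.

```python
from __future__ import annotations

from typing import Iterable, Optional, Tuple

VALID_VERDICTS = ("use", "adapt", "skip")


def normalize_decision(out: dict, candidate_ids: Iterable[str]) -> Tuple[str, Optional[str], str]:
    """Coerce raw LLM JSON into (verdict, skill_id, reason).

    Mirrors the verdict handling in the diff and adds one assumed extra rule:
    a use/adapt verdict naming an unknown skill is downgraded to skip.
    """
    verdict = out.get("verdict", "skip")
    if verdict not in VALID_VERDICTS:
        verdict = "skip"
    skill_id = out.get("skill_id") if verdict != "skip" else None
    reason = str(out.get("reason", ""))

    if verdict != "skip" and skill_id not in set(candidate_ids):
        return "skip", None, f"unknown skill_id {skill_id!r}; falling back to skip"
    return verdict, skill_id, reason
```

Downgrading to `skip` is the conservative choice here: the agent solves from scratch instead of being pointed at a `source_path` that does not exist in the library.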