Add webwright.skills: a memory/skill-library module (reuse + accumulate solved tasks)#54
Open
DEM1TASSE wants to merge 12 commits into
…tool
A built-in submodule turning solved tasks into reusable, executable code skills:
- skills/{library,retrieve,decide,gate,update,llm}: store / retrieve (relevance) /
decide (use·adapt·skip utility) / admission gate (gold|self_verify|none) /
evolve (incremental growth on existing library) — backend-agnostic via configure_llm
over webwright's own Model abstraction (no hardcoded gateway/key/path)
- tools/skill_use.py: solve-time tool (agent invokes like self_reflection/image_qa) ->
retrieve+decide -> JSON recommendation (use/adapt/skip + source path)
- python -m webwright.skills.update --manifest batch.json --library ./lib : batch growth
- tests/skills: 5 unit tests pass against the migrated module (logic == original)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…skill_use CLI
- skills/prompt.with_skill_hint: prepend skill-library usage hint to task prompt (non-invasive; webwright merges system_template by replacement, so prompt-level is the clean way)
- config/skill_mode.yaml: optional overlay doc + step budget for skill-reuse runs
- llm._model(): bare CLI (python -m webwright.tools.skill_use) builds model from SKILL_MODEL_NAME/ENDPOINT (or OPENAI_*) env -> same backend as agent, no hardcoded gateway
Co-Authored-By: Demi Wang <86202027+DEM1TASSE@users.noreply.github.com>
- README: what the module is, the two plug points (skill_use tool + update CLI), components table, gate semantics, backend config, results summary
- llm._model(): bare CLI builds model from SKILL_MODEL_NAME/ENDPOINT (or OPENAI_*) env
Co-Authored-By: Demi Wang <86202027+DEM1TASSE@users.noreply.github.com>
- README: Skill Library section (what it is, reuse via skill_use tool, grow via update CLI, end-to-end validation summary)
- tests/skills: 5 unit tests for library/gate/update/evolve/retrieve+decide
Co-Authored-By: Demi Wang <86202027+DEM1TASSE@users.noreply.github.com>
Co-Authored-By: Demi Wang <86202027+DEM1TASSE@users.noreply.github.com>
Remove _grow / update() / _UPDATERS dispatch; evolve() is now the single entry point. Drop the test_update test that exercised the removed grow path. Keep the retrieve/llm fallbacks (useful).
Co-Authored-By: Demi Wang <86202027+DEM1TASSE@users.noreply.github.com>
…val)
Three bugs hit when update.refine emits a large skill on a slow gateway:
- llm() ignored max_tokens -> model default ~4000 truncated the refined skill mid-code
- llm() had no timeout override -> model default 120s ReadTimeout'd on the ~16k-token refine (now request_timeout_seconds defaults 600, env SKILL_MODEL_TIMEOUT)
- _extract_code returned raw text (with ```python fence) when the closing fence was missing (truncated) -> skill failed to compile; now strips the opening fence anyway
Co-Authored-By: Demi Wang <86202027+DEM1TASSE@users.noreply.github.com>
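The third fix above (stripping the opening fence even when the closing one is missing) can be sketched roughly as follows. This is a minimal illustration, not the PR's `_extract_code`: the name `extract_code`, the regexes, and the fallback order are assumptions made for the example.

```python
import re

FENCE = "`" * 3  # a Markdown code fence, built here to keep this example readable

# Opening fence with an optional language tag, e.g. a fence followed by "python".
_OPEN = re.compile(rf"{FENCE}[\w+-]*[ \t]*\n")
# A complete fenced block: opening fence, body, closing fence.
_BLOCK = re.compile(rf"{FENCE}[\w+-]*[ \t]*\n(.*?){FENCE}", re.DOTALL)


def extract_code(reply: str) -> str:
    """Return the code inside the first fenced block of an LLM reply.

    Hypothetical helper. If the closing fence is missing (for example the
    reply was truncated), the opening fence is still stripped so the fence
    line itself never ends up in the saved source.
    """
    block = _BLOCK.search(reply)
    if block:
        return block.group(1).strip()
    opening = _OPEN.search(reply)
    if opening:
        # Truncated reply: keep everything after the opening fence line.
        return reply[opening.end():].strip()
    # No fence at all: treat the whole reply as code.
    return reply.strip()
```

A truncated body can of course still fail to compile; stripping the fence only removes the one failure that is purely an artifact of the reply format.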
…lve-time reuse, direct skill run)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Author
@microsoft-github-policy-service agree company="Microsoft"
…te+manifest -> update -> reuse); fix output_schema examples to gate's {type} form
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…bArena numbers
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- traces_from_manifest: 'admit' is now REQUIRED per run; a missing gate verdict raises instead of silently defaulting to admitted (was the main pollution risk)
- _slug: templates longer than 48 chars get a short content-hash suffix so two templates sharing a long prefix can no longer overwrite each other's skill
- skill_use.recommend: the decision's skill_id must be one of the RETRIEVED candidates; anything else (LLM hallucination, even an existing library id) downgrades to skip
- tests for all three
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
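Two of the guards described in this commit (the hash-suffixed slug and the candidate check on the decision) can be sketched as below. The function names `slug` and `validate_decision`, the SHA-1 choice, the 8-character suffix, and applying the 48-character limit to the normalized slug are assumptions for illustration; only the 48-character threshold and the "must be a retrieved candidate, else skip" rule come from the commit message.

```python
import hashlib
import re

MAX_SLUG_LEN = 48  # threshold named in the commit message


def slug(template: str) -> str:
    """Filesystem-safe skill id; long templates get a short content-hash suffix.

    Hypothetical sketch: the suffix is derived from the full template text, so
    two templates that share their first 48 characters still map to different ids.
    """
    base = re.sub(r"[^a-z0-9]+", "_", template.lower()).strip("_")
    if len(base) <= MAX_SLUG_LEN:
        return base
    digest = hashlib.sha1(template.encode("utf-8")).hexdigest()[:8]
    return f"{base[:MAX_SLUG_LEN]}_{digest}"


def validate_decision(decision: dict, candidate_ids: list[str]) -> dict:
    """Downgrade to skip unless skill_id is one of the retrieved candidates."""
    wants_skill = decision.get("verdict") in ("use", "adapt")
    if wants_skill and decision.get("skill_id") not in candidate_ids:
        # Covers hallucinated ids and ids that exist but were not retrieved.
        return {"verdict": "skip", "skill_id": None}
    return decision
```

Checking against the retrieved candidates rather than the whole library keeps the decide step from reaching past what retrieve actually surfaced.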
… is truthy)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
What
Adds `webwright.skills`, a memory / skill-library module (MVP): turn solved tasks into reusable, executable code skills, retrieve + judge them at solve time, gate what enters the library, and grow the library incrementally. A self-evolving loop on top of Webwright's code-as-action solves.
This is the reuse + accumulation layer on top of Webwright's code-as-action solves: it consumes the `final_script.py` every solve already produces (plain or crafted mode; both work), accumulates skills across tasks, judges when a prior skill applies, and improves skills as more solves arrive, with a gate so wrong solves don't pollute the library. It complements `crafted_cli`: where `crafted_cli` parameterizes a single task's script by anticipating what might vary, `update.refine` parameterizes across multiple verified solves, so the differences actually observed between instances become the parameters.
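The idea of letting observed differences become the parameters can be illustrated with a toy example. This is not how `update.refine` works internally (the PR describes refine as a single batched LLM prompt); `split_params` and the sample values are hypothetical and only show the principle on already-extracted fields.

```python
def split_params(instances: list[dict]) -> tuple[dict, list[str]]:
    """Split the fields of several verified solves into constants and parameters.

    A field with the same value in every solve stays baked in as a constant;
    a field that was observed to differ becomes a parameter of the skill.
    """
    constants: dict = {}
    params: list[str] = []
    for key in instances[0]:
        values = [inst[key] for inst in instances]
        if all(v == values[0] for v in values):
            constants[key] = values[0]
        else:
            params.append(key)
    return constants, params


# Made-up values for three solves of one template.
solves = [
    {"site": "gitlab", "repo": "repo-a", "date": "2023-01-01"},
    {"site": "gitlab", "repo": "repo-b", "date": "2023-02-01"},
    {"site": "gitlab", "repo": "repo-a", "date": "2023-03-01"},
]
constants, params = split_params(solves)
print(constants, params)
```

Here `site` never varied, so it stays fixed, while `repo` and `date` differed across solves and would be exposed as arguments.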
Modular composition (~633 lines of core code; +898 total incl. tests/README)
Eight small, single-responsibility modules — each with a stable interface and a swappable
implementation:
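- `skills/library.py`: `Skill` + `Library`, skills on disk (`skill.py` + `meta.json`)
- `skills/retrieve.py`: retrieve (relevance)
- `skills/decide.py`: decide (use / adapt / skip utility)
- `skills/gate.py`: admission gate (gold | self_verify | none)
- `skills/update.py`: evolve (incremental growth on the existing library); `refine` parameterizes + decomposes into primitives
- `skills/llm.py`: backend-agnostic LLM access over webwright's own `Model` (no endpoint/key hardcoded)
- `skills/prompt.py`: skill-library usage hint prepended to the task prompt (`with_skill_hint`)
- `tools/skill_use.py`: solve-time tool (retrieve + decide -> JSON recommendation)
How it plugs in (no change to the agent loop or default config)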
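- The `skill_use` tool, invoked from bash like `self_reflection` / `image_qa`, returns a JSON recommendation: `{verdict: use|adapt|skip, skill_id, source_path, how_to_reuse}`.
- The `update` CLI (`python -m webwright.skills.update --manifest batch.json --library ./lib`) distills a batch of gate-passed solves into a parameterized, primitive-decomposed skill.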
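Since the `update` CLI only consumes gate-passed solves, the manifest has to carry an explicit gate verdict per run. The sketch below shows one way such a check could look; the manifest layout (`runs`, `template`, `script`) and the strict `is True` test are assumptions for illustration, and only the rule that a missing `admit` raises comes from the commit log.

```python
import json


def traces_from_manifest(manifest: dict) -> list[dict]:
    """Keep only gate-admitted runs; a run without a verdict is an error.

    Hypothetical sketch of the behaviour described in the commit log.
    """
    admitted = []
    for i, run in enumerate(manifest["runs"]):  # "runs" is an assumed key
        if "admit" not in run:
            raise ValueError(f"run {i}: missing required 'admit' gate verdict")
        if run["admit"] is True:  # strict bool check is this sketch's choice
            admitted.append(run)
    return admitted


# Example manifest with made-up paths.
manifest = json.loads("""
{
  "runs": [
    {"template": "t1", "script": "run_a/final_script.py", "admit": true},
    {"template": "t1", "script": "run_b/final_script.py", "admit": false}
  ]
}
""")
print([run["script"] for run in traces_from_manifest(manifest)])
```

Raising on a missing verdict, instead of defaulting to admitted, means an incomplete manifest fails loudly rather than letting unverified solves into the library.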
Validation
WebArena: 10 templates × 3 domains — reuse lifts accuracy +15pp and saves steps on held-out tasks
10 retrieve-type task templates across shopping_admin / gitlab / map. Per template: 3 train
tasks build the library (solved from scratch; only gold-verified solves are admitted), 2 held-out
tasks (unseen instances of the template — different parameter values) measure reuse. Every task is
solved both WITH the library and from scratch (BASE) — 80 solves total.
Highlights:
- … library; net reuse-wins 7 vs 1 regression across the 20 held-out tasks.
- … 33 steps (scratch) to 10 (reuse); a map routing task from 29 to 16.
- … contains verified-correct skills.
- `update.refine` lifts per-instance differences into parameters and bakes the aggregation logic (top-n ranking, commit counting, route-time extraction) into primitives, so unseen instances of the template solve by a direct `use` of the skill.
- … skill from the shared library (grown to 10 skills over the run), including telling apart two near-duplicate gitlab commit-counting skills (by-date vs by-period).
- `evolve` batches produce 4 independent skills: new templates get added, existing skills are refined in place (working functions kept), skills with no new traces stay byte-identical, and there is zero cross-contamination between skills; held-out reuse against the mixed-built library matches the per-template-built one.
Real website (public GitHub, read-only): the full loop end-to-end
Solve two repos from scratch -> `update` builds a parameterized skill -> a held-out repo is solved by reusing it (the agent calls `skill_use`, verdict `use`, answer correct). Reuse pays off most on multi-step tasks where saved exploration outweighs the lookup overhead (see the WebArena numbers); on short single-page lookups it is roughly break-even.
5 unit tests under `tests/skills/` (library / gate / evolve / retrieve+decide).
Status: a deliberately simplistic MVP
Most steps are a single LLM call (retrieve = one catalog prompt, decide = one prompt, refine =
one batched prompt) — chosen for clarity, not yet for scale/accuracy. The point is the modular
shape: each stage has a stable interface, so swapping in something stronger (embedding retrieval,
a learned ranker, WebJudge / cross-source consistency for the real-website gate) is a localized
change that does not touch the others or the agent loop.
Scope
Purely additive (+898 lines total), confined to `src/webwright/skills/`, `src/webwright/tools/skill_use.py`, `src/webwright/config/skill_mode.yaml`, `tests/skills/`, and a README section. Of the +898, the actual implementation is ~511 lines of logic (non-blank, non-comment, across the skills module + the `skill_use` tool); the remainder is tests (~157), README/config (~89), and comments/docstrings/blank lines. No edits to the agent loop, models, or existing configs. Module README: `src/webwright/skills/README.md`.