From 49b47047fda2647a96bb2b746d91c3200a2484f3 Mon Sep 17 00:00:00 2001
From: Yuxuan Zhang
Date: Wed, 20 May 2026 15:39:04 -0700
Subject: [PATCH] Add ClawBench to GUI Agent benchmarks

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 2616d5d..6aca2e6 100644
--- a/README.md
+++ b/README.md
@@ -86,6 +86,7 @@ Save hours of research time—get straight to what matters.
 | AgentStudio | https://computer-agents.github.io/agent-studio | 2024 | Open toolkit for creating and benchmarking general-purpose virtual agents, supporting complex interactions across diverse software applications. | NA | Step Success Rate | Action Match, State Information and Image Match | Windows, Linux, macOS |
 | CRAB | https://github.com/crab-benchmark | 2024 | Cross-environment benchmark evaluating agents across mobile and desktop devices, using a graph-based evaluation method to handle multiple correct paths and task flexibility. | 120 tasks | Step Success Rate, Efficiency Score | Action Match | Linux, Android |
 | ScreenSpot | https://github.com/niucckevin/SeeClick | 2024 | Vision-based GUI benchmark with pre-trained GUI grounding, assessing agents' ability to interact with GUI elements across mobile, desktop, and web platforms using only screenshots. | 1,200 instructions | Step Success Rate | Action Match | iOS, Android, macOS, Windows, Web |
+| ClawBench | https://github.com/reacher-z/ClawBench | 2026 | Live production websites (booking, ordering, applying, signing up). Two-stage scoring: HTTP-request interception at the per-task URL/method schema + LLM judge on the intercepted payload. Catches "right endpoint, wrong payload" errors. | V1: 153 tasks / 144 sites · V2: 130 tasks / 63 platforms | Intercepted Rate (Stage 1), Reward Rate (Stage 2) | Element Match, Payload Schema Match, LLM-as-Judge | Web |
..............