dataanswer · reacher-z · May 20, 2026
diff --git a/README.md b/README.md
@@ -86,6 +86,7 @@ Save hours of research time—get straight to what matters.
 | AgentStudio | https://computer-agents.github.io/agent-studio | 2024 | Open toolkit for creating and benchmarking general-purpose virtual agents, supporting complex interactions across diverse software applications. | NA | Step Success Rate | Action Match, State Information and Image Match | Windows, Linux, macOS |
 | CRAB | https://github.com/crab-benchmark | 2024 | Cross-environment benchmark evaluating agents across mobile and desktop devices, using a graph-based evaluation method to handle multiple correct paths and task flexibility. | 120 tasks | Step Success Rate, Efficiency Score | Action Match | Linux, Android |
 | ScreenSpot | https://github.com/niucckevin/SeeClick | 2024 | Vision-based GUI benchmark with pre-trained GUI grounding, assessing agents' ability to interact with GUI elements across mobile, desktop, and web platforms using only screenshots. | 1,200 instructions | Step Success Rate | Action Match | iOS, Android, macOS, Windows, Web |
+| ClawBench | https://github.com/reacher-z/ClawBench | 2026 | Benchmark of tasks on live production websites (booking, ordering, applying, signing up). Two-stage scoring: Stage 1 intercepts the agent's HTTP requests and checks them against a per-task URL/method schema; Stage 2 applies an LLM judge to the intercepted payload, catching "right endpoint, wrong payload" errors. | V1: 153 tasks / 144 sites · V2: 130 tasks / 63 platforms | Intercepted Rate (Stage 1), Reward Rate (Stage 2) | Element Match, Payload Schema Match, LLM-as-Judge | Web |
 
 ..............<br>