From 49b47047fda2647a96bb2b746d91c3200a2484f3 Mon Sep 17 00:00:00 2001
From: Yuxuan Zhang <reacher@cs.ubc.ca>
Date: Wed, 20 May 2026 15:39:04 -0700
Subject: [PATCH] Add ClawBench to GUI Agent benchmarks

---
 README.md | 1 +
 1 file changed, 1 insertion(+)
diff --git a/README.md b/README.md
index 2616d5d..6aca2e6 100644
--- a/README.md
+++ b/README.md
@@ -86,6 +86,7 @@ Save hours of research time—get straight to what matters.
 | AgentStudio | https://computer-agents.github.io/agent-studio | 2024 | Open toolkit for creating and benchmarking general-purpose virtual agents, supporting complex interactions across diverse software applications. | NA | Step Success Rate | Action Match, State Information and Image Match | Windows, Linux, macOS |
 | CRAB | https://github.com/crab-benchmark | 2024 | Cross-environment benchmark evaluating agents across mobile and desktop devices, using a graph-based evaluation method to handle multiple correct paths and task flexibility. | 120 tasks | Step Success Rate, Efficiency Score | Action Match | Linux, Android |
 | ScreenSpot | https://github.com/niucckevin/SeeClick | 2024 | Vision-based GUI benchmark with pre-trained GUI grounding, assessing agents' ability to interact with GUI elements across mobile, desktop, and web platforms using only screenshots. | 1,200 instructions | Step Success Rate | Action Match | iOS, Android, macOS, Windows, Web |
+| ClawBench | https://github.com/reacher-z/ClawBench | 2026 | Benchmark evaluating agents on live production websites (booking, ordering, applying, signing up), using two-stage scoring: HTTP requests are intercepted and matched against a per-task URL/method schema, then an LLM judge evaluates the intercepted payload to catch "right endpoint, wrong payload" errors. | V1: 153 tasks / 144 sites · V2: 130 tasks / 63 platforms | Intercepted Rate (Stage 1), Reward Rate (Stage 2) | Element Match, Payload Schema Match, LLM-as-Judge | Web |
 
 ..............<br>