1 change: 1 addition & 0 deletions README.md
@@ -86,6 +86,7 @@ Save hours of research time—get straight to what matters.
| AgentStudio | https://computer-agents.github.io/agent-studio | 2024 | Open toolkit for creating and benchmarking general-purpose virtual agents, supporting complex interactions across diverse software applications. | NA | Step Success Rate | Action Match, State Information and Image Match | Windows, Linux, macOS |
| CRAB | https://github.com/crab-benchmark | 2024 | Cross-environment benchmark evaluating agents across mobile and desktop devices, using a graph-based evaluation method to handle multiple correct paths and task flexibility. | 120 tasks | Step Success Rate, Efficiency Score | Action Match | Linux, Android |
| ScreenSpot | https://github.com/niucckevin/SeeClick | 2024 | Vision-based GUI benchmark with pre-trained GUI grounding, assessing agents' ability to interact with GUI elements across mobile, desktop, and web platforms using only screenshots. | 1,200 instructions | Step Success Rate | Action Match | iOS, Android, macOS, Windows, Web |
| ClawBench | https://github.com/reacher-z/ClawBench | 2026 | Benchmark evaluating agents on live production websites (booking, ordering, applying, signing up) with two-stage scoring: HTTP requests are intercepted and checked against the per-task URL/method schema, then an LLM judge assesses the intercepted payload, catching "right endpoint, wrong payload" errors. | V1: 153 tasks / 144 sites · V2: 130 tasks / 63 platforms | Intercepted Rate (Stage 1), Reward Rate (Stage 2) | Element Match, Payload Schema Match, LLM-as-Judge | Web |

..............<br>
