🎯 AgentCyberRange: Benchmarking Frontier AI Systems in Realistic Cyber Ranges
Web exploitation benchmarks, post-exploitation cyber ranges, and a unified evaluation framework.
WebExploitBench · PostExploitBench · CAGE Pipeline
AgentCyberRange is an open-source project for evaluating AI agents on realistic cyberattacks. It covers the main attack stages from web-facing exploitation to internal post-exploitation, and provides the execution framework needed to run these benchmarks across different agents and models.
- 🌐 WebExploitBench: evaluates web-facing exploration and exploitation over realistic web applications.
- 🕸️ PostExploitBench: evaluates post-exploitation techniques across enterprise-like cyber ranges.
- ⚙️ CAGE: parallel evaluation infrastructure for running agents, benchmarks, and verifiers at scale.
Most benchmarks stop at one checkpoint. AgentCyberRange follows the attack path: find the web entry, exploit it, use the foothold, and move through the internal range. CAGE makes this path measurable at scale through parallel, isolated, and verifiable agent runs.
AgentCyberRange evaluates AI agents across two realistic cyber-attack tasks: web-facing exploitation and internal post-exploitation. The benchmark uses multiple difficulty levels to measure how agents perform with different amounts of task information.
The results show that frontier agents can already solve a non-trivial fraction of realistic cyber-attack tasks, especially when given more task-specific information. However, success rates remain far from complete, indicating that reliable end-to-end autonomous compromise is still challenging.
| Component | Description |
| --- | --- |
| ⚙️ CAGE | CAGE is the shared infrastructure layer for large-scale agent evaluation. It fans out agent × model × benchmark × prompt level × pass-k trials, runs them in parallel, and keeps each target isolated and resettable. |
| 🎯 WebExploitBench | Benchmark for web-facing cyber attacks. Includes 110 vulnerabilities across realistic web applications, covering zero-day, one-day, and synthetic vulnerabilities embedded in application workflows. 📦 Complete dataset: This GitHub repository releases only a subset of WebExploitBench. The complete dataset is available on Hugging Face. |
| 🕸️ PostExploitBench | Benchmark for internal post-exploitation. Includes 156 hosts in enterprise-like ranges, covering tunneling, privilege escalation, credential reuse, lateral movement, persistence, and defense evasion. 📦 Complete dataset: This GitHub repository releases only a subset of PostExploitBench. The complete dataset is available on Hugging Face. |
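
The trial fan-out that CAGE performs (agent × model × benchmark × prompt level × pass-k) can be pictured as a Cartesian product of evaluation dimensions. The sketch below is illustrative only and is not CAGE's implementation: the `Trial` class, `build_trials`, `run_trial`, `run_all`, and `solved_within_k` are hypothetical names, the runner is a placeholder, and reading "pass-k" as "solved if any of the k attempts succeeds" is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Trial:
    """One evaluation unit (hypothetical structure, not CAGE's schema)."""
    agent: str
    model: str
    benchmark: str
    prompt_level: str
    attempt: int  # index of this attempt within the k attempts


def build_trials(agents, models, benchmarks, prompt_levels, k):
    """Expand the full Cartesian product of the evaluation dimensions."""
    return [
        Trial(a, m, b, p, i)
        for a, m, b, p, i in product(agents, models, benchmarks, prompt_levels, range(k))
    ]


def run_trial(trial):
    """Placeholder runner. A real one would deploy a fresh target,
    launch the agent, and ask a verifier whether the attack succeeded."""
    return trial, False


def run_all(trials, runner=run_trial, max_workers=4):
    """Run trials in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, trials))


def solved_within_k(results):
    """Group attempts by configuration; a configuration counts as solved
    if any of its attempts succeeded (assumed pass-k semantics)."""
    solved = {}
    for trial, ok in results:
        key = (trial.agent, trial.model, trial.benchmark, trial.prompt_level)
        solved[key] = solved.get(key, False) or ok
    return solved
```

With one agent, two models, one benchmark, two prompt levels, and `k=3`, `build_trials` yields 1 × 2 × 1 × 2 × 3 = 12 trials, which `solved_within_k` collapses back into 4 configurations.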
Start with WebExploitBench, then run it through CAGE.
This track tests whether an agent can explore a realistic web application, identify exploitable routes and parameters, and produce PoCs that trigger verifier-observable effects.
Start with PostExploitBench, then run it through CAGE.
This track tests whether an agent can use a foothold, pivot through constrained networks, compromise additional hosts, and make progress under realistic internal-range conditions.
Start with CAGE.
CAGE provides the common execution layer for configuring models, launching agents, deploying benchmark targets, collecting model-call traces, resuming runs, verifying results, and inspecting failures.
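To illustrate what resuming a run can look like in principle, here is a minimal sketch that skips trials already recorded in a results log. It does not describe how CAGE actually stores or resumes runs: the JSON-lines log format, the `trial_id` and `success` fields, and the function names are all assumptions made for the example.

```python
import json
from pathlib import Path


def load_completed(log_path):
    """Return the set of trial IDs already present in a JSON-lines log.
    The log format is hypothetical, chosen only for this sketch."""
    path = Path(log_path)
    if not path.exists():
        return set()
    completed = set()
    with path.open() as fh:
        for line in fh:
            line = line.strip()
            if line:
                completed.add(json.loads(line)["trial_id"])
    return completed


def record_result(log_path, trial_id, success):
    """Append one finished trial to the log."""
    with Path(log_path).open("a") as fh:
        fh.write(json.dumps({"trial_id": trial_id, "success": success}) + "\n")


def pending_trials(all_trial_ids, log_path):
    """Keep only the trials that have no recorded result yet, in order."""
    done = load_completed(log_path)
    return [t for t in all_trial_ids if t not in done]
```

Appending one line per finished trial means an interrupted run loses at most the trial that was in flight, and a restart only needs to call `pending_trials` with the original trial list.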
Most user and developer documentation lives in CAGE:
- Getting Started: clone, model setup, dataset setup, first run.
- Running Experiments: `project.yml`, dry runs, small/full runs, resume, dashboard inspection.
- Writing Benchmarks: benchmark interface, targets, scorers, dashboard generation.
- Developing CAGE: runtime, agents, proxy, orchestration, web app.
- Operations: Docker cleanup, orphaned resources, run IDs, large-run monitoring.
If you use this project in your research, please cite:
@misc{liu2026agentcyberrange,
title={AgentCyberRange: Benchmarking Frontier {AI} Systems in Realistic Cyber Ranges},
author={Fengyu Liu and Jiarun Dai and Yihe Fan and Wuyuao Mai and Ziao Li and Bofei Chen and Jie Zhang and Zheng Lou and Bocheng Xiang and Qiyi Zhang and Xudong Pan and Geng Hong and Yuan Zhang and Min Yang},
year={2026},
eprint={2606.14295},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2606.14295}
}
AgentCyberRange is intended for controlled research and evaluation environments. Only run agents against systems you own or have explicit permission to test. Benchmark targets should be isolated, disposable, and operated in accordance with applicable laws and policies.


