AgentCyberRange

🎯 AgentCyberRange: Benchmarking Frontier AI Systems in Realistic Cyber Ranges

Web exploitation benchmarks, post-exploitation cyber ranges, and a unified evaluation framework.

WebExploitBench · PostExploitBench · CAGE Pipeline

What is AgentCyberRange?

AgentCyberRange is an open-source project for evaluating AI agents on realistic cyberattacks. It covers the main attack stages from web-facing exploitation to internal post-exploitation, and provides the execution framework needed to run these benchmarks across different agents and models.

🌐 WebExploitBench: evaluates web-facing exploration and exploitation over realistic web applications.
🕸️ PostExploitBench: evaluates post-exploitation techniques across enterprise-like cyber ranges.
⚙️ CAGE: parallel evaluation infrastructure for running agents, benchmarks, and verifiers at scale.

Why AgentCyberRange?

Most benchmarks stop at one checkpoint. AgentCyberRange follows the attack path: find the web entry, exploit it, use the foothold, and move through the internal range. CAGE makes this path measurable at scale through parallel, isolated, and verifiable agent runs.

Reference Evaluation Results

AgentCyberRange evaluates AI agents across two realistic cyber-attack tasks: web-facing exploitation and internal post-exploitation. The benchmark uses multiple difficulty levels to measure how agents perform with different amounts of task information.

The results show that frontier agents can already solve a non-trivial fraction of realistic cyber-attack tasks, especially when given more task-specific information. However, success rates remain far from complete, indicating that reliable end-to-end autonomous compromise is still challenging.
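As a rough illustration of how success rates per difficulty level could be tabulated, here is a minimal sketch. It assumes each task is attempted several times per level and counts as solved if at least one attempt succeeds; that scoring rule, the record layout, and every task and level name below are assumptions made for illustration, not the project's actual scoring code or data.

```python
from collections import defaultdict

# Hypothetical trial records: (task_id, prompt_level, succeeded).
# Task names, level names, and outcomes are invented for this example.
trials = [
    ("task-a", "low-info", False),
    ("task-a", "low-info", True),
    ("task-b", "low-info", False),
    ("task-b", "low-info", False),
    ("task-a", "high-info", True),
    ("task-a", "high-info", True),
    ("task-b", "high-info", False),
    ("task-b", "high-info", True),
]


def success_rate_by_level(records):
    """Return {level: fraction of tasks solved in at least one trial}."""
    solved = defaultdict(dict)  # level -> {task_id: solved in any trial}
    for task_id, level, succeeded in records:
        solved[level][task_id] = solved[level].get(task_id, False) or succeeded
    return {
        level: sum(tasks.values()) / len(tasks)
        for level, tasks in solved.items()
    }


rates = success_rate_by_level(trials)
for level, rate in rates.items():
    print(f"{level}: {rate:.0%}")
```

Grouping by level before averaging keeps the comparison between information settings explicit, which is the quantity the paragraph above describes.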

Core Repositories

⚙️ CAGE: the shared infrastructure layer for large-scale agent evaluation. It fans out agent × model × benchmark × prompt level × pass-k trials, runs them in parallel, and keeps each target isolated and resettable.
🎯 WebExploitBench: benchmark for web-facing cyber attacks. Includes 110 vulnerabilities across realistic web applications, covering zero-day, one-day, and synthetic vulnerabilities embedded in application workflows. 📦 Complete dataset: This GitHub repository releases only a subset of WebExploitBench. The complete dataset is available on Hugging Face.
🕸️ PostExploitBench: benchmark for internal post-exploitation. Includes 156 hosts in enterprise-like ranges, covering tunneling, privilege escalation, credential reuse, lateral movement, persistence, and defense evasion. 📦 Complete dataset: This GitHub repository releases only a subset of PostExploitBench. The complete dataset is available on Hugging Face.
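The fan-out that CAGE is described as performing (agent × model × benchmark × prompt level × pass-k) amounts to a Cartesian product over the experiment dimensions. The sketch below shows one way to enumerate such a trial matrix. The `Trial` class, the `expand_trials` helper, the ID format, and the agent, model, and level names are all hypothetical and do not reflect CAGE's real data model or configuration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Trial:
    """One cell of the evaluation matrix (hypothetical structure)."""
    agent: str
    model: str
    benchmark: str
    prompt_level: str
    attempt: int  # 1..k, one entry per pass-k repetition

    @property
    def trial_id(self) -> str:
        # A stable ID makes it possible to tell which trials already ran.
        return "/".join(
            [self.agent, self.model, self.benchmark,
             self.prompt_level, f"attempt-{self.attempt}"]
        )


def expand_trials(agents, models, benchmarks, prompt_levels, k):
    """Enumerate every combination of the five dimensions."""
    return [
        Trial(a, m, b, p, i)
        for a, m, b, p, i in product(
            agents, models, benchmarks, prompt_levels, range(1, k + 1)
        )
    ]


trials = expand_trials(
    agents=["agent-x", "agent-y"],   # placeholder names
    models=["model-1"],              # placeholder name
    benchmarks=["WebExploitBench", "PostExploitBench"],
    prompt_levels=["level-0", "level-1", "level-2"],  # placeholder names
    k=3,
)
print(len(trials))         # 2 * 1 * 2 * 3 * 3 = 36
print(trials[0].trial_id)  # agent-x/model-1/WebExploitBench/level-0/attempt-1
```

Because the trial count is the product of all five dimensions, adding a single model or prompt level multiplies the total work, which is one reason parallel execution matters at this scale.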

Where Should I Start?

I want to evaluate web exploitation ability

Start with WebExploitBench, then run it through CAGE.

This track tests whether an agent can explore a realistic web application, identify exploitable routes and parameters, and produce PoCs that trigger verifier-observable effects.

I want to evaluate post-exploitation ability

Start with PostExploitBench, then run it through CAGE.

This track tests whether an agent can use a foothold, pivot through constrained networks, compromise additional hosts, and make progress under realistic internal-range conditions.

I want to run or compare agents

Start with CAGE.

CAGE provides the common execution layer for configuring models, launching agents, deploying benchmark targets, collecting model-call traces, resuming runs, verifying results, and inspecting failures.
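To make the "run in parallel, then resume" idea concrete, here is a minimal sketch under stated assumptions. It assumes results are appended to a JSON Lines file keyed by a trial ID, so a rerun can skip anything already recorded. The file layout, the function names, and the placeholder `run_trial` are hypothetical and are not CAGE's actual interface.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def load_finished(results_file: Path) -> set:
    """Read the IDs of trials that already have a recorded result."""
    if not results_file.exists():
        return set()
    with results_file.open() as fh:
        return {json.loads(line)["trial_id"] for line in fh if line.strip()}


def run_trial(trial_id: str) -> dict:
    """Placeholder: a real runner would deploy a fresh target, launch the
    agent against it, and ask a verifier whether the attack succeeded."""
    return {"trial_id": trial_id, "verified": False}


def run_pending(trial_ids, results_file: Path, workers: int = 4) -> list:
    """Run only trials without a recorded result; return the IDs that ran."""
    finished = load_finished(results_file)
    pending = [t for t in trial_ids if t not in finished]
    with ThreadPoolExecutor(max_workers=workers) as pool, results_file.open("a") as fh:
        # Results are written from the main thread, so appends never interleave.
        for result in pool.map(run_trial, pending):
            fh.write(json.dumps(result) + "\n")
    return pending


results_path = Path(tempfile.mkdtemp()) / "results.jsonl"
first = run_pending(["t1", "t2", "t3"], results_path)
second = run_pending(["t1", "t2", "t3", "t4"], results_path)  # resumes
print(first)   # ['t1', 't2', 't3']
print(second)  # ['t4']
```

Appending one record per finished trial means an interrupted run loses at most the trials still in flight, at the cost of re-reading the results file on every resume.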

Documentation

Most user and developer documentation lives in CAGE:

Getting Started: clone, model setup, dataset setup, first run.
Running Experiments: project.yml, dry runs, small/full runs, resume, dashboard inspection.
Writing Benchmarks: benchmark interface, targets, scorers, dashboard generation.
Developing CAGE: runtime, agents, proxy, orchestration, web app.
Operations: Docker cleanup, orphaned resources, run IDs, large-run monitoring.

Citation

If you use this project in your research, please cite:

@misc{liu2026agentcyberrange,
  title={AgentCyberRange: Benchmarking Frontier {AI} Systems in Realistic Cyber Ranges},
  author={Fengyu Liu and Jiarun Dai and Yihe Fan and Wuyuao Mai and Ziao Li and Bofei Chen and Jie Zhang and Zheng Lou and Bocheng Xiang and Qiyi Zhang and Xudong Pan and Geng Hong and Yuan Zhang and Min Yang},
  year={2026},
  eprint={2606.14295},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2606.14295}
}

Responsible Use

AgentCyberRange is intended for controlled research and evaluation environments. Only run agents against systems you own or have explicit permission to test. Benchmark targets should be isolated, disposable, and operated in accordance with applicable laws and policies.
