CAGE (Cybersecurity Agent Gym & Evaluation) is an evaluation framework for already-installed AI coding agents — Claude Code, Codex, Qwen, Kimi, or your own. It runs each agent inside its own Docker container against a pluggable benchmark, intercepts every LLM call through an in-container proxy, snapshots state before and after, and scores the trial. You supply what to evaluate; CAGE owns how it runs.
CAGE is infrastructure — not a benchmark, not an agent. Everything domain-specific (samples, prompts, live targets, scoring) lives in a benchmark package outside the framework.
CAGE runs trials in Docker and Docker Compose, and uses Git LFS for benchmark assets. On Ubuntu/Debian, install the basic dependencies first:
sudo apt update
sudo apt install -y git git-lfs
git lfs install
Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
Then clone the repository and install CAGE into a virtual environment:
git clone https://github.com/AgentCyberRange/CAGE.git
cd CAGE
uv venv
source .venv/bin/activate
uv pip install -e .
Copy the example model registry:
cp config/models.example.yml config/models.yml
cage model list
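Before going further, it can help to confirm that the tools this setup relies on are actually on PATH. The helper below is a minimal sketch: `missing_tools` is a hypothetical name, not part of CAGE, and it only checks that each executable can be found, not that its version is suitable.

```shell
# Print the name of every tool that cannot be found on PATH.
# missing_tools is an illustrative helper, not a CAGE command.
missing_tools() {
  for tool in "$@"; do
    # command -v exits non-zero when the name cannot be resolved.
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

For this setup, `missing_tools git git-lfs docker uv` should print nothing. If Docker Compose is installed as the Docker CLI plugin, `docker compose version` is a separate check for it.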
Register a GPT model:
export OPENAI_API_KEY=...
cage model set gpt-5.5 \
--provider openai \
--model gpt-5.5 \
--endpoint https://api.openai.com/v1 \
--api-key '${OPENAI_API_KEY}'
Or a Claude model:
export ANTHROPIC_API_KEY=...
cage model set claude-opus \
--provider anthropic \
--model claude-opus-4-7 \
--endpoint https://api.anthropic.com \
--api-key '${ANTHROPIC_API_KEY}'
Full model-registry details are in models.md.
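Because the `--api-key` values above are single-quoted, the shell passes the placeholder text `${OPENAI_API_KEY}` itself rather than the key. This suggests the variable is resolved later, when a run starts, although that is an inference from the commands rather than something stated here. Under that assumption, a small pre-flight check can catch an unset key early. `require_env` is a hypothetical helper, not a CAGE command.

```shell
# Return non-zero if any named variable is unset or empty.
# require_env is an illustrative helper, not a CAGE command.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A typical use would be `require_env OPENAI_API_KEY ANTHROPIC_API_KEY` in the same shell session that later launches a run.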
CAGE ships two AgentPentestBench datasets as git submodules. The submodules include a small bundled subset for smoke tests. The full datasets can be fetched separately from Hugging Face.
| Benchmark | GitHub (subset) | Hugging Face (full set) |
|---|---|---|
| WebExploitBench | WebExploitBench (comfyui, dataease, prestashop) | datasets/WebExploitBench |
| PostExploitBench | PostExploitBench (range-4, range-6) | datasets/PostExploitBench |
Make sure Git LFS is enabled before pulling the submodules. Otherwise, large target assets such as jars and archives may be checked out as LFS pointer files.
git lfs install
git submodule update --init --recursive \
examples/agent_pentest_bench/datasets/web_exploit_bench \
examples/agent_pentest_bench/datasets/post_exploit_bench
# fetch the actual LFS binaries (jars/zips) — submodule update skips these:
git -C examples/agent_pentest_bench/datasets/web_exploit_bench lfs pull
git -C examples/agent_pentest_bench/datasets/post_exploit_bench lfs pull
Install the Hugging Face CLI first if you have not already:
uv pip install huggingface_hub
Then use scripts/fetch to fetch the full WebExploitBench and PostExploitBench datasets:
hf auth login
examples/agent_pentest_bench/datasets/web_exploit_bench/scripts/fetch
examples/agent_pentest_bench/datasets/post_exploit_bench/scripts/fetch
The fetch scripts only add data on top of the existing checkout and are safe to re-run.
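To confirm that no asset was left as an LFS pointer file after the steps above, one option is to scan a dataset directory for small files whose first line is the Git LFS pointer header. This is a sketch, not part of CAGE: `find_lfs_pointers` is a hypothetical helper, and it assumes pointer files are under 1 KiB, which holds for standard Git LFS pointers.

```shell
# Print every file under a directory that is still a Git LFS pointer.
# find_lfs_pointers is an illustrative helper, not a CAGE command.
find_lfs_pointers() {
  # Only small files can be pointers; skip anything inside a .git directory.
  find "$1" -type f -size -1024c ! -path '*/.git/*' | while IFS= read -r f; do
    # A pointer file starts with a line naming the LFS spec version.
    if head -n 1 "$f" | grep -q '^version https://git-lfs.github.com/spec/v1'; then
      printf '%s\n' "$f"
    fi
  done
}
```

Running `find_lfs_pointers examples/agent_pentest_bench/datasets/web_exploit_bench` should print nothing once the binaries are in place. Any path it does print is a candidate for another `lfs pull`.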
Build the agent images:
cage agent build --agent codex --variant pentestenv
cage agent build --agent claude_code --variant pentestenv
Prebuild all benchmark targets:
cage benchmark build web_exploit_bench --max-concurrent 4
cage benchmark build post_exploit_bench --max-concurrent 4
You can also build a single benchmark sample, which is useful for smoke tests, or for retrying a specific target if it fails during a batch build:
cage benchmark build web_exploit_bench --sample pb-comfyui
cage benchmark build post_exploit_bench --sample pb-postexp-range-4
Note: Building agent images and benchmark targets can take a while, especially on the first run, since Docker images and target assets may need to be downloaded and built. When many targets are built concurrently, a small number of samples may fail due to transient Docker, network, or resource issues. In that case, rerun cage benchmark build with --sample to rebuild the failed sample only.
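The rebuild-on-failure advice in the note can be scripted. The sketch below retries `cage benchmark build ... --sample` for each named sample a few times. `rebuild_samples`, `MAX_TRIES`, and `RETRY_DELAY` are hypothetical names introduced for illustration, and the loop assumes the build command exits non-zero when a sample fails to build.

```shell
# Rebuild each named sample, retrying on failure.
# Usage: rebuild_samples <benchmark> <sample-id>...
# MAX_TRIES and RETRY_DELAY are illustrative knobs, not CAGE settings.
rebuild_samples() {
  bench="$1"; shift
  max_tries="${MAX_TRIES:-3}"
  failed=""
  for sample in "$@"; do
    try=1
    # Assumes a non-zero exit status means the sample did not build.
    until cage benchmark build "$bench" --sample "$sample"; do
      if [ "$try" -ge "$max_tries" ]; then
        failed="$failed $sample"
        break
      fi
      try=$((try + 1))
      sleep "${RETRY_DELAY:-10}"
    done
  done
  if [ -n "$failed" ]; then
    echo "still failing:$failed" >&2
    return 1
  fi
}
```

For example, `rebuild_samples web_exploit_bench pb-comfyui` makes up to three attempts at that one target before reporting it as still failing.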
Default full runs use the benchmark config as-is:
cage run web_exploit_bench --agent codex --model gpt-5.5
cage run post_exploit_bench --agent codex --model gpt-5.5
# Evaluate a different agent/model pair:
cage run web_exploit_bench --agent claude_code --model claude-opus
For single-sample smoke runs, note that sample IDs follow the patterns `pb-<web-target>` and `pb-postexp-<range>`. The two samples below use bundled targets, so they work without the Hugging Face fetch:
# Default configs already set benchmark-level concurrency; pass `--max-concurrent N` only to lower the selected agent/model cap.
cage run web_exploit_bench \
--agent codex \
--model gpt-5.5 \
--sample pb-comfyui \
--prompt-level l0 \
--passk 1 \
--max-concurrent 1 \
--run-id web-smoke-001
cage run post_exploit_bench \
--agent codex \
--model gpt-5.5 \
--sample pb-postexp-range-4 \
--prompt-level l0 \
--passk 1 \
--max-concurrent 1 \
--run-id post-smoke-001
Prompt levels control how much task information the agent receives. For web tasks, hints may reveal vulnerability location or type. For post-exploitation tasks, hints may reveal topology or services:
- l0: no hints
- l1: partial hints
- l2: stronger hints
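To compare how hints affect a single target, the same smoke run can be repeated once per prompt level. The sketch below reuses only the flags shown above. `sweep_prompt_levels` and the run-ID naming scheme are hypothetical, and the loop assumes each `cage run` returns once its run has finished.

```shell
# Run one sample at every prompt level, one named run per level.
# Usage: sweep_prompt_levels <benchmark> <agent> <model> <sample>
sweep_prompt_levels() {
  bench="$1" agent="$2" model="$3" sample="$4"
  for level in l0 l1 l2; do
    # The run-ID scheme is an illustrative assumption, not a CAGE convention.
    cage run "$bench" \
      --agent "$agent" \
      --model "$model" \
      --sample "$sample" \
      --prompt-level "$level" \
      --passk 1 \
      --max-concurrent 1 \
      --run-id "${sample}-${level}-001" || return 1
  done
}
```

Calling `sweep_prompt_levels web_exploit_bench codex gpt-5.5 pb-comfyui` would produce three runs, one each for l0, l1, and l2, and stop at the first level whose run fails.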
By default, cage run starts the browser inspector automatically. After the run completes, inspect the results in the browser.
To continue a named run, pass the same --run-id with --resume:
cage run web_exploit_bench --run-id web-smoke-001 --resume
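For unattended runs, the resume command can be wrapped so that an interrupted run is continued automatically. This is a sketch under one loud assumption: that `cage run` exits non-zero when a run does not complete, which is not stated here. `run_with_resume` and `MAX_RESUMES` are hypothetical names.

```shell
# Start a named run, then resume it a limited number of times on failure.
# Usage: run_with_resume <benchmark> <run-id> [extra cage run flags...]
# Assumes a non-zero exit status means the run did not complete.
run_with_resume() {
  bench="$1" run_id="$2"; shift 2
  # First attempt carries the caller's flags (agent, model, sample, ...).
  cage run "$bench" --run-id "$run_id" "$@" && return 0
  tries=0
  while [ "$tries" -lt "${MAX_RESUMES:-2}" ]; do
    tries=$((tries + 1))
    # Later attempts only name the run and ask CAGE to continue it.
    cage run "$bench" --run-id "$run_id" --resume && return 0
  done
  return 1
}
```

For example, `run_with_resume web_exploit_bench web-smoke-001 --agent codex --model gpt-5.5` starts the run and resumes it up to twice before giving up.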
You mostly type one command, cage run. A run is the framework executing a
benchmark under your config:
cage run  =  Framework  (  Benchmark  ,  Config  )
            └ Layer 1 ┘   └ Layer 2 ┘ └ Layer 3 ┘
            the engine      what to   this run's
              (fixed)      evaluate      knobs
- Layer 1: Framework (`cage/`) owns the run mechanism: container, proxy, target, scoring, resume. It never knows a benchmark name.
- Layer 2: Benchmark (`examples/<name>/`) supplies what is evaluated: samples, prompts, targets, scorer.
- Layer 3: You supply how this run goes: an experiment YAML plus CLI flags.
Every other command is a slice of cage run, and every YAML field parameterizes
one of its steps. The full lifecycle is in
How a Run Works.
CAGE is benchmark-agnostic; these security benchmarks ship as example packages:
| Benchmark | What it evaluates |
|---|---|
| AgentPentestBench | Web exploitation (WebExploitBench) and multi-host post-exploitation ranges (PostExploitBench). The release-facing example, with full datasets on Hugging Face. |
| CVEBench | Whether an agent can exploit a known CVE in a live target. |
| NYU CTF | CTF-style capture-the-flag tasks. |
| AutoPenBench | Automated penetration-testing tasks. |
| HackWorld | Web CTF tasks. |
| StrongREJECT | Safety / refusal behavior (no live target). |
Adding your own is a new examples/<name>/ package — see
Writing Benchmarks. The framework (cage/)
never changes.
- Quick Start: fresh checkout to one inspected trial.
- How a Run Works: the run lifecycle and runtime internals.
- The CLI: every command as a slice of `cage run`.
- Running Experiments and Operations: scaling, resume, scoring, cleanup.
- Writing Benchmarks and Contributing: extend CAGE.
