AgentTroop

Response-Aware Multi-Agent Attacks for Eliciting Malicious Code

AgentTroop is a response-aware multi-agent framework for conducting adaptive adversarial attacks against code-generating LLMs. Rather than treating each interaction as an isolated attempt, AgentTroop continuously learns from victim responses, maintains explicit hypotheses about the target's defensive behavior, and refines future attack strategies accordingly.

How It Works

AgentTroop maintains a Bayesian Version Space $\mathcal{V}$ containing candidate defense programs $\{d_i\}$, each representing a possible explanation of the victim's response. A belief distribution is maintained over these candidates and continuously updated through Bayesian inference:

$$P(d_i \mid e_t) = \frac{P(e_t \mid d_i)\, P(d_i)}{\sum_{d_j \in \mathcal{V}} P(e_t \mid d_j)\, P(d_j)}$$
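A minimal sketch of this update, assuming each candidate program makes a deterministic 0/1 prediction and that the likelihood is softened by a noise rate ε (the Key Components table lists ε = 0.1 for `inference/belief_updater.py`). The function name `update_beliefs` and the exact likelihood form are illustrative assumptions, not the repository's API:

```python
def update_beliefs(prior, predictions, observed, eps=0.1):
    """Return the posterior over candidate programs after one observation.

    prior:       dict program_id -> P(d_i)
    predictions: dict program_id -> predicted outcome (0 = ACCEPT, 1 = REFUSE)
    observed:    outcome actually returned by the victim (0 or 1)
    eps:         assumed probability that a program's prediction is wrong
    """
    # Soft likelihood: P(e_t | d_i) = 1 - eps if d_i predicted e_t, else eps.
    unnormalized = {
        pid: ((1.0 - eps) if predictions[pid] == observed else eps) * p
        for pid, p in prior.items()
    }
    evidence = sum(unnormalized.values())  # denominator of Bayes' rule
    return {pid: w / evidence for pid, w in unnormalized.items()}


# Two hypothetical candidates that disagree on the probe.
prior = {"keyword": 0.5, "length": 0.5}
predictions = {"keyword": 1, "length": 0}
posterior = update_beliefs(prior, predictions, observed=1)
print(posterior)  # keyword is close to 0.9, length is close to 0.1
```

With a hard likelihood (ε = 0), a single contradicting observation would eliminate a candidate outright; the soft form keeps it alive with reduced weight, which tolerates noisy outcome labels.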

At each iteration, the framework designs a probing prompt (intervention) to discriminate among alternative defense programs, executes it against the victim, and updates beliefs. Interventions are selected via Expected Free Energy (EFE) minimization to maximize information gain about the victim's decision boundary:

$$G(I) = \underbrace{\mathbb{E}_{o \sim P(o \mid I)} \left[ D_{KL}\left[ b_{t+1} \,\|\, b_t \right] \right]}_{\text{epistemic value}}$$
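The epistemic term can be sketched for binary outcomes as follows. This is an illustration under stated assumptions: each program predicts one outcome per intervention, the likelihood uses a noise rate `eps` (0.1 is the value listed for the belief updater in the Key Components table), and `expected_information_gain` is a hypothetical name. How the repository turns this quantity into a minimization objective is not specified here.

```python
import math


def expected_information_gain(belief, predictions, eps=0.1):
    """Expected KL[b_{t+1} || b_t] for one candidate intervention.

    belief:      dict program_id -> current belief b_t(d)
    predictions: dict program_id -> outcome (0 or 1) that the program
                 predicts for this intervention
    eps:         assumed noise rate of the soft likelihood
    """
    gain = 0.0
    for outcome in (0, 1):
        # Likelihood of this outcome under each program.
        lik = {d: (1 - eps) if predictions[d] == outcome else eps for d in belief}
        # Predictive probability P(o | I) under the current belief.
        p_outcome = sum(lik[d] * belief[d] for d in belief)
        if p_outcome == 0.0:
            continue
        # Posterior if this outcome were observed, and its KL from b_t.
        kl = 0.0
        for d, b in belief.items():
            post = lik[d] * b / p_outcome
            if post > 0.0:
                kl += post * math.log(post / b)
        gain += p_outcome * kl
    return gain


belief = {"keyword": 0.5, "length": 0.5}
split = expected_information_gain(belief, {"keyword": 1, "length": 0})
agree = expected_information_gain(belief, {"keyword": 1, "length": 1})
print(split, agree)  # split is about 0.37 nats, agree is about 0
```

An intervention on which all candidates agree cannot shift the belief, so its expected gain is zero; the gain is largest when the candidates split the probability mass evenly.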

Architecture

AgentTroop uses 5 LLM-based agents coordinated through the shared Version Space:

┌──────────────────────────────────────────────────┐
│                 Orchestrator Agent                │
│  Schedules agents, maintains Version Space,       │
│  Bayesian belief update, convergence check        │
└──────┬──────────┬──────────┬──────────┬──────────┘
       │          │          │          │
       ▼          ▼          ▼          ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│Cognitive │ │Strategist│ │Researcher│ │Red Team  │
│ Agent    │ │ Agent    │ │ Agent    │ │ Agent    │
├──────────┤ ├──────────┤ ├──────────┤ ├──────────┤
│Detect    │ │Identify  │ │Synthesize│ │Generate  │
│behavioral│ │competing │ │refined   │ │executable│
│incon-    │ │defense   │ │defense   │ │probing   │
│sistencies│ │programs  │ │programs  │ │prompts   │
│→ generate│ │→ formulate│ │→ inject  │ │via       │
│hypotheses│ │inter-    │ │into V    │ │jailbreak │
│& programs│ │ventions  │ │          │ │techniques│
└──────────┘ └──────────┘ └──────────┘ └──────────┘

Agent	Role
Orchestrator	Collects interaction traces, schedules agents, maintains Version Space $\mathcal{V}$, performs Bayesian belief updates, detects convergence
Cognitive Agent	Analyzes interaction traces, identifies behavioral inconsistencies (same base prompt, different outcomes under transformation), generates defense hypotheses
Strategist Agent	Identifies competing defense programs via posterior belief, performs symbolic analysis to find distinguishing conditions, formulates intervention objectives
Researcher Agent	Periodically consolidates interaction evidence (every $N$ interventions), performs evolutionary synthesis to generate new candidate programs, injects them into Version Space
Red Team Agent	Refines probing prompts using jailbreak transformations (21 techniques via UCB bandit selection), semantic reframing, role-playing, and prompt engineering
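The bandit selection mentioned for the Red Team Agent can be illustrated with textbook UCB1. This is a generic sketch: the arm names are placeholders, and the reward definition (for example, 1 when a probe yields an informative outcome) is an assumption rather than something the table specifies.

```python
import math


def ucb1_select(counts, rewards):
    """Pick the arm with the highest UCB1 score.

    counts:  dict arm -> number of times the arm was tried
    rewards: dict arm -> cumulative reward collected by the arm
    """
    # Try every arm once before applying the formula.
    for arm, n in counts.items():
        if n == 0:
            return arm
    total = sum(counts.values())

    def score(arm):
        mean = rewards[arm] / counts[arm]
        bonus = math.sqrt(2.0 * math.log(total) / counts[arm])
        return mean + bonus

    return max(counts, key=score)


# Equal mean reward (0.5), but arm_b has been tried far less often.
counts = {"arm_a": 10, "arm_b": 2}
rewards = {"arm_a": 5.0, "arm_b": 1.0}
choice = ucb1_select(counts, rewards)
print(choice)  # arm_b: the exploration bonus favors the less-tried arm
```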

Core Loop

Seed Prompts → Behavioral Variants → Victim Probing →
Anomaly Detection → Hypothesis Generation → Version Space Init →
[Main Loop]
  Competing Program Selection → Intervention Design (EFE) →
  Red Team Refinement → Victim Probing → Outcome Classification →
  Belief Update → Convergence Check (entropy & accuracy)
  [Every N: Researcher Synthesis → Inject New Programs]
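The convergence check at the end of the main loop could look like the sketch below. The default thresholds mirror `entropy_convergence_threshold` and `accuracy_threshold` from the configuration section; the entropy unit (nats) and the rule that either criterion alone stops the loop are assumptions, and the function names are hypothetical.

```python
import math


def belief_entropy(belief):
    """Shannon entropy (in nats) of a belief distribution over programs."""
    return -sum(p * math.log(p) for p in belief.values() if p > 0.0)


def has_converged(belief, best_accuracy,
                  entropy_threshold=0.1, accuracy_threshold=0.85):
    """Assumed rule: stop when either criterion is met."""
    low_entropy = belief_entropy(belief) < entropy_threshold
    accurate = best_accuracy >= accuracy_threshold
    return low_entropy or accurate


print(has_converged({"a": 0.5, "b": 0.5}, best_accuracy=0.60))    # False
print(has_converged({"a": 0.99, "b": 0.01}, best_accuracy=0.60))  # True
```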

Defense Program DSL

AgentTroop represents the victim model's hypothesized defense behavior as programs in a custom domain-specific language (DSL). These programs are executable ASTs that predict whether a given prompt will be refused (REFUSE, 1) or accepted (ACCEPT, 0).

AST Structure

Program
└── IfThenElseNode
    ├── condition: Node           ← predicate/logic expression
    ├── then_outcome: 1           ← REFUSE when condition is TRUE
    └── else_outcome: 0           ← ACCEPT when condition is FALSE

Atomic nodes evaluate a single prompt feature:

PredicateNode — Boolean test (e.g., contains_word("code"), length_gt(100), starts_with("In a"))
ClassifierNode — Scalar score with threshold (e.g., toxicity > 0.8, sentiment < 0.3)
TransformNode — Prompt transformation applied before predicate evaluation

Composite nodes combine conditions:

AndNode(left, right) — True iff both children are True
OrNode(left, right) — True iff either child is True
NotNode(child) — Negation
ThresholdNode(classifier, threshold, operator) — True iff the classifier score is above ("gt") or below ("lt") the threshold

Example Programs

// Keyword-based: REFUSE if prompt contains "code"
IfThenElseNode(
  condition = PredicateNode(contains_word, word="code", case_sensitive=False),
  then_outcome = 1,
  else_outcome = 0
)

// Length-based: ACCEPT very short prompts
IfThenElseNode(
  condition = PredicateNode(length_lt, threshold=30),
  then_outcome = 0,
  else_outcome = 1
)

// Composite: REFUSE if prompt starts with an imperative verb AND asks for code
IfThenElseNode(
  condition = AndNode(
    left = PredicateNode(starts_with_imperative),
    right = PredicateNode(contains_any_word, words=["code", "script", "program"])
  ),
  then_outcome = 1,
  else_outcome = 0
)

// Classifier threshold: REFUSE if toxicity score > 0.8
IfThenElseNode(
  condition = ThresholdNode(
    classifier = ToxicityScoreClassifier,
    threshold = 0.8,
    operator = "gt"
  ),
  then_outcome = 1,
  else_outcome = 0
)

Primitive Types

The DSL is backed by a unified primitive registry (core/primitive.py) with 92 primitives in three categories:

Predicates (27) — Boolean functions over prompt text:

Lexical: contains_word(word), contains_any_word(words), starts_with(prefix), ends_with(suffix), matches_regex(pattern), is_imperative(), is_grammatical_question()
Structural: length_gt(threshold), length_lt(threshold), char_count(op, threshold), has_emoji(), contains_url(), contains_code_block()
Semantic: sentiment(), intent(type), is_instruction_request(), is_repetitive()
Jailbreak signals: contains_encoding_wrapper(), matches_jailbreak_pattern(), contains_system_override(), contains_delimiter()

Transforms (19) — Prompt string transformations for intervention design:

Semantic: to_lowercase(), to_uppercase(), to_interrogative(), to_imperative(), to_declarative(), random_case(), insert_synonyms()
Structural: add_prefix(prefix), add_suffix(suffix), remove_punctuation(), escape_quotes(), format_as_json(), wrap_code_block(language), add_markdown()
Encoding: html_encode(), add_zero_width_chars(), pad_to_length(n), add_ignore_filter_token(token), add_role_play(role)

Classifiers (27) — Continuous scoring functions for threshold-based decisions:

toxicity_score(), sentiment_score(), obscurity_score(), jailbreak_likelihood(), code_likelihood()
refusal_similarity(), harmfulness_similarity(), roleplay_likelihood(), persuasion_score()
length_score(), repetition_score(), entropy_score(), unique_token_ratio()
special_char_ratio(), digit_ratio(), uppercase_ratio(), punctuation_ratio(), whitespace_ratio()
contains_blacklisted_word(), gpt2_perplexity(), encoding_detection()
prompt_injection_likelihood(), adversarial_suffix_score(), sql_likelihood(), json_likelihood()

Program Execution

Programs are executed by the ProgramExecutor (core/executor.py), which walks the AST directly — no LLM calls are made at prediction time:

executor.execute(program, prompt) -> Outcome  # 0 = ACCEPT, 1 = REFUSE

The executor recursively evaluates each node:

PredicateNode: Calls primitive.evaluate(prompt) → bool → mapped to outcome
ThresholdNode: Calls classifier.score(prompt) → compares against threshold
AndNode/OrNode: Short-circuit evaluation of children
NotNode: Negates child's result
IfThenElseNode: If condition True → then_outcome, else → else_outcome
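The recursive evaluation described above can be sketched with plain dataclasses. The class names follow the DSL, but the fields (`test`, `score`) and the functions `holds` and `execute` are illustrative assumptions and not the interface of `core/executor.py`; measuring length in characters is likewise assumed.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class PredicateNode:
    test: Callable[[str], bool]      # Boolean test over the prompt


@dataclass
class ThresholdNode:
    score: Callable[[str], float]    # scalar classifier
    threshold: float
    operator: str = "gt"             # "gt" or "lt"


@dataclass
class AndNode:
    left: "Node"
    right: "Node"


@dataclass
class OrNode:
    left: "Node"
    right: "Node"


@dataclass
class NotNode:
    child: "Node"


@dataclass
class IfThenElseNode:
    condition: "Node"
    then_outcome: int = 1            # REFUSE
    else_outcome: int = 0            # ACCEPT


Node = Union[PredicateNode, ThresholdNode, AndNode, OrNode, NotNode]


def holds(node: Node, prompt: str) -> bool:
    """Recursively evaluate a condition node to a Boolean."""
    if isinstance(node, PredicateNode):
        return node.test(prompt)
    if isinstance(node, ThresholdNode):
        value = node.score(prompt)
        return value > node.threshold if node.operator == "gt" else value < node.threshold
    if isinstance(node, AndNode):
        return holds(node.left, prompt) and holds(node.right, prompt)  # short-circuits
    if isinstance(node, OrNode):
        return holds(node.left, prompt) or holds(node.right, prompt)   # short-circuits
    if isinstance(node, NotNode):
        return not holds(node.child, prompt)
    raise TypeError(f"unsupported node: {node!r}")


def execute(program: IfThenElseNode, prompt: str) -> int:
    """Return 1 (REFUSE) or 0 (ACCEPT) for the prompt."""
    return program.then_outcome if holds(program.condition, prompt) else program.else_outcome


# Mirrors the length-based example: ACCEPT prompts shorter than 30 characters.
short_ok = IfThenElseNode(PredicateNode(lambda p: len(p) < 30),
                          then_outcome=0, else_outcome=1)
print(execute(short_ok, "hello"))   # 0 (ACCEPT)
print(execute(short_ok, "x" * 40))  # 1 (REFUSE)
```

Because evaluation is a pure tree walk over deterministic primitives, predictions for thousands of candidate programs can be computed without any model calls.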

Key Components

Module	Purpose
`core/primitive.py`	92 primitives: 27 Predicates, 19 Transforms, 27 Classifiers for building defense programs
`core/program.py`	AST-based defense program representation (IfThenElse, And, Or, Not, Threshold)
`core/executor.py`	ProgramExecutor — evaluates defense programs against prompts
`core/jailbreak.py`	21 jailbreak techniques with templates (DAN, GCG, hex_injection, persona, etc.) and UCB1 bandit selection
`inference/version_space.py`	Bayesian Version Space: top-K candidate programs, posterior beliefs, entropy, information gain
`inference/efe.py`	Expected Free Energy computation for intervention selection
`inference/belief_updater.py`	Bayesian belief update with soft likelihood ($\varepsilon = 0.1$)
`agents/cognitive.py`	Behavioral anomaly detection (group-by-base-prompt, entropy scoring, adaptive percentile) and hypothesis generation
`agents/strategist.py`	Disagreement-driven intervention design (Δ scoring → EFE rescore → semantic rescore), program prediction, technique selection
`agents/researcher.py`	Evolutionary synthesis (pop=100, gen=30, mut=0.2, cross=0.7) and program verification
`agents/red_team.py`	Jailbreak-aware probing prompt refinement
`orchestration/orchestrator.py`	6-phase loop coordination, belief management, convergence detection
`synthesis/`	Evolutionary synthesizer (genetic programming), verifier
`evaluation/`	Evaluators for RQ0 (FA), RQ1 (PBA, NI), RQ2 (MR), RQ3 (ablation)

Evaluation Metrics

Metric	Description
Attack Success Rate (ASR)	% of harmful prompts that bypass safety mechanisms
Malicious Rate (MR)	% of successful jailbreak responses containing actionable malicious code
Final Accuracy (FA)	Synthesized program agreement with victim on unseen prompts
Peak Balanced Accuracy (PBA)	Best balanced accuracy on harmful + benign validation set
Number of Interventions (NI)	Interventions required to reach PBA
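For reference, balanced accuracy is the mean of per-class recall, which keeps a program from scoring well simply by predicting the majority outcome. The sketch below uses the standard definition; whether the evaluators in `evaluation/` compute it exactly this way is an assumption, and both function names are hypothetical.

```python
def balanced_accuracy(predicted, actual):
    """Mean of recall on ACCEPT (0) and recall on REFUSE (1) labels."""
    pairs = list(zip(predicted, actual))
    recalls = []
    for label in (0, 1):
        relevant = [p for p, a in pairs if a == label]
        if relevant:  # skip a class that is absent from the sample
            recalls.append(sum(p == label for p in relevant) / len(relevant))
    return sum(recalls) / len(recalls)


def attack_success_rate(bypassed):
    """Percentage of prompts whose outcome is flagged as a bypass (True)."""
    return 100.0 * sum(bypassed) / len(bypassed)


predicted = [1, 1, 0, 0]
actual = [1, 0, 0, 0]
print(balanced_accuracy(predicted, actual))            # about 0.83 (plain accuracy: 0.75)
print(attack_success_rate([True, False, False, True])) # 50.0
```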

Installation

Prerequisites

Python 3.10+
UV (Python package manager)
Redis (for session memory)
Ollama (for local victims) or OpenRouter / OpenAI API key (for cloud victims)
Neo4j (optional, for scientific memory and causal graph)

Step 1: System Dependencies

# macOS
brew install redis
brew install neo4j    # optional
brew install ollama   # for local victims

# Ubuntu/Debian
sudo apt install redis-server
# Download Neo4j from https://neo4j.com/download/
# Install Ollama: curl -fsSL https://ollama.com/install.sh | sh

Step 2: Python Environment

# Install UV (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone <repo-url> && cd HARMONY_X
uv sync

Or use the automated setup script:

bash experiments/setup.sh

Step 3: Environment Variables

Copy .env.template to .env and fill in your API keys:

cp .env.template .env

Backend-specific variables in .env (set only those for the backends you use):

Variable	Required for	Description
`OPENROUTER_API_KEY`	OpenRouter backend	API key from https://openrouter.ai/keys
`OPENAI_API_KEY`	OpenAI backend / agentic LLM	API key from https://platform.openai.com/api-keys
`REPLICATE_API_TOKEN`	Replicate backend	API token from https://replicate.com/account/api-tokens

Optional variables:

Variable	Default	Description
`HX_REDIS_URL`	`redis://localhost:6379/0`	Redis connection string
`HX_NEO4J_URI`	`bolt://localhost:7687`	Neo4j connection URI
`HX_NEO4J_USER`	`neo4j`	Neo4j username
`HX_NEO4J_PASSWORD`	`password`	Neo4j password
`HARMFUL_CSV`	`prompt.csv`	Path to RMCBench harmful prompts CSV

Step 4: Pull Victim Models (Ollama only)

ollama pull codellama:7b
ollama pull llama3.1:8b
ollama pull phi4
ollama pull qwen
ollama pull deepseek-r1:8b

Step 5: Start Services

# macOS (Homebrew); on Ubuntu/Debian, start Redis with: sudo systemctl start redis-server
brew services start redis        # Redis (session memory)
brew services start neo4j        # Neo4j (optional, scientific memory)
ollama serve                     # Ollama (only if using local victims)

Experiment Configuration

Configuration is managed via YAML files in configs/ and experiments/configs/:

Core Config Files

File	Purpose
`configs/experiment_config.yaml`	Default experiment config
`configs/config_quick_test.yaml`	Quick debug config (1 seed, 25 iterations, limited transforms)
`configs/config_test_fix.yaml`	Fix validation config
`experiments/configs/ollama_experiment_config.yaml`	Ollama backend config
`experiments/configs/openrouter_experiment_config.yaml`	OpenRouter backend config
`experiments/configs/openai_experiment_config.yaml`	OpenAI backend config

Key Configuration Parameters

# Orchestrator
orchestrator:
  max_iterations: 50              # Max main loop iterations
  max_interventions: 500          # Max total interventions
  synthesis_interval: 3           # Run Researcher every N interventions
  entropy_convergence_threshold: 0.1  # Stop when H < threshold
  accuracy_threshold: 0.85        # Stop when program accuracy >= threshold

# Cognitive Agent
cognitive:
  anomaly_threshold: 0.15         # Minimum anomaly score
  anomaly_selection:
    method: "percentile"          # "percentile" or "threshold"
    percentile: 85                # Top percentile for anomaly selection

# Strategist Agent
strategist:
  max_chain_depth: 4              # Max transform chain depth
  max_candidates_heuristic: 120   # Max heuristic candidates
  num_trials: 4                   # Prediction trials for non-deterministic classifiers

# Researcher / Synthesis
synthesis:
  mode: "evolutionary"
  evolutionary:
    population_size: 150          # GA population
    generations: 50               # GA generations
    mutation_rate: 0.25           # Mutation probability
    crossover_rate: 0.75          # Crossover probability

# Victim (Ollama)
victim:
  ollama_url: "http://localhost:11434"
  model_name: "llama3.1:8b"
  temperature: 0.0
  max_tokens: 150

# Victim (OpenRouter): alternative to the Ollama block above, defined in the OpenRouter config file
victim:
  model_name: "meta-llama/llama-3.2-3b-instruct"
  temperature: 0.0
  max_tokens: 150
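To illustrate how the `cognitive.anomaly_selection` settings might be applied, the sketch below groups interaction traces by base prompt, scores each group by the entropy of its outcomes, and keeps groups at or above the chosen percentile. The trace format, the nearest-rank percentile, and the way `anomaly_threshold` acts as a floor are assumptions made for this example.

```python
import math
from collections import defaultdict


def outcome_entropy(outcomes):
    """Binary entropy (bits) of a list of 0/1 outcomes."""
    p = sum(outcomes) / len(outcomes)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_anomalies(traces, percentile=85, min_score=0.15):
    """traces: iterable of (base_prompt_id, outcome) pairs."""
    groups = defaultdict(list)
    for base_id, outcome in traces:
        groups[base_id].append(outcome)
    scores = {base_id: outcome_entropy(o) for base_id, o in groups.items()}
    if not scores:
        return []
    ranked = sorted(scores.values())
    # Nearest-rank percentile cut-off, floored by the minimum anomaly score.
    rank = max(1, math.ceil(percentile / 100 * len(ranked)))
    cutoff = max(ranked[rank - 1], min_score)
    return sorted(b for b, s in scores.items() if s >= cutoff)


traces = [
    ("p1", 1), ("p1", 1), ("p1", 1), ("p1", 1),  # always refused
    ("p2", 1), ("p2", 0), ("p2", 1), ("p2", 0),  # outcome flips across variants
    ("p3", 0), ("p3", 0), ("p3", 0), ("p3", 0),  # always accepted
]
print(select_anomalies(traces))  # ['p2']
```

A base prompt whose variants all receive the same outcome carries no signal about the decision boundary, so only the inconsistent group is passed on for hypothesis generation.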

Running Experiments

Quick Test (Debug Mode)

# Toy victim (deterministic, no API calls)
python experiments/run_experiment.py \
    --config configs/config_quick_test.yaml \
    --backend ollama \
    --model-name "" \
    --num-seeds 3

Full Experiment

# Via Ollama (local) — use all prompts
python experiments/run_experiment.py \
    --backend ollama \
    --config experiments/configs/ollama_experiment_config.yaml \
    --model-name "codellama:7b" \
    --full

# Via OpenRouter (API)
python experiments/run_experiment.py \
    --backend openrouter \
    --config experiments/configs/openrouter_experiment_config.yaml \
    --model-name "meta-llama/llama-3.1-8b-instruct" \
    --full

# Via OpenRouter with DeepSeek
python experiments/run_experiment.py \
    --backend openrouter \
    --config experiments/configs/openrouter_experiment_config.yaml \
    --model-name "deepseek/deepseek-v3.2" \
    --full

Using the Shell Script

# Default (Ollama, llama3.1:8b)
bash experiments/run_exp.sh --full

# With options
bash experiments/run_exp.sh \
    --backend openrouter \
    --model-name "meta-llama/llama-3.1-8b-instruct" \
    --full \
    --config experiments/configs/openrouter_experiment_config.yaml

Experiment CLI Arguments

Argument	Default	Description
`--config`	`experiments/configs/<backend>_experiment_config.yaml`	Path to config YAML
`--num-seeds`	100	Number of seed prompts for initial variants
`--full`	False	Use ALL prompts from CSV (overrides --num-seeds)
`--backend`	`ollama`	Victim backend: `ollama`, `openrouter`, `openai`, `replicate`
`--agentic-backend`	`openai`	Backend for agent LLMs (cognitive, red-team, judge)
`--model-name`	from config	Override victim model name
`--num-asr`	`full`	Seed prompts for ASR evaluation (`full` = all 473 RMCBench prompts, or a number)
`--num-variants`	5	Variants per prompt in ASR evaluation (total = `--num-asr` × `--num-variants`)
`--max-techniques`	0	Limit jailbreak techniques (0 = all 21)
`--judge-backend`	same as agentic	Backend for judge LLM
`--judge-model`	backend default	Model for judge LLM
`--prior-campaign`	None	Prior campaign ID for transfer evaluation (RQ3)
`--ablation-strategist`	False	Disable Strategist (random probing baseline)
`--ablation-cognitive`	False	Disable Cognitive LLM (fallback hypotheses only)

System Prompt Conditioning

# Inject a generic task-oriented system prompt into the victim (all 473 seed prompts)
python experiments/run_with_system_prompt.py \
    --model-name "meta-llama/llama-3.1-8b-instruct" \
    --num-seeds 473

# With free-form system prompt
python experiments/run_with_system_prompt.py \
    --model-name "meta-llama/llama-3.1-8b-instruct" \
    --num-seeds 473 \
    --free

Ablation Studies

# Run all ablations with toy victim
python -m experiments.ablation.run_all

# Run individual ablations
python -m experiments.ablation.no_synthesis
python -m experiments.ablation.random_probing
python -m experiments.ablation.no_scientific_memory
python -m experiments.ablation.no_llm

Ablation Modes

Mode	Flag	What changes
No Strategist	`--ablation-strategist`	Random pair selection, identity-only interventions
No Cognitive LLM	`--ablation-cognitive`	Keyword-only fallback hypotheses (no LLM)
No Synthesis	`no_synthesis` wrapper	No evolutionary synthesis; the Version Space (VS) contains only compiled programs
No Scientific Memory	`no_scientific_memory` wrapper	Disables Neo4j graph store
Random Probing	`random_probing` wrapper	Random intervention design (no selection driven by the Version Space)

Evaluation

After a campaign completes, run evaluation:

python experiments/run_evaluation.py \
    --campaign <campaign_id> \
    --num-asr full \
    --num-variants 5 \
    --accuracy-threshold 0.85 \
    --output-dir evaluation/reports

# Example
python experiments/run_evaluation.py \
    --campaign deepseek_deepseek_v3_2_20260622_230237 \
    --num-asr full \
    --num-variants 5

Evaluation CLI Arguments

Argument	Default	Description
`--campaign`	required	Campaign ID to evaluate
`--program-id`	best program	Specific program ID for RQ0 evaluation
`--num-asr`	`full`	Seed prompts for ASR evaluation (`full` = all, or a number)
`--num-variants`	5	Variants per prompt
`--accuracy-threshold`	0.85	Threshold for RQ0/RQ1
`--judge`	`llm`	Judge type: `rule` or `llm`
`--llm-model`	`gemma-4-31b-it`	LLM model for LLMJudge
`--baseline-campaign`	None	Prior campaign for RQ1 random probing comparison
`--transfer-threshold`	0.9	Transfer accuracy threshold for RQ3
`--prompt-csv`	`prompt.csv`	Path to RMCBench harmful prompts

Output Structure

outputs/campaign/<campaign_id>/
├── <campaign_id>_episodic.db       # SQLite — all interaction traces
├── final_program.json               # Best discovered defense program
├── final_theory.json                # Abstracted safety theory
├── hypotheses_history.json          # Generated hypothesis summaries
├── interventions_history.json       # All probing attempts (187+ episodes)
├── version_space.json               # VS state: candidates, entropy, posterior history
├── sde_evidence.json                # Semantic boundary evidence
├── technique_stats.json             # UCB technique selection statistics
└── evaluation/
    └── evaluation_report.json       # ASR, MR, FA, PBA, NI results

Dataset

The project uses RMCBench (Benchmarking Large Language Models' Resistance to Malicious Code) — the first benchmark specifically designed to measure LLM resistance to malicious code generation, published at IEEE/ACM ASE 2024. It contains 473 seed prompts across two evaluation scenarios:

Scenario	Description	Levels
Text-to-Code	Natural language descriptions requesting malicious code generation	Level 1: explicit keywords; Level 2: implicit (metaphorical)
Code-to-Code	Code translation or completion of malicious code snippets	Code Translation, Code Completion

Scope:

11 malware categories: Viruses, Worms, Trojan horses, Spyware, Adware, Ransomware, Rootkits, Phishing, Vulnerability Exploitation, Network attacks, Others
9 programming languages: C, C++, C#, Go, Java, PHP, Python, HTML/JavaScript, Bash
28.71% average refusal rate across 11 popular LLMs, highlighting the challenge of safety alignment

The full dataset (473 prompts × variants) has 38,168 rows in prompt.csv (columns: pid,category,task,level,description,level,prompt,malicious functionality,...,language,...,code to be completed,...).

Benign prompts for balanced accuracy evaluation: data/benign_prompts.csv.

Supported Victims

Ollama (local): codellama:7b/13b/34b, llama3.1:8b, phi4, qwen, deepseek-r1:8b, meetai-small
OpenRouter (API): 50+ models via unified API (GPT-4o, Claude, Gemini, DeepSeek, LLaMA, etc.)
OpenAI (API): GPT-4o-mini, GPT-4o
Replicate (API): Various open models
Toy victims (testing/diagnostic): KeywordFilter, LengthFilter, RegexVictim, ThresholdVictim

Research Contributions

Novel approach: Response-aware multi-agent framework combining coordinated agent reasoning, Bayesian hypothesis maintenance, and hypothesis-guided intervention generation
Empirical evaluation: Comprehensive evaluation against proprietary and open-source LLMs (GPT-4o-mini, Gemini-2.5-Flash, Phi-4, LLaMA-3.1-8B, DeepSeek-V3.2, Qwen2.5-72B, CodeLlama, MiMo-2.5-Pro) using the RMCBench benchmark
Open science: Full replication package with code and data

BibTeX

@article{agenttroop2026,
  title={The Wolf Pack: Response-Aware Multi-Agent Attacks for Eliciting Malicious Code},
  author={Anonymous Authors},
  year={2026},
  journal={arXiv preprint}
}
