Response-Aware Multi-Agent Attacks for Eliciting Malicious Code
AgentTroop is a response-aware multi-agent framework for conducting adaptive adversarial attacks against code-generating LLMs. Rather than treating each interaction as an isolated attempt, AgentTroop continuously learns from victim responses, maintains explicit hypotheses about the target's defensive behavior, and refines future attack strategies accordingly.
AgentTroop maintains a Bayesian Version Space: a set of top-K candidate defense programs, each paired with a posterior belief.
At each iteration, the framework designs a probing prompt (intervention) to discriminate among alternative defense programs, executes it against the victim, and updates beliefs. Interventions are selected via Expected Free Energy (EFE) minimization to maximize information gain about the victim's decision boundary.
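The information-gain idea behind this selection step can be illustrated with a small sketch. It is not the framework's EFE implementation: it scores a candidate probe only by the mutual information between "which program is correct" and the observed outcome. It assumes each program is a plain function returning 0 (ACCEPT) or 1 (REFUSE), and that a program matches the victim with probability `1 - eps`. The names `expected_information_gain`, `select_intervention`, and `eps` are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_information_gain(programs, beliefs, prompt, eps=0.1):
    """Mutual information between the program identity and the victim outcome.

    eps is an assumed noise level: each program's prediction is taken to
    match the victim with probability 1 - eps.
    """
    p_refuse = sum(
        b * ((1 - eps) if prog(prompt) == 1 else eps)
        for prog, b in zip(programs, beliefs)
    )
    # H(outcome) minus H(outcome | program); the second term is the same
    # for every program under this noise model.
    return binary_entropy(p_refuse) - binary_entropy(eps)


def select_intervention(programs, beliefs, candidates, eps=0.1):
    """Pick the candidate probe with the highest expected information gain."""
    return max(
        candidates,
        key=lambda c: expected_information_gain(programs, beliefs, c, eps),
    )


# Two toy hypotheses about the victim (illustrative only).
def keyword_filter(prompt: str) -> int:
    return 1 if "code" in prompt.lower() else 0


def length_filter(prompt: str) -> int:
    return 1 if len(prompt) >= 30 else 0


candidates = ["hello there", "short code", "a much longer prompt that asks for code"]
best = select_intervention([keyword_filter, length_filter], [0.5, 0.5], candidates)
print(best)
```

Only the second candidate makes the two hypotheses disagree, so it is the only probe whose outcome would shift the beliefs.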
AgentTroop uses 5 LLM-based agents coordinated through the shared Version Space:
┌─────────────────────────────────────────────────────┐
│                 Orchestrator Agent                  │
│     Schedules agents, maintains Version Space,      │
│      Bayesian belief update, convergence check      │
└─────┬─────────────┬─────────────┬─────────────┬─────┘
      │             │             │             │
      ▼             ▼             ▼             ▼
┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐
│Cognitive  │ │Strategist │ │Researcher │ │Red Team   │
│Agent      │ │Agent      │ │Agent      │ │Agent      │
├───────────┤ ├───────────┤ ├───────────┤ ├───────────┤
│Detect     │ │Identify   │ │Synthesize │ │Generate   │
│behavioral │ │competing  │ │refined    │ │executable │
│incon-     │ │defense    │ │defense    │ │probing    │
│sistencies │ │programs   │ │programs   │ │prompts    │
│→ generate │ │→ formulate│ │→ inject   │ │via        │
│hypotheses │ │inter-     │ │into V     │ │jailbreak  │
│& programs │ │ventions   │ │           │ │techniques │
└───────────┘ └───────────┘ └───────────┘ └───────────┘
| Agent | Role |
|---|---|
| Orchestrator | Collects interaction traces, schedules agents, maintains Version Space |
| Cognitive Agent | Analyzes interaction traces, identifies behavioral inconsistencies (same base prompt, different outcomes under transformation), generates defense hypotheses |
| Strategist Agent | Identifies competing defense programs via posterior belief, performs symbolic analysis to find distinguishing conditions, formulates intervention objectives |
| Researcher Agent | Periodically consolidates interaction evidence (every N interventions), synthesizes refined defense programs, and injects them into the Version Space |
| Red Team Agent | Refines probing prompts using jailbreak transformations (21 techniques via UCB bandit selection), semantic reframing, role-playing, and prompt engineering |
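The Cognitive Agent's inconsistency check (same base prompt, different outcomes under transformation) can be sketched as a group-by followed by an entropy score. This is a minimal illustration under assumptions: the trace format (tuples of base prompt, transform name, outcome), the function names, and the `min_score` cutoff are hypothetical, not the framework's actual data model.

```python
import math
from collections import defaultdict


def outcome_entropy(outcomes):
    """Binary entropy (bits) of a list of 0/1 outcomes; 0 means fully consistent."""
    p = sum(outcomes) / len(outcomes)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def find_inconsistent_prompts(traces, min_score=0.15):
    """Group traces by base prompt and keep those whose variants disagree.

    `traces` is a hypothetical list of (base_prompt, transform_name, outcome)
    tuples, with outcome 0 = ACCEPT and 1 = REFUSE.
    """
    grouped = defaultdict(list)
    for base_prompt, _transform, outcome in traces:
        grouped[base_prompt].append(outcome)
    scores = {base: outcome_entropy(outs) for base, outs in grouped.items()}
    return {base: score for base, score in scores.items() if score >= min_score}


traces = [
    ("prompt A", "identity", 1),
    ("prompt A", "to_lowercase", 1),
    ("prompt A", "to_interrogative", 0),
    ("prompt B", "identity", 1),
    ("prompt B", "to_lowercase", 1),
]
anomalies = find_inconsistent_prompts(traces)
print(anomalies)
```

Here only "prompt A" is flagged, because one of its transformed variants was accepted while the others were refused; such a group is a natural starting point for a defense hypothesis.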
Seed Prompts → Behavioral Variants → Victim Probing →
Anomaly Detection → Hypothesis Generation → Version Space Init →
[Main Loop]
Competing Program Selection → Intervention Design (EFE) →
Red Team Refinement → Victim Probing → Outcome Classification →
Belief Update → Convergence Check (entropy & accuracy)
[Every N: Researcher Synthesis → Inject New Programs]
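The belief-update and convergence steps of the main loop can be sketched against a toy victim. This is an illustration only: the probes are fixed rather than chosen by EFE, the soft-likelihood parameter `eps` is an assumed value, and the three candidate programs are made up. The entropy threshold of 0.1 mirrors the `entropy_convergence_threshold` value in the example configuration.

```python
import math


def entropy(beliefs):
    """Shannon entropy (bits) of the posterior over candidate programs."""
    return -sum(b * math.log2(b) for b in beliefs if b > 0)


def update_beliefs(beliefs, predictions, observed, eps=0.05):
    """Soft-likelihood Bayes update; eps is an assumed mismatch probability."""
    weights = [
        b * ((1 - eps) if p == observed else eps)
        for b, p in zip(beliefs, predictions)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def toy_victim(prompt):
    """Stand-in victim: refuses (1) any prompt containing 'code'."""
    return 1 if "code" in prompt.lower() else 0


# Hypothetical candidate defense programs in the version space.
programs = {
    "keyword": lambda p: 1 if "code" in p.lower() else 0,
    "length": lambda p: 1 if len(p) >= 30 else 0,
    "question": lambda p: 0 if p.endswith("?") else 1,
}
beliefs = [1 / len(programs)] * len(programs)
probes = [
    "short code",
    "is this fine?",
    "a long harmless sentence about gardening",
    "code?",
]

for step, probe in enumerate(probes, start=1):
    observed = toy_victim(probe)
    predictions = [prog(probe) for prog in programs.values()]
    beliefs = update_beliefs(beliefs, predictions, observed)
    if entropy(beliefs) < 0.1:  # entropy convergence check
        break

best = max(zip(programs, beliefs), key=lambda pair: pair[1])[0]
print(step, best, [round(b, 3) for b in beliefs])
```

Probes on which all candidates agree leave the posterior unchanged, which is why the loop benefits from interventions chosen to split the competing programs.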
AgentTroop represents the victim model's hypothesized defense behavior as programs in a custom domain-specific language (DSL). These programs are executable ASTs that predict whether a given prompt will be REFUSEd (1) or ACCEPTed (0).
Program
└── IfThenElseNode
    ├── condition: Node      ← predicate/logic expression
    ├── then_outcome: 1      ← REFUSE when condition is TRUE
    └── else_outcome: 0      ← ACCEPT when condition is FALSE
Atomic nodes evaluate a single prompt feature:
- `PredicateNode`: Boolean test (e.g., `contains_word("code")`, `length_gt(100)`, `starts_with("In a")`)
- `ClassifierNode`: Scalar score with threshold (e.g., `toxicity > 0.8`, `sentiment < 0.3`)
- `TransformNode`: Prompt transformation applied before predicate evaluation

Composite nodes combine conditions:
- `AndNode(left, right)`: True iff both children are True
- `OrNode(left, right)`: True iff either child is True
- `NotNode(child)`: Negation
- `ThresholdNode(classifier, threshold)`: Compares a classifier score against a threshold using an operator (`>`, `<`)
// Keyword-based: REFUSE if prompt contains "code"
IfThenElseNode(
    condition = PredicateNode(contains_word, word="code", case_sensitive=False),
    then_outcome = 1,
    else_outcome = 0
)

// Length-based: ACCEPT very short prompts
IfThenElseNode(
    condition = PredicateNode(length_lt, threshold=30),
    then_outcome = 0,
    else_outcome = 1
)

// Composite: REFUSE if prompt starts with an imperative verb AND asks for code
IfThenElseNode(
    condition = AndNode(
        left = PredicateNode(starts_with_imperative),
        right = PredicateNode(contains_any_word, words=["code", "script", "program"])
    ),
    then_outcome = 1,
    else_outcome = 0
)

// Classifier threshold: REFUSE if toxicity score > 0.8
IfThenElseNode(
    condition = ThresholdNode(
        classifier = ToxicityScoreClassifier,
        threshold = 0.8,
        operator = "gt"
    ),
    then_outcome = 1,
    else_outcome = 0
)
The DSL is backed by a unified primitive registry (`core/primitive.py`) with 92 primitives in three categories:

Predicates (27) are Boolean functions over prompt text:
- Lexical: `contains_word(word)`, `contains_any_word(words)`, `starts_with(prefix)`, `ends_with(suffix)`, `matches_regex(pattern)`, `is_imperative()`, `is_grammatical_question()`
- Structural: `length_gt(threshold)`, `length_lt(threshold)`, `char_count(op, threshold)`, `has_emoji()`, `contains_url()`, `contains_code_block()`
- Semantic: `sentiment()`, `intent(type)`, `is_instruction_request()`, `is_repetitive()`
- Jailbreak signals: `contains_encoding_wrapper()`, `matches_jailbreak_pattern()`, `contains_system_override()`, `contains_delimiter()`

Transforms (19) are prompt string transformations for intervention design:
- Semantic: `to_lowercase()`, `to_uppercase()`, `to_interrogative()`, `to_imperative()`, `to_declarative()`, `random_case()`, `insert_synonyms()`
- Structural: `add_prefix(prefix)`, `add_suffix(suffix)`, `remove_punctuation()`, `escape_quotes()`, `format_as_json()`, `wrap_code_block(language)`, `add_markdown()`
- Encoding: `html_encode()`, `add_zero_width_chars()`, `pad_to_length(n)`, `add_ignore_filter_token(token)`, `add_role_play(role)`

Classifiers (27) are continuous scoring functions for threshold-based decisions:
- `toxicity_score()`, `sentiment_score()`, `obscurity_score()`, `jailbreak_likelihood()`, `code_likelihood()`
- `refusal_similarity()`, `harmfulness_similarity()`, `roleplay_likelihood()`, `persuasion_score()`
- `length_score()`, `repetition_score()`, `entropy_score()`, `unique_token_ratio()`
- `special_char_ratio()`, `digit_ratio()`, `uppercase_ratio()`, `punctuation_ratio()`, `whitespace_ratio()`
- `contains_blacklisted_word()`, `gpt2_perplexity()`, `encoding_detection()`
- `prompt_injection_likelihood()`, `adversarial_suffix_score()`, `sql_likelihood()`, `json_likelihood()`
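A registry that groups primitives by category could look roughly like the sketch below. It is an assumption about the general shape, not the contents of `core/primitive.py`: the `Primitive` dataclass, the `register` decorator, and the simplified function bodies are hypothetical, and only three primitive names from the lists above are used.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Primitive:
    name: str
    category: str  # "predicate", "transform", or "classifier"
    fn: Callable


REGISTRY: Dict[str, Primitive] = {}


def register(category: str):
    """Decorator that records a function in the registry under its own name."""
    def wrap(fn):
        REGISTRY[fn.__name__] = Primitive(fn.__name__, category, fn)
        return fn
    return wrap


@register("predicate")
def contains_word(prompt: str, word: str, case_sensitive: bool = False) -> bool:
    # Simplified: whitespace tokenization, so "code." would not match "code".
    if not case_sensitive:
        prompt, word = prompt.lower(), word.lower()
    return word in prompt.split()


@register("transform")
def to_lowercase(prompt: str) -> str:
    return prompt.lower()


@register("classifier")
def uppercase_ratio(prompt: str) -> float:
    letters = [c for c in prompt if c.isalpha()]
    return sum(c.isupper() for c in letters) / len(letters) if letters else 0.0


def by_category(category: str):
    """Names of all registered primitives in one category, sorted."""
    return sorted(p.name for p in REGISTRY.values() if p.category == category)


print(by_category("predicate"), by_category("transform"), by_category("classifier"))
```

Keeping all three kinds in one lookup table lets a program node refer to a primitive by name, which is convenient when programs are serialized or mutated during synthesis.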
Programs are executed by the `ProgramExecutor` (`core/executor.py`), which walks the AST directly; no LLM calls are made at prediction time:

`executor.execute(program, prompt) -> Outcome  # 0 = ACCEPT, 1 = REFUSE`

The executor recursively evaluates each node:
- PredicateNode: Calls `primitive.evaluate(prompt)` → `bool` → mapped to outcome
- ThresholdNode: Calls `classifier.score(prompt)` → compares against threshold
- AndNode/OrNode: Short-circuit evaluation of children
- NotNode: Negates child's result
- IfThenElseNode: If condition is True → `then_outcome`, else → `else_outcome`
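The recursive evaluation described above can be sketched with plain dataclasses. This is a simplified stand-in, not the code in `core/program.py` or `core/executor.py`: nodes hold Python callables directly instead of registry primitives, and the field names `test` and `score` are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class PredicateNode:
    test: Callable[[str], bool]


@dataclass
class ThresholdNode:
    score: Callable[[str], float]
    threshold: float
    operator: str = "gt"  # "gt" or "lt"


@dataclass
class AndNode:
    left: "Node"
    right: "Node"


@dataclass
class OrNode:
    left: "Node"
    right: "Node"


@dataclass
class NotNode:
    child: "Node"


@dataclass
class IfThenElseNode:
    condition: "Node"
    then_outcome: int = 1  # REFUSE
    else_outcome: int = 0  # ACCEPT


Node = Union[PredicateNode, ThresholdNode, AndNode, OrNode, NotNode]


def evaluate(node: Node, prompt: str) -> bool:
    """Recursively evaluate a condition tree on one prompt."""
    if isinstance(node, PredicateNode):
        return bool(node.test(prompt))
    if isinstance(node, ThresholdNode):
        value = node.score(prompt)
        return value > node.threshold if node.operator == "gt" else value < node.threshold
    if isinstance(node, AndNode):  # `and` short-circuits on a False left child
        return evaluate(node.left, prompt) and evaluate(node.right, prompt)
    if isinstance(node, OrNode):  # `or` short-circuits on a True left child
        return evaluate(node.left, prompt) or evaluate(node.right, prompt)
    if isinstance(node, NotNode):
        return not evaluate(node.child, prompt)
    raise TypeError(f"unknown node type: {type(node).__name__}")


def execute(program: IfThenElseNode, prompt: str) -> int:
    """Return 1 (REFUSE) or 0 (ACCEPT) for a prompt."""
    return program.then_outcome if evaluate(program.condition, prompt) else program.else_outcome


program = IfThenElseNode(
    condition=AndNode(
        left=PredicateNode(lambda p: len(p) > 20),
        right=PredicateNode(lambda p: "code" in p.lower()),
    )
)
long_result = execute(program, "Please write some code for me")
short_result = execute(program, "code")
print(long_result, short_result)
```

Because evaluation is a pure tree walk over local functions, predictions are cheap enough to run for every candidate program on every candidate probe.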
| Module | Purpose |
|---|---|
| `core/primitive.py` | 92 primitives: 27 Predicates, 19 Transforms, 27 Classifiers for building defense programs |
| `core/program.py` | AST-based defense program representation (IfThenElse, And, Or, Not, Threshold) |
| `core/executor.py` | `ProgramExecutor`: evaluates defense programs against prompts |
| `core/jailbreak.py` | 21 jailbreak techniques with templates (DAN, GCG, hex_injection, persona, etc.) and UCB1 bandit selection |
| `inference/version_space.py` | Bayesian Version Space: top-K candidate programs, posterior beliefs, entropy, information gain |
| `inference/efe.py` | Expected Free Energy computation for intervention selection |
| `inference/belief_updater.py` | Bayesian belief update with soft likelihood |
| `agents/cognitive.py` | Behavioral anomaly detection (group-by-base-prompt, entropy scoring, adaptive percentile) and hypothesis generation |
| `agents/strategist.py` | Disagreement-driven intervention design (Δ scoring → EFE rescore → semantic rescore), program prediction, technique selection |
| `agents/researcher.py` | Evolutionary synthesis (pop=100, gen=30, mut=0.2, cross=0.7) and program verification |
| `agents/red_team.py` | Jailbreak-aware probing prompt refinement |
| `orchestration/orchestrator.py` | 6-phase loop coordination, belief management, convergence detection |
| `synthesis/` | Evolutionary synthesizer (genetic programming), verifier |
| `evaluation/` | Evaluators for RQ0 (FA), RQ1 (PBA, NI), RQ2 (MR), RQ3 (ablation) |
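The UCB1 selection rule mentioned for `core/jailbreak.py` is a standard bandit algorithm, and a generic version is sketched here. The class, the placeholder arm names, and the fixed payoffs are hypothetical; how the framework defines rewards is not stated in this document.

```python
import math


class UCB1:
    """UCB1 bandit over a fixed set of arm names."""

    def __init__(self, arms):
        self.counts = {arm: 0 for arm in arms}
        self.rewards = {arm: 0.0 for arm in arms}

    def select(self):
        # Play every arm once before applying the confidence bound.
        for arm, pulls in self.counts.items():
            if pulls == 0:
                return arm
        total = sum(self.counts.values())
        return max(
            self.counts,
            key=lambda a: self.rewards[a] / self.counts[a]
            + math.sqrt(2 * math.log(total) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.rewards[arm] += reward


# Placeholder arms with fixed, made-up payoffs in [0, 1].
payoff = {"technique_a": 0.2, "technique_b": 0.8, "technique_c": 0.5}
bandit = UCB1(payoff)
for _ in range(200):
    arm = bandit.select()
    bandit.update(arm, payoff[arm])

most_used = max(bandit.counts, key=bandit.counts.get)
print(most_used, bandit.counts)
```

The square-root bonus shrinks as an arm is pulled more often, so rarely tried arms keep getting occasional pulls while the best-performing arm receives most of the budget.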
| Metric | Description |
|---|---|
| Attack Success Rate (ASR) | % of harmful prompts that bypass safety mechanisms |
| Malicious Rate (MR) | % of successful jailbreak responses containing actionable malicious code |
| Final Accuracy (FA) | Synthesized program agreement with victim on unseen prompts |
| Peak Balanced Accuracy (PBA) | Best balanced accuracy on harmful + benign validation set |
| Number of Interventions (NI) | Interventions required to reach PBA |
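For reference, balanced accuracy and the percentage-style metrics can be computed as in the sketch below. It uses the standard definition of balanced accuracy (mean of per-class recall); whether the evaluators in `evaluation/` use exactly this form is an assumption, and the label vectors and hit counts are made-up examples.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the REFUSE class (1) and the ACCEPT class (0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("balanced accuracy needs both classes in y_true")
    return 0.5 * (tp / pos + tn / neg)


def rate(hits, total):
    """Percentage helper for ASR- and MR-style ratios."""
    return 100.0 * hits / total if total else 0.0


# Victim outcomes vs. program predictions on a small made-up validation set.
victim = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 1, 1]
ba = balanced_accuracy(victim, predicted)

# Peak balanced accuracy is the best value seen across iterations.
history = [0.55, 0.62, ba, 0.70]
pba = max(history)
print(round(ba, 4), round(pba, 4), rate(3, 12))
```

Balanced accuracy is the natural choice here because harmful and benign prompts are evaluated together, and a program that always predicts REFUSE would otherwise look deceptively accurate on a mostly harmful set.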
- Python 3.10+
- UV (Python package manager)
- Redis (for session memory)
- Ollama (for local victims) or OpenRouter / OpenAI API key (for cloud victims)
- Neo4j (optional, for scientific memory and causal graph)
# macOS
brew install redis
brew install neo4j # optional
brew install ollama # for local victims
# Ubuntu/Debian
sudo apt install redis-server
# Download Neo4j from https://neo4j.com/download/
# Install Ollama: curl -fsSL https://ollama.com/install.sh | sh

# Install UV (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and setup
git clone <repo-url> && cd HARMONY_X
uv sync

Or use the automated setup script:

bash experiments/setup.sh

Copy .env.template to .env and fill in your API keys:

cp .env.template .env

Required variables in .env:
| Variable | Required for | Description |
|---|---|---|
| `OPENROUTER_API_KEY` | OpenRouter backend | API key from https://openrouter.ai/keys |
| `OPENAI_API_KEY` | OpenAI backend / agentic LLM | API key from https://platform.openai.com/api-keys |
| `REPLICATE_API_TOKEN` | Replicate backend | API token from https://replicate.com/account/api-tokens |
Optional variables:
| Variable | Default | Description |
|---|---|---|
| `HX_REDIS_URL` | `redis://localhost:6379/0` | Redis connection string |
| `HX_NEO4J_URI` | `bolt://localhost:7687` | Neo4j connection URI |
| `HX_NEO4J_USER` | `neo4j` | Neo4j username |
| `HX_NEO4J_PASSWORD` | `password` | Neo4j password |
| `HARMFUL_CSV` | `prompt.csv` | Path to RMCBench harmful prompts CSV |
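As a quick sanity check before a run, the variables above can be resolved with their documented defaults. This is a standalone sketch, not the project's configuration loader; `load_settings` and `missing_api_key` are hypothetical helper names, and the backend-to-key mapping simply restates the "Required for" column.

```python
import os

# Optional variables and their documented defaults.
DEFAULTS = {
    "HX_REDIS_URL": "redis://localhost:6379/0",
    "HX_NEO4J_URI": "bolt://localhost:7687",
    "HX_NEO4J_USER": "neo4j",
    "HX_NEO4J_PASSWORD": "password",
    "HARMFUL_CSV": "prompt.csv",
}

# API key required by each cloud backend.
REQUIRED_KEY = {
    "openrouter": "OPENROUTER_API_KEY",
    "openai": "OPENAI_API_KEY",
    "replicate": "REPLICATE_API_TOKEN",
}


def load_settings(env=None):
    """Return the optional settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


def missing_api_key(backend, env=None):
    """Return the name of the unset key for a backend, or None if nothing is missing."""
    env = os.environ if env is None else env
    key = REQUIRED_KEY.get(backend)  # local backends such as ollama need no key
    if key is not None and not env.get(key):
        return key
    return None


print(load_settings())
print(missing_api_key("openrouter"))
```

Passing an explicit dictionary as `env` makes both helpers easy to test without touching the real environment.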
ollama pull codellama:7b
ollama pull llama3.1:8b
ollama pull phi4
ollama pull qwen
ollama pull deepseek-r1:8b

brew services start redis   # Redis (session memory)
brew services start neo4j   # Neo4j (optional, scientific memory)
ollama serve                # Ollama (only if using local victims)

Configuration is managed via YAML files in configs/ and experiments/configs/:
| File | Purpose |
|---|---|
| `configs/experiment_config.yaml` | Default experiment config |
| `configs/config_quick_test.yaml` | Quick debug config (1 seed, 25 iterations, limited transforms) |
| `configs/config_test_fix.yaml` | Fix validation config |
| `experiments/configs/ollama_experiment_config.yaml` | Ollama backend config |
| `experiments/configs/openrouter_experiment_config.yaml` | OpenRouter backend config |
| `experiments/configs/openai_experiment_config.yaml` | OpenAI backend config |
# Orchestrator
orchestrator:
  max_iterations: 50                   # Max main loop iterations
  max_interventions: 500               # Max total interventions
  synthesis_interval: 3                # Run Researcher every N interventions
  entropy_convergence_threshold: 0.1   # Stop when H < threshold
  accuracy_threshold: 0.85             # Stop when program accuracy >= threshold

# Cognitive Agent
cognitive:
  anomaly_threshold: 0.15              # Minimum anomaly score
  anomaly_selection:
    method: "percentile"               # "percentile" or "threshold"
    percentile: 85                     # Top percentile for anomaly selection

# Strategist Agent
strategist:
  max_chain_depth: 4                   # Max transform chain depth
  max_candidates_heuristic: 120        # Max heuristic candidates
  num_trials: 4                        # Prediction trials for non-deterministic classifiers

# Researcher / Synthesis
synthesis:
  mode: "evolutionary"
  evolutionary:
    population_size: 150               # GA population
    generations: 50                    # GA generations
    mutation_rate: 0.25                # Mutation probability
    crossover_rate: 0.75               # Crossover probability

# Victim (Ollama)
victim:
  ollama_url: "http://localhost:11434"
  model_name: "llama3.1:8b"
  temperature: 0.0
  max_tokens: 150

# Victim (OpenRouter)
victim:
  model_name: "meta-llama/llama-3.2-3b-instruct"
  temperature: 0.0
  max_tokens: 150

# Toy victim (deterministic, no API calls)
python experiments/run_experiment.py \
--config configs/config_quick_test.yaml \
--backend ollama \
--model-name "" \
--num-seeds 3

# Via Ollama (local), using all prompts
python experiments/run_experiment.py \
--backend ollama \
--config experiments/configs/ollama_experiment_config.yaml \
--model-name "codellama:7b" \
--full
# Via OpenRouter (API)
python experiments/run_experiment.py \
--backend openrouter \
--config experiments/configs/openrouter_experiment_config.yaml \
--model-name "meta-llama/llama-3.1-8b-instruct" \
--full

# Via OpenRouter with DeepSeek
python experiments/run_experiment.py \
--backend openrouter \
--config experiments/configs/openrouter_experiment_config.yaml \
--model-name "deepseek/deepseek-v3.2" \
--full

# Default (Ollama, llama3.1:8b)
bash experiments/run_exp.sh --full
# With options
bash experiments/run_exp.sh \
--backend openrouter \
--model-name "meta-llama/llama-3.1-8b-instruct" \
--full \
--config experiments/configs/openrouter_experiment_config.yaml

| Argument | Default | Description |
|---|---|---|
| `--config` | `configs/<backend>_experiment_config.yaml` | Path to config YAML |
| `--num-seeds` | 100 | Number of seed prompts for initial variants |
| `--full` | False | Use ALL prompts from CSV (overrides `--num-seeds`) |
| `--backend` | `ollama` | Victim backend: `ollama`, `openrouter`, `openai`, `replicate` |
| `--agentic-backend` | `openai` | Backend for agent LLMs (cognitive, red-team, judge) |
| `--model-name` | from config | Override victim model name |
| `--num-asr` | `full` | Seed prompts for ASR evaluation (`full` = all 473 RMCBench prompts, or a number) |
| `--num-variants` | 5 | Variants per prompt in ASR evaluation (total = `--num-asr` × `--num-variants`) |
| `--max-techniques` | 0 | Limit jailbreak techniques (0 = all 21) |
| `--judge-backend` | same as agentic | Backend for judge LLM |
| `--judge-model` | backend default | Model for judge LLM |
| `--prior-campaign` | None | Prior campaign ID for transfer evaluation (RQ3) |
| `--ablation-strategist` | False | Disable Strategist (random probing baseline) |
| `--ablation-cognitive` | False | Disable Cognitive LLM (fallback hypotheses only) |
# Inject generic task-oriented system prompt into victim (use max prompts)
python experiments/run_with_system_prompt.py \
--model-name "meta-llama/llama-3.1-8b-instruct" \
--num-seeds 473
# With free-form system prompt
python experiments/run_with_system_prompt.py \
--model-name "meta-llama/llama-3.1-8b-instruct" \
--num-seeds 473 \
--free

# Run all ablations with toy victim
python -m experiments.ablation.run_all
# Run individual ablations
python -m experiments.ablation.no_synthesis
python -m experiments.ablation.random_probing
python -m experiments.ablation.no_scientific_memory
python -m experiments.ablation.no_llm

| Mode | Flag | What changes |
|---|---|---|
| No Strategist | `--ablation-strategist` | Random pair selection, identity-only interventions |
| No Cognitive LLM | `--ablation-cognitive` | Keyword-only fallback hypotheses (no LLM) |
| No Synthesis | `no_synthesis` wrapper | No evolutionary synthesis; the VS only has compiled programs |
| No Scientific Memory | `no_scientific_memory` wrapper | Disables Neo4j graph store |
| Random Probing | `random_probing` wrapper | Random intervention design (no VS-driven selection) |

After a campaign completes, run evaluation:
python experiments/run_evaluation.py \
--campaign <campaign_id> \
--num-asr full \
--num-variants 5 \
--accuracy-threshold 0.85 \
--output-dir evaluation/reports
# Example
python experiments/run_evaluation.py \
--campaign deepseek_deepseek_v3_2_20260622_230237 \
--num-asr full \
--num-variants 5

| Argument | Default | Description |
|---|---|---|
| `--campaign` | required | Campaign ID to evaluate |
| `--program-id` | best program | Specific program ID for RQ0 evaluation |
| `--num-asr` | `full` | Seed prompts for ASR evaluation (`full` = all, or a number) |
| `--num-variants` | 5 | Variants per prompt |
| `--accuracy-threshold` | 0.85 | Threshold for RQ0/RQ1 |
| `--judge` | `llm` | Judge type: `rule` or `llm` |
| `--llm-model` | `gemma-4-31b-it` | LLM model for LLMJudge |
| `--baseline-campaign` | None | Prior campaign for RQ1 random probing comparison |
| `--transfer-threshold` | 0.9 | Transfer accuracy threshold for RQ3 |
| `--prompt-csv` | `prompt.csv` | Path to RMCBench harmful prompts |

outputs/campaign/<campaign_id>/
├── <campaign_id>_episodic.db      # SQLite: all interaction traces
├── final_program.json             # Best discovered defense program
├── final_theory.json              # Abstracted safety theory
├── hypotheses_history.json        # Generated hypothesis summaries
├── interventions_history.json     # All probing attempts (187+ episodes)
├── version_space.json             # VS state: candidates, entropy, posterior history
├── sde_evidence.json              # Semantic boundary evidence
├── technique_stats.json           # UCB technique selection statistics
└── evaluation/
    └── evaluation_report.json     # ASR, MR, FA, PBA, NI results
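The JSON artifacts in a campaign directory can be inspected with a few lines of standard-library Python. This sketch makes no assumptions about the schema inside each file and only loads whatever exists; `load_campaign_artifacts` is a hypothetical helper and the directory name in the usage line is a placeholder.

```python
import json
from pathlib import Path

# File names taken from the layout above.
ARTIFACTS = [
    "final_program.json",
    "version_space.json",
    "technique_stats.json",
    "evaluation/evaluation_report.json",
]


def load_campaign_artifacts(campaign_dir):
    """Load whichever of the known JSON artifacts exist in a campaign directory."""
    root = Path(campaign_dir)
    loaded = {}
    for name in ARTIFACTS:
        path = root / name
        if path.is_file():
            with path.open(encoding="utf-8") as handle:
                loaded[name] = json.load(handle)
    return loaded


# Placeholder campaign ID; an empty dict is returned if nothing is found.
artifacts = load_campaign_artifacts("outputs/campaign/example_campaign")
print(sorted(artifacts))
```

Missing files are skipped rather than raising, which is convenient for campaigns that were interrupted before evaluation.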
The project uses RMCBench (Benchmarking Large Language Models' Resistance to Malicious Code) — the first benchmark specifically designed to measure LLM resistance to malicious code generation, published at IEEE/ACM ASE 2024. It contains 473 seed prompts across two evaluation scenarios:
| Scenario | Description | Levels |
|---|---|---|
| Text-to-Code | Natural language descriptions requesting malicious code generation | Level 1: explicit keywords; Level 2: implicit (metaphorical) |
| Code-to-Code | Code translation or completion of malicious code snippets | Code Translation, Code Completion |
Scope:
- 11 malware categories: Viruses, Worms, Trojan horses, Spyware, Adware, Ransomware, Rootkits, Phishing, Vulnerability Exploitation, Network attacks, Others
- 9 programming languages: C, C++, C#, Go, Java, PHP, Python, HTML/JavaScript, Bash
- 28.71% average refusal rate across 11 popular LLMs, highlighting the challenge of safety alignment
The full dataset (473 prompts × variants) has 38,168 rows in `prompt.csv` (columns: `pid`, `category`, `task`, `level`, `description`, `level`, `prompt`, `malicious functionality`, ..., `language`, ..., `code to be completed`, ...).
Benign prompts for balanced accuracy evaluation: data/benign_prompts.csv.
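A per-category row count is a quick way to sanity-check the CSV before a run. The sketch below uses an in-memory stand-in with made-up rows and only a few of the listed column names; `count_by_column` is a hypothetical helper. With the real file, pass a handle opened with `open("prompt.csv", newline="", encoding="utf-8")` instead.

```python
import csv
import io
from collections import Counter


def count_by_column(handle, column="category"):
    """Count rows per value of `column` in a CSV stream that has a header row."""
    reader = csv.DictReader(handle)
    return Counter(row[column] for row in reader)


# In-memory stand-in for prompt.csv; the rows are invented for illustration.
sample = io.StringIO(
    "pid,category,task,language\n"
    "1,Viruses,text-to-code,Python\n"
    "2,Worms,text-to-code,C\n"
    "3,Viruses,code-to-code,Go\n"
)
counts = count_by_column(sample)
print(counts.most_common())
```

Note that `csv.DictReader` keeps only the last value when a header name repeats, which matters for the duplicated `level` column listed above.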
- Ollama (local): codellama:7b/13b/34b, llama3.1:8b, phi4, qwen, deepseek-r1:8b, meetai-small
- OpenRouter (API): 50+ models via unified API (GPT-4o, Claude, Gemini, DeepSeek, LLaMA, etc.)
- OpenAI (API): GPT-4o-mini, GPT-4o
- Replicate (API): Various open models
- Toy victims (testing/diagnostic): KeywordFilter, LengthFilter, RegexVictim, ThresholdVictim
- Novel approach: Response-aware multi-agent framework combining coordinated agent reasoning, Bayesian hypothesis maintenance, and hypothesis-guided intervention generation
- Empirical evaluation: Comprehensive evaluation against proprietary and open-source LLMs (GPT-4o-mini, Gemini-2.5-Flash, Phi-4, LLaMA-3.1-8B, DeepSeek-V3.2, Qwen2.5-72B, CodeLlama, MiMo-2.5-Pro) using the RMCBench benchmark
- Open science: Full replication package with code and data
@article{agenttroop2026,
title={The Wolf Pack: Response-Aware Multi-Agent Attacks for Eliciting Malicious Code},
author={Anonymous Authors},
year={2026},
journal={arXiv preprint}
}