RED TEAM // BENCHMARKS
← back to the map
measurementbenchmark

AgentDojo

Debenedetti et al. (ETH Zurich SPY Lab) · NeurIPS 2024

A dynamic evaluation framework for the adversarial robustness of tool-using LLM agents. Instead of a static dataset, AgentDojo runs agents through realistic multi-step tool-calling environments where attacker-controlled data flows back through tool outputs — the exact channel a prompt injection exploits in production.

What it measures

Two axes jointly, so you can't trade one for the other silently:

Scores are reported per (defense, attack) pair, so a defense that tanks utility to dodge injections is visible immediately.

How red teamers use it

Strengths & limits

Strengths: the first rigorous, reproducible security benchmark for tool-using agents — dynamic execution (not memorizable strings), utility and attack measured together, and a clean extension API for new attacks/defenses. Frontier agents still fail a meaningful share of injection cases, so headroom is real.

Limits: a fixed catalog of synthetic tasks and injection templates — strong on a benchmark suite ≠ secure against an adaptive human attacker. Coverage is evolving; forks (agentdojo-core ↗, NIST's AgentDojo-Inspect) extend domains and attacks. Treat results as a floor on robustness, not a ceiling.