measurementbenchmark

AgentDojo

Debenedetti et al. (ETH Zurich SPY Lab) · NeurIPS 2024

A dynamic evaluation framework for the adversarial robustness of tool-using LLM agents. Instead of a static dataset, AgentDojo runs agents through realistic multi-step tool-calling environments where attacker-controlled data flows back through tool outputs — the exact channel a prompt injection exploits in production.

What it measures

Two axes jointly, so you can't trade one for the other silently:

Utility — does the agent complete the legitimate user task? 97 tasks span four domains: workspace (email/calendar/cloud drive), banking, travel booking, and a Slack-like environment.
Targeted attack success / robustness — when malicious instructions are planted in untrusted tool data, does the agent execute the attacker's injection task instead? Pairing user tasks with injection tasks yields 629 security test cases.

Scores are reported per (defense, attack) pair, so a defense that tanks utility to dodge injections is visible immediately.

How red teamers use it

Agent hijacking tests — measure whether your agent can be steered to send data to an attacker, move money, or take unauthorized tool actions via injected content.
Plugin interfaces — implement a common interface to drop in new attacks (e.g. tool-knowledge / important-message injections), defenses (tool filtering, prompt sandwiching, data delimiting), models, and task suites. Run agentdojo's benchmark harness across the matrix.
Extend it with your own domain tools to red-team a specific deployment rather than the stock suites.

Strengths & limits

Strengths: the first rigorous, reproducible security benchmark for tool-using agents — dynamic execution (not memorizable strings), utility and attack measured together, and a clean extension API for new attacks/defenses. Frontier agents still fail a meaningful share of injection cases, so headroom is real.

Limits: a fixed catalog of synthetic tasks and injection templates — strong on a benchmark suite ≠ secure against an adaptive human attacker. Coverage is evolving; forks (agentdojo-core ↗, NIST's AgentDojo-Inspect) extend domains and attacks. Treat results as a floor on robustness, not a ceiling.