RED TEAM // CANON
the attack canon · black-box

PAIR — Prompt Automatic Iterative Refinement

Chao, Robey, Dobriban, Hassani, Pappas, Wong · 2023

PAIR automates jailbreak discovery by pitting one LLM against another: an attacker model rewrites a malicious prompt over a short feedback loop until a target model complies — typically in fewer than twenty queries and with no access to weights or gradients.

What it exploits

The same instruction-following and conversational flexibility that makes a model useful also makes it steerable away from its safety training. PAIR treats the target as a pure black box: prompts in, text out, no logits, no internals. It exploits the fact that a refusal is informative: the target's own response often reveals why it declined, and that signal is enough for an attacker model to improve the next attempt. It also removes a bottleneck in prior work: manual jailbreaks (DAN-style personas, role-play framings) are powerful but slow to craft, and PAIR shows that the crafting itself can be delegated to a model.

How it works

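PAIR runs a tight adversarial loop between two or three LLMs, seeded with a target behavior (e.g. "explain how to do X"). An attacker model proposes a candidate prompt, the target model responds, and a judge scores the response; the prompt, response, and score are then fed back to the attacker for the next attempt.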

Because each iteration is conditioned on the previous failure, the search is far more directed than random or brute-force prompting. Parallel conversation "streams" can be run to raise the odds within the same budget. Crucially, no model internals are needed — it works against any system you can send a prompt to and read a reply from.
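The control flow of that loop can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names run_pair, PairResult, n_streams, max_iters and threshold are hypothetical, the 10-point success threshold is an assumption, and the attacker, target and judge are harmless toy stand-ins (plain Python functions playing a word-guessing game) rather than real models. It shows only how feedback from a failed attempt conditions the next one, and how the query count grows with streams and iterations.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# One past attempt: (candidate prompt, target response, judge score).
Attempt = Tuple[str, str, int]


@dataclass
class PairResult:
    success: bool
    prompt: Optional[str]  # the prompt that succeeded, if any
    queries: int           # number of calls made to the target


def run_pair(
    objective: str,
    attacker: Callable[[str, List[Attempt]], str],
    target: Callable[[str], str],
    judge: Callable[[str, str, str], int],
    n_streams: int = 3,   # hypothetical default: parallel conversations
    max_iters: int = 5,   # hypothetical default: refinements per stream
    threshold: int = 10,  # assumed score at which the judge reports success
) -> PairResult:
    # Each stream keeps its own history, so refinements stay independent.
    histories: List[List[Attempt]] = [[] for _ in range(n_streams)]
    queries = 0
    for _ in range(max_iters):
        for history in histories:
            prompt = attacker(objective, history)  # conditioned on past failures
            response = target(prompt)              # black box: text in, text out
            queries += 1
            score = judge(objective, prompt, response)
            if score >= threshold:
                return PairResult(True, prompt, queries)
            history.append((prompt, response, score))
    # Budget exhausted: at most n_streams * max_iters target queries were made.
    return PairResult(False, None, queries)


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_target(prompt: str) -> str:
    # "Refuses" until the prompt contains both words, and says which is missing.
    for word in ("please", "blue"):
        if word not in prompt:
            return f"refused: missing '{word}'"
    return "ok"


def toy_attacker(objective: str, history: List[Attempt]) -> str:
    if not history:
        return objective
    last_prompt, last_response, _ = history[-1]
    missing = last_response.split("'")[1]  # read the reason out of the refusal
    return f"{last_prompt} {missing}"


def toy_judge(objective: str, prompt: str, response: str) -> int:
    return 10 if response == "ok" else 1


result = run_pair("open the door", toy_attacker, toy_target, toy_judge, n_streams=1)
exhausted = run_pair("open the door", toy_attacker, toy_target, toy_judge,
                     n_streams=2, max_iters=2)
```

In the toy game each refusal names what was missing, so a single stream succeeds on its third query. Against a real target the feedback is far noisier, which is why running several streams within the same query budget can help.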

Why it matters

PAIR is the reference template for automated black-box red teaming. Earlier semantic attacks needed human ingenuity; earlier automated attacks (like GCG) needed white-box gradient access and produced unreadable token-salad suffixes. PAIR is both automated and black-box, and its jailbreaks are human-readable, semantically coherent prompts that often transfer across models. The query efficiency is the headline — twenty queries is cheap enough to run against commercial API endpoints at scale, which reframes jailbreak discovery from artisanal effort into a repeatable pipeline. Its attacker/judge loop became a building block for later systems (e.g. TAP, which adds tree-of-thought search).

Defenses & detection

No single fix neutralizes PAIR, but several layers raise its cost: