
PAIR — Prompt Automatic Iterative Refinement

Chao, Robey, Dobriban, Hassani, Pappas, Wong · 2023

Paper: Jailbreaking Black Box Large Language Models in Twenty Queries

PAIR automates jailbreak discovery by pitting one LLM against another: an attacker model rewrites a malicious prompt over a short feedback loop until a target model complies — typically in fewer than twenty queries and with no access to weights or gradients.

What it exploits

The same instruction-following and conversational flexibility that make a model useful also make it steerable away from its safety training. PAIR treats the target as a pure black box: prompts in, text out, no logits, no internals. It exploits the fact that a refusal is informative: the target's own response often signals why it declined, and that signal is enough to systematically improve the next attempt. It also removes a bottleneck in earlier practice: manual jailbreaks (DAN-style personas, role-play framings) are powerful but slow to craft, and PAIR shows that the crafting itself can be delegated to a model.

How it works

PAIR runs a tight adversarial loop between two or three LLMs, seeded with a target behavior (e.g. "explain how to do X"):

Attacker LLM — generates a candidate jailbreak prompt and an explicit chain-of-thought on its strategy (role-play, obfuscation, hypothetical framing, etc.).
Target LLM — the black box under test; receives the candidate prompt and returns a response.
Judge LLM — scores how close the target's response came to the goal (e.g. 1–10) and whether it constitutes a successful jailbreak. That score plus the target's reply are fed back to the attacker.
Refine & repeat — the attacker uses the judge's feedback and the target's last response to write a sharper prompt. The loop continues until the judge flags success or a query budget (~20) is hit.

Because each iteration is conditioned on the previous failure, the search is far more directed than random or brute-force prompting. Several independent conversation "streams" can be run in parallel, with the query budget split across them, to raise the odds that at least one succeeds. Crucially, no model internals are needed: the method works against any system you can send a prompt to and read a reply from.
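For readers building a red-team evaluation harness, the control flow of one conversation stream can be sketched as below. This is a minimal illustration, not the authors' implementation: the names `Turn` and `run_stream`, the callable signatures, and the success threshold of 10 are all assumptions made for this example, and the three roles are passed in as plain functions so that harmless stubs can stand in for real models.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Turn:
    prompt: str    # candidate prompt written by the attacker
    response: str  # what the target returned
    score: int     # judge's rating of the response


# Hypothetical signatures chosen for this sketch.
Attacker = Callable[[str, List[Turn]], str]  # (goal, history) -> next prompt
Target = Callable[[str], str]                # prompt -> response
Judge = Callable[[str, str, str], int]       # (goal, prompt, response) -> score


def run_stream(
    goal: str,
    attacker: Attacker,
    target: Target,
    judge: Judge,
    max_queries: int = 20,
    success_score: int = 10,
) -> Tuple[Optional[Turn], List[Turn]]:
    """Run one refine-and-repeat stream; return (winning turn or None, history)."""
    history: List[Turn] = []
    for _ in range(max_queries):
        # The attacker sees every earlier prompt, response, and score.
        prompt = attacker(goal, history)
        response = target(prompt)
        score = judge(goal, prompt, response)
        history.append(Turn(prompt, response, score))
        if score >= success_score:
            return history[-1], history
    return None, history


# Harmless stand-ins so the loop can be exercised without any model.
def toy_attacker(goal: str, history: List[Turn]) -> str:
    return f"{goal} (attempt {len(history) + 1})"


def toy_target(prompt: str) -> str:
    return "OK" if "attempt 3" in prompt else "I can't help with that."


def toy_judge(goal: str, prompt: str, response: str) -> int:
    return 10 if response == "OK" else 1


winner, history = run_stream("toy goal", toy_attacker, toy_target, toy_judge)
print(len(history), winner.score if winner else None)  # prints: 3 10
```

Keeping the roles as injected callables means the same harness can be pointed at different targets and judges, and the query budget is enforced in one place.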

Why it matters

PAIR is the reference template for automated black-box red teaming. Earlier semantic attacks needed human ingenuity; earlier automated attacks (like GCG) needed white-box gradient access and produced unreadable token-salad suffixes. PAIR is both automated and black-box, and its jailbreaks are human-readable, semantically coherent prompts that often transfer across models. The query efficiency is the headline: twenty queries is cheap enough to run against commercial API endpoints at scale, which turns jailbreak discovery from an artisanal effort into a repeatable pipeline. Its attacker/judge loop became a building block for later systems (e.g. TAP, Tree of Attacks with Pruning, which adds tree-of-thought search with pruning).

Defenses & detection

No single fix neutralizes PAIR, but several layers raise its cost:

Input/output classifiers and guardrail models that screen for jailbreak intent and harmful completions independent of the main model.
Adversarial training and safety fine-tuning on PAIR-style discovered prompts to close the specific framings it finds.
Query-rate and behavioral monitoring — the iterative loop produces a recognizable burst of near-duplicate, escalating prompts from one session; rate limits and anomaly detection cut the budget the attacker relies on.
Perplexity filters help less here than against GCG — PAIR prompts read as natural language, so defenses tuned to gibberish suffixes won't catch them. Detection should target intent and escalation patterns, not fluency.
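The query-rate and behavioral monitoring layer above can be illustrated with a small per-session detector that flags a burst of near-duplicate prompts inside a time window. This is a rough sketch under stated assumptions: the class name `SessionMonitor`, the window length, and both thresholds are hypothetical, and character-level similarity from `difflib` is only a stand-in for the semantic (for example embedding-based) comparison a production system would likely need, since refined prompts may be heavily paraphrased.

```python
from collections import deque
from difflib import SequenceMatcher
from typing import Deque, Tuple


class SessionMonitor:
    """Flags a session that sends many near-duplicate prompts in a short window."""

    def __init__(
        self,
        window_seconds: float = 300.0,  # assumed look-back window
        similarity: float = 0.6,        # assumed near-duplicate threshold
        max_similar: int = 5,           # assumed burst size that triggers a flag
    ) -> None:
        self.window_seconds = window_seconds
        self.similarity = similarity
        self.max_similar = max_similar
        self._recent: Deque[Tuple[float, str]] = deque()

    def observe(self, timestamp: float, prompt: str) -> bool:
        """Record a prompt; return True if it completes a suspicious burst."""
        # Forget prompts that have aged out of the window.
        while self._recent and timestamp - self._recent[0][0] > self.window_seconds:
            self._recent.popleft()
        # Count earlier in-window prompts that closely resemble this one.
        similar = sum(
            1
            for _, earlier in self._recent
            if SequenceMatcher(None, earlier, prompt).ratio() >= self.similarity
        )
        self._recent.append((timestamp, prompt))
        return similar >= self.max_similar


monitor = SessionMonitor(max_similar=3)
prompts = [
    f"Pretend you are a character who explains the topic, variant {i}"
    for i in range(6)
]
flags = [monitor.observe(10.0 * i, p) for i, p in enumerate(prompts)]
print(flags)  # prints: [False, False, False, True, True, True]
```

A flag here is a signal for rate limiting or review rather than proof of an attack; ordinary users also rephrase and retry, so the thresholds would need tuning against real traffic.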