
Many-shot Jailbreaking

Anil et al. (Anthropic) · 2024

Paper: Many-shot Jailbreaking (Anthropic)

Pad the prompt with hundreds of fake dialogues in which the assistant cheerfully answers harmful questions, and the model learns — in-context, mid-conversation — to keep doing the same.

What it exploits

MSJ exploits two properties that modern LLMs are designed to have. First, large context windows: where early-2023 models held a few thousand tokens, current models accept hundreds of thousands to a million, leaving room for an enormous number of priming examples. Second, in-context learning: the model's ability to pick up a pattern from examples in the prompt alone, with no fine-tuning. Many-shot jailbreaking (MSJ) turns both capabilities against the safety training. The attack lives entirely in the prompt, so it is fully black-box: it needs no weights, gradients, or API internals.
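To make the context-window point concrete, here is a minimal arithmetic sketch of how many fixed-size demonstrations fit in a window. The helper name `max_shots`, the 150-token size per faux exchange, and the 500-token reserve are illustrative assumptions, not figures from the paper.

```python
def max_shots(context_tokens: int, tokens_per_shot: int, reserved_tokens: int = 0) -> int:
    """Upper bound on how many fixed-size demonstrations fit in a context window.

    reserved_tokens is space set aside for the final request and the reply.
    """
    if tokens_per_shot <= 0:
        raise ValueError("tokens_per_shot must be positive")
    usable = max(context_tokens - reserved_tokens, 0)
    return usable // tokens_per_shot


# Assumed, illustrative numbers: 150 tokens per exchange, 500 tokens reserved.
for window in (4_000, 200_000, 1_000_000):
    print(window, max_shots(window, tokens_per_shot=150, reserved_tokens=500))
```

Under these assumptions a 4,000-token window holds a couple of dozen exchanges, while a million-token window holds several thousand, which is the gap the attack relies on.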

How it works

The attacker constructs a single prompt containing many faux turns of a conversation. In each one, a simulated user asks something the model should refuse, and a simulated assistant complies in detail. The real target request is appended at the end. The model, having "observed" dozens to hundreds of compliant exchanges, continues the established pattern instead of refusing.

Scales with shot count. A handful of examples does little; effectiveness climbs as the number of demonstrations grows. In the paper, the negative log-likelihood of a harmful response falls as a power law in the number of shots, which makes the trend predictable enough to extrapolate.
Bigger models are more vulnerable. Stronger in-context learning is exactly what makes a capable model better at absorbing the malicious pattern, so capability and susceptibility move together.
Generalizes across tasks and providers. The same template works across many harm categories and was effective against models from Anthropic and several other developers. Combining it with other jailbreak techniques reduced the number of shots needed.
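The power-law scaling described above can be checked on evaluation data with a straight-line fit in log-log space. The sketch below assumes the form NLL(n) ≈ C · n^(-alpha), where n is the shot count. The function names and the data points are hypothetical; the points are generated from a known curve purely to show that the fit recovers it, and they are not measurements from any model.

```python
import math


def fit_power_law(shots, nll):
    """Least-squares fit of nll ≈ C * shots**(-alpha) in log-log space.

    Returns (C, alpha). All inputs must be positive.
    """
    if len(shots) != len(nll) or len(shots) < 2:
        raise ValueError("need at least two paired points")
    xs = [math.log(n) for n in shots]
    ys = [math.log(v) for v in nll]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("shot counts must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_nll(C, alpha, n_shots):
    """Extrapolate the fitted curve to a new shot count."""
    return C * n_shots ** (-alpha)


# Synthetic points generated from C=4.0, alpha=0.5 (not real measurements).
shots = [1, 4, 16, 64, 256]
nll = [4.0 * n ** -0.5 for n in shots]
C, alpha = fit_power_law(shots, nll)
print(C, alpha, predict_nll(C, alpha, 1024))
```

For a defender, the useful quantities are the two fitted parameters: a mitigation that only raises C delays the attack, while one that lowers alpha changes how quickly additional shots help.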

Why it matters

MSJ is a direct consequence of the race to ever-longer context windows — a feature shipped for legitimate reasons (long documents, agents, RAG) that simultaneously widens the attack surface. It is also disarmingly simple: no obfuscation, no special tokens, no optimization loop, just volume. That makes it cheap to run, easy to vary, and broadly portable across vendors — the kind of attack a red team should assume is available to anyone.

Defenses & detection

No single mitigation fully closes it, and the trade-offs are uncomfortable:

Fine-tuning to refuse the pattern raises the bar but is incomplete: it mostly increases the number of shots the attack needs rather than stopping it.
Prompt classification and modification before inference is the most effective lever Anthropic reported, cutting one attack's success rate from 61% to 2% in their tests.
Capping context length bounds how many shots fit, but trades away the very capability long contexts were built for.
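As a rough illustration of the pre-inference screening idea, the sketch below counts dialogue-style role markers embedded in a single prompt and flags prompts that contain an unusually large number of them. This is a crude heuristic written for illustration, not the classifier Anthropic describes: the marker list, the function names, and the threshold of 20 are all assumptions, and an attacker can evade it by changing labels or formatting.

```python
import re

# Hypothetical role labels; real transcripts may use other labels or formats.
ROLE_MARKER = re.compile(
    r"^[ \t]*(user|human|assistant|ai)[ \t]*:",
    re.IGNORECASE | re.MULTILINE,
)


def count_embedded_turns(prompt: str) -> int:
    """Count lines that begin with a dialogue-style role label."""
    return len(ROLE_MARKER.findall(prompt))


def looks_many_shot(prompt: str, max_turns: int = 20) -> bool:
    """Flag prompts whose embedded turn count exceeds an assumed threshold."""
    return count_embedded_turns(prompt) > max_turns


benign = "Summarize this report in three bullet points."
padded = "User: placeholder question\nAssistant: placeholder answer\n" * 50

print(looks_many_shot(benign), looks_many_shot(padded))
```

A flagged prompt could be routed to a stronger classifier or rewritten before inference; the threshold trades false positives on legitimate transcripts against missed attacks.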