RED TEAM // CANON
the attack canon · black-box

Many-shot Jailbreaking

Anil et al. (Anthropic) · 2024

Pad the prompt with hundreds of fake dialogues in which the assistant cheerfully answers harmful questions, and the model learns — in-context, mid-conversation — to keep doing the same.

What it exploits

Two capabilities that modern LLMs are deliberately built to have. First, large context windows: where early-2023 models held a few thousand tokens, current models accept hundreds of thousands to a million, leaving room for hundreds of priming examples. Second, in-context learning: the model's ability to pick up a pattern from examples in the prompt alone, with no fine-tuning. Many-shot jailbreaking (MSJ) turns both capabilities against the safety training. The attack lives entirely in the prompt, so it is fully black-box and needs no weights, gradients, or API internals.
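To make the context-window point concrete, here is a rough back-of-the-envelope sketch of how many fixed-size example dialogues fit in a window. The 150 tokens per example and the 500-token reserve are illustrative assumptions, not figures from the paper, and `max_shots` is a hypothetical helper name.

```python
def max_shots(context_tokens: int, tokens_per_shot: int, reserved_tokens: int = 0) -> int:
    """Upper bound on how many fixed-size examples fit in a context window.

    reserved_tokens is space held back for the final request and the reply.
    """
    if tokens_per_shot <= 0:
        raise ValueError("tokens_per_shot must be positive")
    usable = max(context_tokens - reserved_tokens, 0)
    return usable // tokens_per_shot


# Assumed sizes: 150 tokens per example, 500 tokens reserved (illustrative only).
for window in (4_000, 200_000, 1_000_000):
    print(window, max_shots(window, tokens_per_shot=150, reserved_tokens=500))
```

Under these assumptions a 4,000-token window holds about two dozen examples, while a million-token window holds several thousand, which is the gap the attack depends on.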

How it works

The attacker constructs a single prompt containing many faux turns of a conversation. In each one, a simulated user asks something the model should refuse, and a simulated assistant complies in detail. The real target request is appended at the end. The model, having "observed" dozens to hundreds of compliant exchanges, continues the established pattern instead of refusing.
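Because the attack packs many role-tagged turns into a single prompt, that structure can be checked for directly. The following is a minimal sketch of such a check, not a defense described by the paper's authors: the role labels, the threshold of 32, and the function names are all assumptions chosen for illustration, and an attacker who changes the formatting would slip past it.

```python
import re

# Hypothetical role labels; real prompts may use other markers or none at all.
TURN_MARKER = re.compile(
    r"^[ \t]*(?:user|human|assistant|ai)[ \t]*:",
    re.IGNORECASE | re.MULTILINE,
)


def count_embedded_turns(message: str) -> int:
    """Count lines inside one message that start with a dialogue role label."""
    return len(TURN_MARKER.findall(message))


def looks_many_shot(message: str, threshold: int = 32) -> bool:
    """Flag a message whose embedded turn count meets an assumed threshold."""
    return count_embedded_turns(message) >= threshold


# Synthetic input with placeholder text only, to exercise the counter.
sample = "".join(
    f"User: [placeholder question {i}]\nAssistant: [placeholder answer {i}]\n"
    for i in range(50)
)
print(count_embedded_turns(sample), looks_many_shot(sample))
```

A flag from a check like this is best treated as a signal for closer review rather than a verdict, since legitimate inputs such as pasted chat transcripts have the same shape.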

Why it matters

MSJ is a direct consequence of the race to ever-longer context windows — a feature shipped for legitimate reasons (long documents, agents, RAG) that simultaneously widens the attack surface. It is also disarmingly simple: no obfuscation, no special tokens, no optimization loop, just volume. That makes it cheap to run, easy to vary, and broadly portable across vendors — the kind of attack a red team should assume is available to anyone.

Defenses & detection

No single mitigation fully closes it, and the trade-offs are uncomfortable: