
Crescendo — Multi-Turn Escalation

Russinovich, Salem, Eldan · 2024

Crescendo jailbreaks an aligned model not with one loaded prompt but across a conversation — opening on a benign, on-topic question and ratcheting toward the disallowed goal one small step at a time, until the model has talked itself past its own guardrails.

What it exploits

Most safety training and content moderation evaluate one message at a time. A filter that would instantly refuse "write malware" sees, in each Crescendo turn, only an innocuous-looking follow-up to an already-acceptable thread. The attack also leans on the model's drive to be consistent and helpful: once it has produced benign context, it treats its own prior output as established ground and is far more willing to extend it. The dangerous payload is never requested directly. It is assembled incrementally, so no single turn trips a refusal.
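
The gap can be shown with a few lines of arithmetic. This is a toy sketch, not taken from the paper: the per-turn risk scores and the threshold are invented numbers standing in for whatever a real moderation classifier would return, and `per_message_flags` and `total_drift` are hypothetical helper names.

```python
# Hypothetical per-turn risk scores in [0, 1] for one slowly escalating
# conversation. The values are invented for illustration only.
turn_scores = [0.05, 0.15, 0.25, 0.35, 0.45]
REFUSAL_THRESHOLD = 0.5  # assumed cutoff applied to each message in isolation


def per_message_flags(scores, threshold=REFUSAL_THRESHOLD):
    """Grade every turn on its own, the way a per-message filter does."""
    return [score >= threshold for score in scores]


def total_drift(scores):
    """Distance between the opening turn and the latest turn."""
    return scores[-1] - scores[0]


flags = per_message_flags(turn_scores)
print(any(flags))                          # False: no single turn is refused
print(round(total_drift(turn_scores), 2))  # 0.4: yet the thread has moved far
```

Every turn stays under the cutoff, while the conversation as a whole has travelled most of the way toward it. That movement is the signal a per-message grader never sees.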

How it works

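A typical Crescendo run unfolds over roughly 5–20 turns: it opens with a benign, on-topic question, then each follow-up builds on the model's previous answer and moves one small step closer to the disallowed goal.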

The authors automate this loop as Crescendomation, where an attacker LLM plans each next turn and judges progress — turning a hand-crafted social-engineering technique into a repeatable, scalable pipeline.

Why it matters

Crescendo is the canonical multi-turn jailbreak — the reference point for why single-turn red teaming is insufficient. It demonstrably defeats per-message moderation and transfers across frontier systems (the paper reports success against ChatGPT, Gemini, and LLaMA-family models). Because it is fully black-box (no weights, logits, or gradients needed) and automatable, it raises the bar for any defense: the threat lives in the trajectory of a dialogue, not in any one input.

Defenses & detection

The mitigation is to evaluate safety at the conversation level rather than per message:
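
A minimal sketch of conversation-level evaluation follows. It rests on several assumptions: `ConversationMonitor`, `toy_scorer`, and `WATCH_TERMS` are hypothetical names, and the scorer is a placeholder keyword counter with neutral terms standing in for a real moderation classifier. Only the structure is the point: the monitor scores the recent transcript as a whole, including the model's own replies, rather than scoring each message alone.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Scorer = Callable[[str], float]  # text -> risk in [0, 1]; supplied by the caller


@dataclass
class ConversationMonitor:
    score_text: Scorer   # hypothetical classifier, injected by the caller
    threshold: float = 0.5
    window: int = 10     # number of most recent messages scored together
    history: List[str] = field(default_factory=list)

    def observe(self, message: str) -> bool:
        """Record a user or assistant message; return True if the recent
        transcript, scored as one text, reaches the threshold."""
        self.history.append(message)
        transcript = "\n".join(self.history[-self.window:])
        return self.score_text(transcript) >= self.threshold


# Placeholder scorer: fraction of neutral watch terms present in the text.
WATCH_TERMS = ("alpha", "beta", "gamma")


def toy_scorer(text: str) -> float:
    lowered = text.lower()
    return sum(term in lowered for term in WATCH_TERMS) / len(WATCH_TERMS)


messages = ["tell me about alpha", "and how does beta relate?", "now add gamma"]
monitor = ConversationMonitor(score_text=toy_scorer)

per_message = [toy_scorer(m) >= monitor.threshold for m in messages]
per_conversation = [monitor.observe(m) for m in messages]

print(per_message)       # [False, False, False]
print(per_conversation)  # [False, True, True]
```

In this toy run each message passes on its own, but the accumulated transcript is flagged from the second turn onward. A sliding window is one possible design choice. It bounds cost per turn but could miss escalation that is stretched across more messages than the window holds.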