the attack canonblack-box

Crescendo — Multi-Turn Escalation

Russinovich, Salem, Eldan · 2024

Paper The Crescendo Multi-Turn LLM Jailbreak Attack ↗

Crescendo jailbreaks an aligned model not with one loaded prompt but across a conversation — opening on a benign, on-topic question and ratcheting toward the disallowed goal one small step at a time, until the model has talked itself past its own guardrails.

What it exploits

Most safety alignment and content moderation grades one message at a time. A filter that would instantly refuse "write malware" sees, in each Crescendo turn, only an innocuous-looking follow-up to an already-acceptable thread. The attack also leans on the model's drive to be consistent and helpful: once it has produced benign context, it treats its own prior output as established ground and is far more willing to extend it. The dangerous payload is never requested directly — it is assembled incrementally so no single turn trips a refusal.

How it works

A typical Crescendo run unfolds over roughly 5–20 turns:

Benign opening — ask a broad, legitimate question near the target topic (e.g., the history or general background of the subject).
Gradual steering — each turn nudges one notch closer, explicitly referencing the model's own previous answers ("expand on the third point you mentioned," "now write that as an article") to anchor the next escalation in already-accepted material.
Backtrack on refusal — if a turn triggers a refusal, rephrase or retreat a step rather than confronting the guardrail head-on, then resume the climb.
Convergence — the final ask is a thin increment over content the model has already produced, so it completes the disallowed task without ever seeing a stark policy-violating prompt.

The authors automate this loop as Crescendomation, where an attacker LLM plans each next turn and judges progress — turning a hand-crafted social-engineering technique into a repeatable, scalable pipeline.

Why it matters

Crescendo is the canonical multi-turn jailbreak — the reference point for why single-turn red teaming is insufficient. It demonstrably defeats per-message moderation and transfers across frontier systems (the paper reports success against ChatGPT, Gemini, and LLaMA-family models). Because it is fully black-box (no weights, logits, or gradients needed) and automatable, it raises the bar for any defense: the threat lives in the trajectory of a dialogue, not in any one input.

Defenses & detection

The mitigation is to evaluate safety at the conversation level rather than per message:

Run a moderation/judge model over the full dialogue history plus the candidate response, scoring cumulative drift toward a prohibited goal — not just the latest turn in isolation.
Watch for escalation signatures: monotonic topic drift, repeated self-reference, and rephrase-after-refusal patterns.
Apply output-side filtering to the aggregate of what the model has produced across turns, since the harmful artifact is distributed across the thread. Related work (e.g., Microsoft's Azure AI Content Safety prompt-shield / jailbreak detection ↗) moves in this conversation-aware direction.