
AutoDAN — Stealthy Jailbreak Prompts

Liu, Xu, Chen, Xiao · 2024

AutoDAN automatically evolves fluent, human-readable jailbreak prompts using a genetic algorithm, combining the automation of optimization attacks with the natural-language stealth of handcrafted "DAN" prompts.

What it exploits

Safety alignment is trained against the kinds of phrasings the model saw during RLHF, but the space of fluent natural-language wrappers that re-frame a harmful request is enormous and only thinly covered. Hand-built "Do Anything Now" (DAN) prompts exploit this by role-playing the model out of its guardrails — but they are static, get patched, and don't generalize. AutoDAN attacks the same weakness automatically: it searches that wrapper space for new phrasings that keep the request fluent and on-distribution while still bypassing refusal. Because the output reads like ordinary English, it also slips past the second line of defense — perplexity-based input filters that flag statistically unnatural text.
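To make the perplexity-filter point concrete, here is a minimal sketch of how such a filter behaves. It assumes a toy add-one-smoothed unigram model as a stand-in for the language model a real filter would use; the reference corpus, the `UnigramLM` class, `is_flagged`, and the threshold are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Tiny reference corpus standing in for a real language model (assumption).
REFERENCE = (
    "please write a short story about a friendly robot . "
    "the robot is asked to explain how the story ends . "
    "write the answer as a helpful assistant would ."
)


def tokenize(text):
    """Lowercase, then split into alphabetic words and single symbols."""
    return re.findall(r"[a-z]+|[^a-z\s]", text.lower())


class UnigramLM:
    """Toy unigram model with add-one smoothing (not a real LM)."""

    def __init__(self, corpus):
        tokens = tokenize(corpus)
        self.counts = Counter(tokens)
        self.total = len(tokens)
        self.vocab_size = len(self.counts) + 1  # one extra slot for unseen tokens

    def log_prob(self, token):
        # Unseen tokens get a small but nonzero probability.
        return math.log((self.counts[token] + 1) / (self.total + self.vocab_size))

    def perplexity(self, text):
        tokens = tokenize(text)
        if not tokens:
            return float("inf")
        avg_nll = -sum(self.log_prob(t) for t in tokens) / len(tokens)
        return math.exp(avg_nll)


def is_flagged(lm, text, threshold):
    """Flag input whose perplexity under the model exceeds the threshold."""
    return lm.perplexity(text) > threshold


lm = UnigramLM(REFERENCE)
# Hypothetical threshold; a real filter would calibrate it on benign traffic.
threshold = 0.75 * (lm.total + lm.vocab_size)
for text in ["please explain how the story ends .", "zxq ]]( vvj ## qzk @@"]:
    print(round(lm.perplexity(text), 1), is_flagged(lm, text, threshold))
```

In this toy setup the symbol-heavy string scores far higher than the ordinary sentence, so only it crosses the threshold. That is the gap a fluent, on-distribution prompt walks through: the filter measures how natural the text looks, not what it asks for.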

How it works

AutoDAN frames jailbreak discovery as a gradient-free optimization over text and solves it with a hierarchical genetic algorithm. It still counts as a white-box attack because it scores candidates against the target model's loss on producing an affirmative (non-refusing) response, which requires access to the model's output likelihoods. It never touches token-level gradients, though, and operates entirely on readable sentences.

The result is a prompt that is automatically optimized yet still reads as coherent, meaningful English.
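The shape of that search loop can be sketched on a harmless toy problem. This is not the authors' implementation: the fitness below is a hypothetical word-count score rather than a model loss, the `SYNONYMS` table stands in for whatever rewording step a real system would use, and "hierarchical" is reduced to two levels (sentence-level crossover and word-level mutation). All function names are illustrative.

```python
import math
import random

# Hypothetical synonym table used as a stand-in rewording step (assumption).
SYNONYMS = {"quick": "fast", "story": "tale", "write": "compose", "short": "brief"}


def toy_fitness(prompt):
    """Toy score: number of words from a fixed target set (NOT a model loss)."""
    target = {"fast", "tale", "compose", "brief"}
    return sum(w in target for w in " ".join(prompt).lower().split())


def crossover(a, b, rng, rate=0.5):
    """Sentence level: swap aligned sentences between two parents."""
    child_a, child_b = list(a), list(b)
    for i in range(min(len(a), len(b))):
        if rng.random() < rate:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b


def mutate(prompt, rng, rate=0.2):
    """Word level: swap a word for its synonym with probability `rate`."""
    return [
        " ".join(SYNONYMS.get(w, w) if rng.random() < rate else w for w in s.split())
        for s in prompt
    ]


def select(population, scores, rng, k):
    """Sample k parents with probability proportional to softmax(score)."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(population, weights=weights, k=k)


def evolve(population, fitness, generations=10, elite=1, seed=0):
    """Run the loop; return the best prompt and the best score per generation."""
    rng = random.Random(seed)
    history = []
    for _ in range(generations):
        scores = [fitness(p) for p in population]
        history.append(max(scores))
        ranked = [p for _, p in sorted(zip(scores, population),
                                       key=lambda t: t[0], reverse=True)]
        next_gen = ranked[:elite]  # elitism: best candidates survive unchanged
        while len(next_gen) < len(population):
            parent_a, parent_b = select(population, scores, rng, k=2)
            child_a, child_b = crossover(parent_a, parent_b, rng)
            next_gen.extend([mutate(child_a, rng), mutate(child_b, rng)])
        population = next_gen[:len(population)]
    return max(population, key=fitness), history


if __name__ == "__main__":
    start = [["write a short story", "make it quick"],
             ["tell a story", "keep it short"],
             ["write a poem", "be quick about it"]]
    best, history = evolve(start, toy_fitness, generations=15, seed=1)
    print(best, history)
```

Every operator here manipulates whole sentences or whole words, so each candidate stays readable at every step. Elitism guarantees the best score never decreases from one generation to the next.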

Why it matters

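AutoDAN sits between two prior worlds and takes the best of each. The contrast with GCG (Zou et al., 2023) is the key insight: GCG automates the search with token-level gradients, but the adversarial suffixes it produces are unnatural, high-perplexity strings that a perplexity filter can catch. Handcrafted DAN prompts read naturally but are written by hand and do not scale. AutoDAN keeps the automation of the former and the fluency of the latter.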

The AutoDAN paper also reported better cross-model transfer and cross-sample universality than handcrafted baselines. The takeaway for defenders: stylometric and perplexity heuristics are not a durable defense, and automated attacks no longer have to look anomalous.

Defenses & detection

Because the attack is fluent, surface-level filtering is weak. More robust mitigations operate on intent and output rather than text statistics:
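As a rough illustration of checking intent and output rather than text statistics, the sketch below wraps generation in two checks. Everything in it is a hypothetical placeholder: `generate`, `flags_intent`, and `flags_output` stand for a model call and two classifiers that a deployment would have to supply, and the refusal string is arbitrary.

```python
from typing import Callable

REFUSAL = "I can't help with that."


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    flags_intent: Callable[[str], bool],
    flags_output: Callable[[str], bool],
) -> str:
    """Refuse if the request's intent or the model's response is flagged."""
    # Intent check: judges what is being asked, however fluent the wrapper is.
    if flags_intent(prompt):
        return REFUSAL
    response = generate(prompt)
    # Output check: catches harmful content even when the prompt looked benign.
    if flags_output(response):
        return REFUSAL
    return response


if __name__ == "__main__":
    # Stub components, for demonstration only.
    echo_model = lambda p: f"response to: {p}"
    never = lambda text: False
    print(guarded_generate("hello", echo_model, never, never))
```

Neither check depends on how natural the prompt looks, which is the property a fluent attack is built to exploit.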