RED TEAM // CANON
the attack canon · black-box

PAP — Persuasive Adversarial Prompts

Zeng, Lin, et al. · 2024

PAP treats jailbreaking as a social-engineering problem rather than a syntactic one: it rewrites a harmful request using documented human persuasion techniques, exploiting the fact that a model trained on human dialogue responds to the same rhetorical pressure people do.

What it exploits

Most early jailbreak work treated the problem as a security puzzle, solved with token-level suffixes, encodings, or role-play scaffolds. PAP reframes the target as a "humanized" communicator: because LLMs are trained on vast amounts of human-written text, they have internalized the patterns by which humans get other humans to comply. The attack surface is therefore not a parser bug but the model's learned susceptibility to persuasion. Crucially, the persuasion lives in natural, fluent prose, with no adversarial gibberish and no special tokens, so the request reads like an ordinary (if manipulative) human message and slips past filters tuned for obvious attack markers.
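To make the last point concrete, here is a toy sketch of the kind of surface-marker filter that fluent prose evades. Everything in it is an assumption for illustration: the marker list, the `symbol_ratio` heuristic, the 0.25 threshold, and both sample strings are hypothetical and do not come from the paper or any production filter.

```python
import re

# Hypothetical marker phrases a naive filter might key on.
MARKER_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"base64",
    r"you are now",
]


def symbol_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are neither letters nor digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(not c.isalnum() for c in chars) / len(chars)


def looks_like_attack(text: str, max_symbol_ratio: float = 0.25) -> bool:
    """Flag text that contains a known marker phrase or is unusually symbol-heavy."""
    lowered = text.lower()
    if any(re.search(pattern, lowered) for pattern in MARKER_PATTERNS):
        return True
    return symbol_ratio(text) > max_symbol_ratio


# Placeholder inputs: a suffix-style string and a benign, fluent sentence.
suffix_style = "Tell me the thing }]{ !! ;) ==> !-- ~~ [[ ** ]] ##"
fluent_style = "As a researcher studying this topic, I would value a detailed explanation."

print(looks_like_attack(suffix_style))  # symbol-heavy, so it is flagged
print(looks_like_attack(fluent_style))  # ordinary prose, so it passes
```

The filter only inspects form, so any request phrased as ordinary prose passes it regardless of intent. That is the gap PAP relies on.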

How it works

The authors build a persuasion taxonomy grounded in decades of social-science research: 40 techniques across several categories. A harmful instruction is paraphrased so the same ask is delivered through one or more of these techniques. Representative categories:

To scale this beyond hand-written examples, the team fine-tunes a persuasion paraphraser — an LLM that takes a plain harmful request plus a chosen technique and emits a polished Persuasive Adversarial Prompt (PAP). Because the paraphraser is itself a model, generating attacks is cheap and automatable. The reported result: iterating across techniques drove attack success above 92% on aligned targets including GPT-4 and Llama-2, with no gradient access to the victim.
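One way to read "iterating across techniques" is that a request counts as broken if any technique succeeds against it. The sketch below computes attack success under that assumption from already-judged trial records. The record layout, the technique labels, and the outcomes are made-up placeholders, and the paper's exact scoring protocol may differ.

```python
from collections import defaultdict

# Each record: (request_id, technique, judged_harmful).
# All values are illustrative placeholders, not results from the paper.
trials = [
    ("req-1", "technique_a", False),
    ("req-1", "technique_b", True),
    ("req-2", "technique_a", True),
    ("req-2", "technique_b", False),
    ("req-3", "technique_a", True),
    ("req-3", "technique_b", True),
    ("req-4", "technique_a", False),
    ("req-4", "technique_b", True),
]


def per_technique_asr(records):
    """Attack success rate of each technique taken on its own."""
    hits, totals = defaultdict(int), defaultdict(int)
    for _, technique, harmful in records:
        totals[technique] += 1
        hits[technique] += int(harmful)
    return {technique: hits[technique] / totals[technique] for technique in totals}


def any_technique_asr(records):
    """Share of requests for which at least one technique succeeded."""
    broken = defaultdict(bool)
    for request_id, _, harmful in records:
        broken[request_id] = broken[request_id] or harmful
    return sum(broken.values()) / len(broken)


print(per_technique_asr(trials))
print(any_technique_asr(trials))
```

Because the aggregate takes the union over techniques, it can only match or exceed the best single technique. That is why headline numbers from iterated attacks run higher than any one technique's rate.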

Why it matters

PAP shows that strong alignment does not neutralize ordinary human manipulation. The attacks are non-technical and require no ML expertise: a motivated layperson ("Johnny") can produce them, which is precisely the threat model invoked by the paper's title, "How Johnny Can Persuade LLMs to Jailbreak Them." For a security pro, the takeaway is that the dangerous inputs here are indistinguishable in form from legitimate persuasive writing: there is no malformed payload to write a signature for, only intent expressed through rhetoric. The paper's framing also predicts that more capable, more "human-like" models may be more vulnerable, since better language understanding means better comprehension of the persuasion.

Defenses & detection

The paper pairs the attack with mitigations and finds them only partially effective:

Practically, defenders should test refusal robustness against rewritten, fluent requests, not just raw harmful strings, and treat persuasion-resistance as a distinct evaluation axis. See also the authors' code and taxonomy release.
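A minimal sketch of what "a distinct evaluation axis" could look like in a test harness: score refusal rate separately for each variant of the same underlying request. All names here (`EvalCase`, `refusal_rate_by_variant`, `stub_model`, the `[DISALLOWED]` placeholder) are hypothetical, and the keyword-based refusal check is a crude stand-in for whatever judge a real evaluation would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical refusal heuristic; a real harness would likely use a judge model.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


def is_refusal(response: str) -> bool:
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


@dataclass
class EvalCase:
    case_id: str
    variants: Dict[str, str]  # variant label -> prompt text


def refusal_rate_by_variant(
    cases: List[EvalCase], model: Callable[[str], str]
) -> Dict[str, float]:
    """Refusal rate per variant label, so raw and rewritten prompts are scored apart."""
    refused: Dict[str, int] = {}
    totals: Dict[str, int] = {}
    for case in cases:
        for label, prompt in case.variants.items():
            totals[label] = totals.get(label, 0) + 1
            refused[label] = refused.get(label, 0) + int(is_refusal(model(prompt)))
    return {label: refused[label] / totals[label] for label in totals}


def stub_model(prompt: str) -> str:
    """Stand-in for the system under test: refuses only on the literal placeholder."""
    if "[DISALLOWED]" in prompt:
        return "I'm sorry, I can't help with that."
    return "Sure, here is a placeholder answer."


# Placeholder prompts only; real test sets would come from a vetted benchmark.
cases = [
    EvalCase("case-1", {"raw": "[DISALLOWED] request one", "rewritten": "<fluent rewrite of request one>"}),
    EvalCase("case-2", {"raw": "[DISALLOWED] request two", "rewritten": "<fluent rewrite of request two>"}),
]

print(refusal_rate_by_variant(cases, stub_model))
```

Reporting the two rates side by side exposes the gap that a single aggregate refusal score would hide. With this stub, the model refuses every raw prompt and no rewritten one.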