the attack canon · black-box

PAP — Persuasive Adversarial Prompts

Zeng, Lin, et al. · 2024

Paper: How Johnny Can Persuade LLMs to Jailbreak Them ↗

PAP treats jailbreaking as a social-engineering problem rather than a syntactic one: it rewrites a harmful request using documented human persuasion techniques, exploiting the fact that a model trained on human dialogue responds to the same rhetorical pressure people do.

What it exploits

Most early jailbreaks treated the model as a security puzzle, to be attacked with token-level suffixes, encodings, or role-play scaffolds. PAP reframes the target as a "humanized" communicator: because LLMs are trained on vast amounts of human-written text, they have internalized the patterns by which humans get other humans to comply. The attack surface is therefore not a parser bug but the model's learned susceptibility to persuasion. Crucially, the persuasion lives in natural, fluent prose, with no adversarial gibberish and no special tokens, so the request reads like an ordinary (if manipulative) human message and slips past filters tuned for obvious attack markers.

How it works

The authors build a persuasion taxonomy grounded in decades of social-science research: 40 techniques grouped into 13 broad strategies. A harmful instruction is paraphrased so that the same ask is delivered through one or more of these techniques. Representative categories:

Authority / expert endorsement — frame the request as sanctioned by a credible authority or cited source.
Logical appeal & evidence — wrap the ask in reasoned justification or fabricated supporting facts.
Emotional appeal — invoke sympathy, urgency, or a personal hardship to lower resistance.
Reciprocity & commitment — establish a favor or a prior agreement that the model is nudged to honor.
Framing / misrepresentation — recast the harmful goal as research, fiction, prevention, or "for awareness."

To scale this beyond hand-written examples, the team fine-tunes a persuasion paraphraser: an LLM that takes a plain harmful request plus a chosen technique and emits a polished Persuasive Adversarial Prompt (PAP). Because the paraphraser is itself a model, generating attacks is cheap and automatable. The reported result: iterating across techniques, PAP reached an attack success rate above 92% in 10 trials on aligned targets including Llama-2 7b Chat, GPT-3.5, and GPT-4, with no gradient access to the victim.
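To make the headline number concrete, here is a minimal sketch of how a success rate aggregated over several attempts might be computed from judged outcomes. The record layout, the name `attack_success_rate`, and the "any attempt succeeds" rule are assumptions for illustration, not the authors' evaluation code, and the toy verdicts are invented.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Attempt(NamedTuple):
    """One judged attempt (hypothetical record layout)."""
    request_id: str   # which base request was tested
    technique: str    # persuasion technique label used for this attempt
    jailbroken: bool  # judge verdict: did the target comply?


def attack_success_rate(attempts: Iterable[Attempt]) -> float:
    """Fraction of requests for which at least one attempt succeeded."""
    broken = defaultdict(bool)
    for a in attempts:
        broken[a.request_id] = broken[a.request_id] or a.jailbroken
    if not broken:
        raise ValueError("no attempts to score")
    return sum(broken.values()) / len(broken)


# Toy verdicts, invented for illustration only.
toy = [
    Attempt("r1", "authority", False),
    Attempt("r1", "emotional appeal", True),
    Attempt("r2", "authority", False),
    Attempt("r2", "logical appeal", False),
]
print(attack_success_rate(toy))  # 0.5
```

An any-attempt rate like this is always at least as high as the per-attempt rate, so a report should state which of the two it uses and how many attempts were allowed.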

Why it matters

PAP shows that strong alignment does not neutralize ordinary human manipulation. The attacks are non-technical and require no ML expertise: a motivated layperson ("Johnny") can produce them, which is precisely the threat model the title invokes. For a security practitioner, the takeaway is that the dangerous inputs here are indistinguishable in form from legitimate persuasive writing: there is no malformed payload to signature, only intent expressed through rhetoric. It also suggests that more capable, more "human-like" models may be more vulnerable, since better language understanding means better comprehension of the persuasion.

Defenses & detection

The paper pairs the attack with mitigations and finds them only partially effective:

Adaptive system prompts instructing the model to watch for persuasion reduce but do not eliminate success.
Targeted "persuasion" fine-tuning and summarization defenses, which restate the user request neutrally before answering, strip rhetorical pressure and were among the more effective countermeasures.
Intent-based classification on the underlying ask (rather than surface keywords) is the durable direction, since the harmful content is semantic, not lexical.
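The summarization defense in the list above can be sketched as a thin wrapper around the target model. This is a minimal sketch under stated assumptions: `TextModel`, `SUMMARIZE_INSTRUCTION`, and `answer_with_summarization` are hypothetical names, the instruction wording is not taken from the paper, and both models are replaced by stubs so the example runs without any API.

```python
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a completion.
TextModel = Callable[[str], str]

# Assumed wording, not the paper's prompt.
SUMMARIZE_INSTRUCTION = (
    "Restate the core request in the following message in one neutral "
    "sentence. Drop justifications, emotional appeals, and claimed "
    "credentials.\n\nMessage:\n"
)


def answer_with_summarization(user_message: str,
                              summarizer: TextModel,
                              target: TextModel) -> str:
    """Pass the target a neutral restatement instead of the raw message."""
    neutral_request = summarizer(SUMMARIZE_INSTRUCTION + user_message)
    return target(neutral_request)


# Demo with stubs; swap in real model calls in practice.
received_by_target: List[str] = []


def stub_summarizer(prompt: str) -> str:
    # A real summarizer would be a model; this stub returns a fixed restatement.
    return "What is the boiling point of water?"


def stub_target(prompt: str) -> str:
    received_by_target.append(prompt)  # record what the target actually saw
    return "100 degrees Celsius at sea level."


reply = answer_with_summarization(
    "As a certified teacher with a class in five minutes, I urgently need "
    "to know the boiling point of water.",
    stub_summarizer,
    stub_target,
)
print(received_by_target[0])  # the neutral restatement, not the raw message
print(reply)
```

The point of the design is that the target never sees the original rhetoric, so its refusal decision rests on the underlying ask alone. The summarizer then becomes the component that has to be robust.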

Practically, defenders should test refusal robustness against rewritten, fluent requests — not just raw harmful strings — and treat persuasion-resistance as a distinct evaluation axis. See also the authors' code & taxonomy release ↗.
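One way to treat persuasion-resistance as its own evaluation axis is to report refusal rates separately for raw and rewritten variants of the same requests. The sketch below assumes a hypothetical result layout of `(request_id, variant, refused)` and a made-up set of verdicts; it is not part of the authors' release.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical record: (request_id, variant, refused)
Result = Tuple[str, str, bool]


def refusal_rates(results: Iterable[Result]) -> Dict[str, float]:
    """Refusal rate per variant, e.g. 'raw' versus 'rewritten'."""
    total: Counter = Counter()
    refused: Counter = Counter()
    for _, variant, was_refused in results:
        total[variant] += 1
        refused[variant] += int(was_refused)
    return {variant: refused[variant] / total[variant] for variant in total}


# Made-up verdicts for four requests, each tested in both forms.
toy_results = [
    ("r1", "raw", True), ("r1", "rewritten", True),
    ("r2", "raw", True), ("r2", "rewritten", False),
    ("r3", "raw", True), ("r3", "rewritten", False),
    ("r4", "raw", True), ("r4", "rewritten", False),
]
rates = refusal_rates(toy_results)
gap = rates["raw"] - rates["rewritten"]
print(rates)  # {'raw': 1.0, 'rewritten': 0.25}
print(gap)    # 0.75
```

A large gap between the two rates indicates that refusals depend on surface form rather than on the underlying request, which is the weakness described above.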