
TAP — Tree of Attacks with Pruning

Mehrotra, Zampetakis, et al. · 2024

Paper: Tree of Attacks: Jailbreaking Black-Box LLMs Automatically

TAP is a fully automated black-box jailbreak: an attacker LLM grows a tree of prompt rewrites, and weak branches are pruned before they ever reach the target, so a working jailbreak is found with relatively few target queries.

What it exploits

The same gap every social-engineering attack leans on: a model's safety training is shallow and context-sensitive, not a hard filter. A goal the target refuses outright is often answered once it's reframed, role-played, or buried in a plausible pretext. TAP assumes nothing about weights, logits, or the system prompt. It only needs to send text and read the reply, which is exactly the access an external attacker has against a hosted API like GPT-4o or Claude.

How it works

TAP is a direct descendant of PAIR (Prompt Automatic Iterative Refinement), which used one attacker LLM to refine a single adversarial prompt in a chat loop. TAP swaps that linear loop for tree-of-thought search over many candidate prompts at once, governed by three cooperating roles:

Attacker LLM — at each node, branches the current prompt into several refined variants (new framings, personas, obfuscations).
Evaluator / judge LLM — scores each candidate two ways: how on-topic it still is relative to the goal, and how close the target's response came to a jailbreak.
Target LLM — the black-box model under attack; only the surviving candidates are actually queried.

The pruning is the whole trick, and it happens in two places. Off-topic pruning discards candidates that have drifted away from the goal before a target query is spent on them. Width pruning keeps only the top-scoring nodes at each depth, so the tree explores breadth without exploding. The loop repeats (branch, prune off-topic, query target, score, keep the best) until a response crosses the jailbreak threshold or the depth budget runs out.
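The loop described above can be sketched as a small search skeleton. This is a minimal illustration under stated assumptions, not the paper's reference implementation: the function name `tap_search`, the parameters `width`, `depth`, and `threshold`, and the four callables are hypothetical, and the `toy_*` stand-ins operate on meaningless letter strings so the control flow can be run without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    prompt: str
    score: float = 0.0  # judge score for the target's reply to this prompt


def tap_search(
    goal: str,
    attacker: Callable[[str, str], List[str]],  # (goal, parent prompt) -> variants
    on_topic: Callable[[str, str], bool],       # (goal, prompt) -> still on goal?
    target: Callable[[str], str],               # prompt -> target response
    judge: Callable[[str, str], float],         # (goal, response) -> score in [0, 1]
    width: int = 3,          # hypothetical: nodes kept per depth
    depth: int = 4,          # hypothetical: maximum tree depth
    threshold: float = 0.9,  # hypothetical: score that counts as success
) -> Optional[Node]:
    frontier = [Node(prompt=goal)]
    for _ in range(depth):
        # Branch: every surviving node is expanded into refined variants.
        candidates = [
            Node(p) for node in frontier for p in attacker(goal, node.prompt)
        ]
        # Off-topic pruning: drop drifted candidates before any target query.
        candidates = [c for c in candidates if on_topic(goal, c.prompt)]
        if not candidates:
            return None
        # Query the target only with survivors, and score each response.
        for c in candidates:
            c.score = judge(goal, target(c.prompt))
            if c.score >= threshold:
                return c
        # Width pruning: keep only the top-scoring nodes for the next depth.
        frontier = sorted(candidates, key=lambda n: n.score, reverse=True)[:width]
    return None


# Toy stand-ins (assumptions for illustration only; no model is involved).
def toy_attacker(goal: str, parent: str) -> List[str]:
    return [parent + "a", parent + "b", "off"]


def toy_on_topic(goal: str, prompt: str) -> bool:
    return prompt.startswith(goal)


def toy_target(prompt: str) -> str:
    return prompt  # echoes the prompt back


def toy_judge(goal: str, response: str) -> float:
    return min(response.count("a") / 3, 1.0)


result = tap_search("g", toy_attacker, toy_on_topic, toy_target, toy_judge)
```

In this sketch, if the attacker returns at most b variants per node, the first level costs at most b target queries and each later level at most `width` × b, which is how width pruning bounds the total query count. Candidates rejected by `on_topic` never reach `target` at all.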

Why it matters

Compared with PAIR's single thread of refinement, the tree explores many more strategies in parallel, while pruning keeps that breadth query-efficient. The paper reports jailbreaking GPT-4, GPT-4 Turbo, and GPT-4o on over 80% of prompts using fewer target queries than prior black-box methods, and it also reports success against targets protected by guardrails such as Llama Guard. For a red teamer that means automated, scalable coverage with no model internals required: TAP is a strong baseline for measuring how a deployed assistant holds up against a realistic, API-only adversary.

Defenses & detection

Because TAP is purely black-box and semantically fluent, suffix/perplexity filters that catch GCG-style gibberish do little here — the prompts read like ordinary (if manipulative) requests. More durable mitigations:

Iterative-probing detection — TAP must hammer the target with many escalating, semantically related queries. Per-session rate limits and anomaly detection on bursts of near-duplicate refinements raise the cost.
Output-side guardrails — an independent classifier on the response (the attack judges responses, so defenders should too); note the paper shows Llama Guard alone is not sufficient.
Robustness / adversarial training against reframing, role-play, and pretext patterns rather than specific strings.
Continuous red teaming — run TAP itself as an evaluation harness against your stack so coverage tracks each model and guardrail update.
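As a rough illustration of the iterative-probing detection idea in the list above, the sketch below flags a session once it sends several prompts that closely resemble its own recent prompts. Everything here is an assumption made for illustration: the class name `ProbeDetector`, its parameters, and the thresholds are hypothetical, and lexical similarity from Python's standard `difflib` is only a stand-in. TAP's refinements are semantically related but can be worded very differently, so a real deployment would more plausibly compare embeddings.

```python
from collections import defaultdict, deque
from difflib import SequenceMatcher
from typing import Deque, Dict


class ProbeDetector:
    """Flags sessions that send bursts of near-duplicate prompts."""

    def __init__(self, window: int = 20, similarity: float = 0.8,
                 max_similar: int = 5) -> None:
        self.similarity = similarity    # ratio at which two prompts count as alike
        self.max_similar = max_similar  # alike prompts in the window that trip the flag
        # Per-session sliding window holding the most recent prompts.
        self._recent: Dict[str, Deque[str]] = defaultdict(
            lambda: deque(maxlen=window)
        )

    def observe(self, session_id: str, prompt: str) -> bool:
        """Record a prompt; return True if the session looks like probing."""
        recent = self._recent[session_id]
        # Count earlier prompts in the window that closely resemble this one.
        similar = sum(
            1
            for old in recent
            if SequenceMatcher(None, old, prompt).ratio() >= self.similarity
        )
        recent.append(prompt)
        return similar >= self.max_similar
```

A caller would invoke `observe(session_id, prompt)` on every incoming request and, when it returns `True`, throttle the session or route it for review. The window size and thresholds would need tuning against real traffic, since legitimate users also rephrase and retry.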