RED TEAM // CANON

TAP — Tree of Attacks with Pruning

Mehrotra, Zampetakis, et al. · 2024

TAP is a fully automated black-box jailbreak that drives an attacker LLM to grow a tree of prompt rewrites, pruning weak branches before they ever reach the target, so a working jailbreak is usually found with relatively few target queries.

What it exploits

The same gap every social-engineering attack leans on: a model's safety training is shallow and context-sensitive, not a hard filter. A goal the target refuses outright will often be answered once it is reframed, role-played, or buried in a plausible pretext. TAP assumes nothing about weights, logits, or the system prompt. It only needs to send text and read the reply, which is exactly the access an external attacker has against a hosted API like GPT-4o or Claude.

How it works

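TAP is a direct descendant of PAIR (Prompt Automatic Iterative Refinement), which used one attacker LLM to refine a single adversarial prompt in a chat loop. TAP swaps that linear loop for tree-of-thought search over many candidate prompts at once, governed by three cooperating roles: an attacker LLM that proposes prompt rewrites, an evaluator LLM that judges whether each rewrite is still on-topic and scores the target's replies, and the target model under test.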

The pruning is the whole trick, and it happens in two places. Off-topic pruning kills branches where the attacker has drifted away from the goal, before a target query is spent on them. Width pruning keeps only the top-scoring nodes at each depth, so the tree explores breadth without exploding. The loop repeats (branch, prune off-topic, query target, score, keep the best) until a response crosses the jailbreak threshold or the depth budget runs out.
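The control flow of that loop can be sketched in a few lines. This is a minimal illustration of the two pruning phases only, not the authors' implementation: the function name `tree_search_with_pruning` and all of its parameters are hypothetical, and the attacker, evaluator, and target are passed in as plain callables. The toy stand-ins at the bottom operate on letter strings rather than real models.

```python
from typing import Callable, List, Optional, Tuple


def tree_search_with_pruning(
    root: str,
    branch: Callable[[str], List[str]],   # attacker role: propose rewrites of a node
    on_topic: Callable[[str], bool],      # evaluator role: is the rewrite still on goal?
    query_target: Callable[[str], str],   # target role: send text, read the reply
    score: Callable[[str, str], int],     # evaluator role: rate (prompt, reply)
    width: int,
    depth: int,
    threshold: int,
) -> Tuple[Optional[str], int]:
    """Return (first prompt whose score meets the threshold, or None,
    number of target queries spent)."""
    frontier = [root]
    queries = 0
    for _ in range(depth):
        # Branch: every surviving node proposes child rewrites.
        children = [child for node in frontier for child in branch(node)]
        # Pruning phase 1: drop off-topic children BEFORE any target query.
        children = [child for child in children if on_topic(child)]
        if not children:
            return None, queries
        scored = []
        for child in children:
            reply = query_target(child)
            queries += 1
            s = score(child, reply)
            if s >= threshold:
                return child, queries  # stop as soon as the threshold is crossed
            scored.append((s, child))
        # Pruning phase 2: keep only the `width` best-scoring nodes.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        frontier = [child for _, child in scored[:width]]
    return None, queries  # depth budget exhausted


# Toy stand-ins (assumptions for illustration only): nodes are letter strings,
# "x" marks an off-topic rewrite, and the score counts "A" in the reply.
result = tree_search_with_pruning(
    root="",
    branch=lambda node: [node + "a", node + "b", node + "x"],
    on_topic=lambda prompt: "x" not in prompt,
    query_target=lambda prompt: prompt.upper(),
    score=lambda prompt, reply: reply.count("A"),
    width=2,
    depth=5,
    threshold=3,
)
print(result)  # ('aaa', 7)
```

In the toy run, the candidates ending in "x" are discarded without ever being sent to `query_target`, which is the point of pruning before the query. In a real evaluation harness the callables would wrap model calls, and the scoring scale and threshold would be whatever the evaluator is configured to use.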

Why it matters

Against PAIR's single thread of refinement, the tree explores many more strategies in parallel, while pruning keeps that breadth query-efficient. The paper reports jailbreaking GPT-4, GPT-4 Turbo, and GPT-4o on over 80% of prompts using fewer target queries than prior black-box methods, and it also gets past deployed guardrails such as Llama Guard. For a red teamer that means automated, scalable coverage with no model internals required: TAP is a strong baseline for measuring how a deployed assistant holds up against a realistic, API-only adversary.
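To see why pruning keeps the search cheap, it helps to bound the worst case. The symbols here are assumptions introduced for illustration: $b$ is the number of rewrites proposed per node, $w$ the width kept per depth, and $d$ the depth budget. The first level expands a single root, and every later level expands at most $w$ survivors, so the number of target queries $Q$ satisfies:

```latex
Q \;\le\; \underbrace{b}_{\text{depth } 1} \;+\; \underbrace{(d-1)\,w\,b}_{\text{depths } 2,\dots,d}
\qquad\text{e.g. } b=4,\ w=10,\ d=10 \;\Rightarrow\; Q \le 4 + 9\cdot 40 = 364 .
```

The bound grows linearly in depth, where an unpruned tree would grow as $b^d$. Off-topic pruning and early stopping at the jailbreak threshold only lower the actual count, so real runs can finish well under this ceiling.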

Defenses & detection

Because TAP is purely black-box and semantically fluent, suffix/perplexity filters that catch GCG-style gibberish do little here — the prompts read like ordinary (if manipulative) requests. More durable mitigations: