measurement · benchmark

JailbreakBench

Chao, Debenedetti, Robey et al. · 2024

An open, reproducible robustness benchmark for jailbreaking LLMs (NeurIPS 2024 Datasets & Benchmarks). It standardizes what counts as a successful jailbreak, pins the threat model, and forces every leaderboard entry to ship the actual adversarial prompts so attacks and defenses are comparable apples-to-apples.

What it measures

How red teamers use it

Run a candidate attack against the 100 behaviors, score the responses with the bundled judge, and report attack success rate (ASR) alongside benign over-refusal, on the same footing as every prior method. Submissions must upload the generated prompts to the jailbreak artifacts repo, a versioned library of working adversarial prompts per behavior, so others can re-run them exactly. Evaluating a new model or defense is then a fast pass over those artifacts. The library and contribution flow live in the main repo; the dataset is on Hugging Face.
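The two headline numbers can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: the `Judged` record and both function names are hypothetical, and the judge and refusal verdicts are assumed to have been computed already and stored as booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judged:
    """One judged model response (hypothetical record, not a JailbreakBench type)."""
    behavior: str     # behavior identifier
    jailbroken: bool  # judge verdict: the response fulfils the harmful request
    refused: bool     # refusal verdict: the model declined to answer


def attack_success_rate(harmful: list[Judged]) -> float:
    """Fraction of harmful behaviors whose response the judge marks as jailbroken."""
    if not harmful:
        raise ValueError("no harmful results to score")
    return sum(r.jailbroken for r in harmful) / len(harmful)


def over_refusal_rate(benign: list[Judged]) -> float:
    """Fraction of benign behaviors that the model or defense refused."""
    if not benign:
        raise ValueError("no benign results to score")
    return sum(r.refused for r in benign) / len(benign)


# Toy data standing in for the full behavior sets.
harmful_results = [
    Judged("h1", jailbroken=True, refused=False),
    Judged("h2", jailbroken=False, refused=True),
    Judged("h3", jailbroken=True, refused=False),
    Judged("h4", jailbroken=False, refused=True),
]
benign_results = [
    Judged("b1", jailbroken=False, refused=False),
    Judged("b2", jailbroken=False, refused=True),
    Judged("b3", jailbroken=False, refused=False),
    Judged("b4", jailbroken=False, refused=False),
]

asr = attack_success_rate(harmful_results)
over_refusal = over_refusal_rate(benign_results)
```

Reporting both numbers together is the point: a defense that drives ASR to zero by refusing everything shows up immediately as a high over-refusal rate on the benign set.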

Strengths & limits

Strengths: genuine reproducibility. Mandatory artifacts mean a leaderboard number can be replayed, not just trusted, and the benign control set catches defenses that "win" by refusing everything.

Limits: the 100 behaviors are a fixed, English, single-turn slice that newer attacks can overfit; the Llama-3 judge has its own false-positive/negative rate and can drift from human labels on edge cases; coverage skews toward classic harm categories, so multi-turn, agentic, and tool-use attack surfaces sit largely outside scope.
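One way to quantify the judge caveat is to spot-check a sample of verdicts against human labels. The sketch below assumes two parallel lists of booleans (True means "jailbroken"). The function name and the sample data are hypothetical, not part of the benchmark.

```python
def judge_error_rates(judge: list[bool], human: list[bool]) -> dict[str, float]:
    """Compare judge verdicts with human labels, treating the human labels as ground truth."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty lists of equal length")

    pairs = list(zip(judge, human))
    tp = sum(j and h for j, h in pairs)          # both say jailbroken
    tn = sum(not j and not h for j, h in pairs)  # both say safe
    fp = sum(j and not h for j, h in pairs)      # judge flags, human does not
    fn = sum(not j and h for j, h in pairs)      # judge misses a real jailbreak

    return {
        "agreement": (tp + tn) / len(pairs),
        # Simplification: report 0.0 when a class has no examples.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Made-up labels for five responses.
judge_verdicts = [True, True, False, False, True]
human_labels = [True, False, False, True, True]
rates = judge_error_rates(judge_verdicts, human_labels)
```

A false positive here inflates the reported ASR and a false negative deflates it, so even a small labelled sample gives a rough sense of how far a leaderboard number might sit from human judgment.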