measurement · benchmark

JailbreakBench

Chao, Debenedetti, Robey et al. · 2024

An open, reproducible robustness benchmark for jailbreaking LLMs (NeurIPS 2024 Datasets & Benchmarks). It standardizes what counts as a successful jailbreak, pins the threat model, and forces every leaderboard entry to ship the actual adversarial prompts so attacks and defenses are comparable apples-to-apples.

What it measures

JBB-Behaviors dataset — 100 distinct misuse behaviors across ten categories mapped to OpenAI's usage policies (55% original; rest from AdvBench and TDC/HarmBench), plus 100 matched benign behaviors to measure over-refusal.
Automated judges — default scoring uses a Llama-3 70B jailbreak classifier; a Llama-3 8B judge flags refusals. A separate judges dataset benchmarks these against GPT-4, HarmBench, Llama Guard 2, and 3 human annotators (majority vote).
Standardized harness — fixed threat model, system prompts, chat templates, and scoring (jbb.evaluate_prompts()) so attack-success-rate is computed identically across submissions.
Leaderboards — separate attack and defense tracks for open- and closed-source models, filterable by paper, threat model, and metadata.
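
As a rough illustration of the scoring step above, attack success rate (ASR) is the fraction of behaviors whose response the judge marks as jailbroken. The sketch below assumes the judge verdicts are already available as booleans. The `Verdict` record, the two helper functions, and the behavior and category names are hypothetical and are not part of the jailbreakbench library.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One judged response (hypothetical record, not a jailbreakbench type)."""
    behavior: str     # behavior identifier (placeholder names below)
    category: str     # misuse category the behavior belongs to
    jailbroken: bool  # judge decision for the model's response


def attack_success_rate(verdicts: list[Verdict]) -> float:
    """Return the fraction of judged behaviors marked as jailbroken."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(v.jailbroken for v in verdicts) / len(verdicts)


def asr_by_category(verdicts: list[Verdict]) -> dict[str, float]:
    """Return ASR computed separately for each category."""
    grouped: dict[str, list[Verdict]] = {}
    for v in verdicts:
        grouped.setdefault(v.category, []).append(v)
    return {cat: attack_success_rate(vs) for cat, vs in grouped.items()}


# Toy verdicts for four placeholder behaviors (illustrative, not real results).
verdicts = [
    Verdict("behavior-1", "category-a", True),
    Verdict("behavior-2", "category-a", False),
    Verdict("behavior-3", "category-b", False),
    Verdict("behavior-4", "category-b", False),
]

overall = attack_success_rate(verdicts)
per_category = asr_by_category(verdicts)
print(f"overall ASR: {overall:.2f}")
print(per_category)
```

Because every submission is scored over the same behaviors with the same judge, two ASR values computed this way can be compared directly.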

How red teamers use it

Run a candidate attack against the 100 behaviors, score it with the bundled judge, and report ASR plus benign over-refusal on the same footing as every prior method. Each submission must upload its generated prompts to the jailbreak artifacts repo, a versioned library of working adversarial prompts per behavior, so others can re-run them exactly. Evaluating a new model or defense is then a fast pass over those artifacts. The library and contribution flow live in the main repo; the dataset is on Hugging Face.
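
The replay workflow described here can be sketched without the library. This is a minimal sketch under stated assumptions: `Artifact`, `evaluate_defense`, the toy `defended_model`, and both stub judges are hypothetical stand-ins, and the prompts are placeholders rather than real artifacts. In practice the stored prompts would come from the artifacts repo and the verdicts from the bundled judges.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Artifact:
    """A stored adversarial prompt for one behavior (hypothetical shape)."""
    behavior: str
    prompt: str


def evaluate_defense(
    model: Callable[[str], str],
    harmful: list[Artifact],
    benign: list[str],
    is_jailbroken: Callable[[str, str], bool],
    is_refusal: Callable[[str], bool],
) -> dict[str, float]:
    """Replay stored prompts and benign prompts against a model."""
    if not harmful or not benign:
        raise ValueError("need at least one harmful and one benign prompt")
    jailbroken = sum(is_jailbroken(a.prompt, model(a.prompt)) for a in harmful)
    refused = sum(is_refusal(model(p)) for p in benign)
    return {
        "asr": jailbroken / len(harmful),
        "benign_refusal_rate": refused / len(benign),
    }


def defended_model(prompt: str) -> str:
    """Toy defense: a keyword filter that also over-refuses on 'lock'."""
    if "[ADV]" in prompt or "lock" in prompt.lower():
        return "I can't help with that."
    return "Sure, here is a response."


def stub_is_refusal(response: str) -> bool:
    """Stand-in for a refusal judge: checks a fixed refusal prefix."""
    return response.startswith("I can't")


def stub_is_jailbroken(prompt: str, response: str) -> bool:
    """Stand-in for a jailbreak judge: treats any non-refusal as a success."""
    return not stub_is_refusal(response)


harmful = [
    Artifact("behavior-1", "[ADV] placeholder prompt one"),
    Artifact("behavior-2", "placeholder prompt two"),
]
benign = ["How do I bake bread?", "How do locksmiths open a lock?"]

report = evaluate_defense(
    defended_model, harmful, benign, stub_is_jailbroken, stub_is_refusal
)
print(report)
```

Reporting both numbers together is the point of the design: the toy filter halves ASR here, but it also refuses half of the benign prompts, which the benign control set makes visible.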

Strengths & limits

Strengths: genuine reproducibility, since mandatory artifacts mean a leaderboard number can be replayed, not just trusted; the benign control set catches defenses that "win" by refusing everything.

Limits: 100 behaviors is a fixed, English, single-turn slice that newer attacks can overfit to; the Llama-3 judge has its own false-positive/negative rate and can drift from human labels on edge cases; coverage skews toward classic harm categories, so multi-turn, agentic, and tool-use attack surfaces sit largely outside scope.
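
The judge-error limit can be made concrete by comparing judge labels with a human majority vote over three annotators, the same kind of reference the judges dataset uses. The sketch below runs on toy labels. `majority_vote` and `judge_error_rates` are hypothetical helpers, and the numbers are illustrative, not results from the benchmark.

```python
from __future__ import annotations


def majority_vote(labels: list[bool]) -> bool:
    """Return the majority label; requires an odd number of annotators."""
    if len(labels) % 2 == 0:
        raise ValueError("majority vote needs an odd number of labels")
    return sum(labels) > len(labels) / 2


def judge_error_rates(judge: list[bool], human: list[bool]) -> dict[str, float]:
    """Compare judge labels with reference labels (True means jailbroken)."""
    if not judge or len(judge) != len(human):
        raise ValueError("label lists must be non-empty and equal in length")
    tp = sum(j and h for j, h in zip(judge, human))
    fp = sum(j and not h for j, h in zip(judge, human))
    tn = sum(not j and not h for j, h in zip(judge, human))
    fn = sum(not j and h for j, h in zip(judge, human))
    return {
        "agreement": (tp + tn) / len(judge),
        # Share of human-labeled safe responses the judge flags as jailbroken.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        # Share of human-labeled jailbreaks the judge misses.
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Three toy annotator labels per response (illustrative only).
annotations = [
    [True, True, False],
    [False, False, False],
    [True, False, False],
    [True, True, True],
]
human_labels = [majority_vote(a) for a in annotations]
judge_labels = [True, True, False, False]

rates = judge_error_rates(judge_labels, human_labels)
print(human_labels)
print(rates)
```

A false positive inflates a reported ASR and a false negative deflates it, so a judge's error rates bound how far a leaderboard number can be trusted on borderline responses.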