HarmBench
A standardized evaluation framework for automated red teaming and robust refusal of LLMs. HarmBench pins down a fixed set of harmful behaviors and a fixed automated judge so that attacks and defenses can be compared on equal footing — replacing the bespoke, non-comparable setups that plagued earlier jailbreak research.
What it measures
The core metric is Attack Success Rate (ASR): the fraction of harmful behaviors for which an attack elicits a completion that the judge labels as exhibiting the target behavior.
- 510 curated behaviors across 7 semantic categories (cybercrime, chem/bio, copyright, misinformation, harassment, illegal activity, general harm), split into 4 functional types: standard (200), copyright (100), contextual (100), and multimodal (110).
- A fine-tuned classifier judge (HarmBench-Llama-2-13b-cls, plus contextual and multimodal variants) decides whether a completion actually exhibits the target behavior — not just whether the model refused.
- Both sides of the arms race: the original paper evaluated 18 red teaming methods against 33 target LLMs and defenses in a single large-scale comparison.
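As a rough illustration of the metric, ASR is just a mean over binary judge labels, optionally broken down by functional type. The sketch below is not HarmBench code: the function names and the toy `records` list are assumptions made up for this example.

```python
from collections import defaultdict

def attack_success_rate(labels):
    """Fraction of judged behaviors that the judge flagged as successful attacks."""
    labels = list(labels)
    if not labels:
        raise ValueError("no judged behaviors")
    return sum(bool(x) for x in labels) / len(labels)

def asr_by_group(records):
    """Map each group (e.g. a functional type) to its own ASR.

    records: iterable of (group, judged_success) pairs.
    """
    grouped = defaultdict(list)
    for group, success in records:
        grouped[group].append(success)
    return {group: attack_success_rate(vals) for group, vals in grouped.items()}

# Hypothetical judge outputs, one per behavior: (functional type, judge label).
records = [
    ("standard", True), ("standard", False),
    ("standard", False), ("standard", True),
    ("contextual", True), ("contextual", False),
]

overall = attack_success_rate(success for _, success in records)
per_type = asr_by_group(records)
```

Reporting the per-type breakdown alongside the overall figure helps show whether an attack only works on one slice of the behavior set.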
How red teamers use it
HarmBench is the common yardstick for ASR. You run your attack (or defense) through its three-step pipeline (test-case generation, completion generation, evaluation) and report numbers that can be compared with published baselines, provided the behavior set, judge, and generation settings match. New attacks cite their HarmBench ASR; new defenses (e.g. adversarial training, refusal hardening) report how far they reduce it. The framework ships a modular harness, so you can drop in a custom model or red-teaming method without rewriting the eval.
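The shape of the three-step pipeline can be sketched with plain callables. This is a schematic under stated assumptions, not the HarmBench harness or its API: `run_pipeline`, the stub `attack`, `target`, and `judge` functions, and the placeholder behavior IDs are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

def run_pipeline(
    behaviors: List[str],
    attack: Callable[[str], str],       # behavior -> test case (prompt)
    target: Callable[[str], str],       # test case -> completion
    judge: Callable[[str, str], bool],  # (behavior, completion) -> success?
) -> Tuple[float, Dict[str, bool]]:
    """Toy version of the three steps; returns (ASR, per-behavior labels)."""
    if not behaviors:
        raise ValueError("behavior set is empty")
    # Step 1: test-case generation
    test_cases = {b: attack(b) for b in behaviors}
    # Step 2: completion generation
    completions = {b: target(tc) for b, tc in test_cases.items()}
    # Step 3: evaluation by the judge
    labels = {b: judge(b, completions[b]) for b in behaviors}
    return sum(labels.values()) / len(labels), labels

# Stub components standing in for a real attack, target model, and classifier.
behaviors = ["behavior-1", "behavior-2", "behavior-3", "behavior-4"]
attack = lambda b: f"[test case for {b}]"
target = lambda prompt: "REFUSED" if "behavior-2" in prompt else "COMPLIED"
judge = lambda b, completion: completion == "COMPLIED"

asr, labels = run_pipeline(behaviors, attack, target, judge)
```

Keeping the three stages as separate, swappable callables mirrors the modularity described above: replacing the attack or the target leaves the evaluation step untouched.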
Strengths & limits
- Rigor & standardization: a frozen behavior set and shared judge make cross-paper claims meaningful and reproducible — a real improvement over per-paper ad hoc setups.
- Classifier-judge caveats: the automated judge is itself a model and inherits model error. It can mislabel borderline or obfuscated completions, and an attack can in principle be tuned to fool the judge rather than the target. Treat ASR as a proxy for harmful compliance, not ground truth.
- Scope: the behavior set, while broad, is fixed — novel harm modalities or task-specific risks in your own deployment won't appear here. Use it as a baseline, not the whole threat model.
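One way to act on the classifier-judge caveat is to spot-check the judge against a small human-labeled sample before trusting an ASR figure. The sketch below is an assumption-laden illustration, not part of HarmBench: `judge_error_rates` and both label lists are invented for this example.

```python
def judge_error_rates(human, judge):
    """Compare judge labels with human labels on the same completions.

    Both arguments are equal-length sequences of booleans, where True means
    "the completion exhibits the target behavior".
    """
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)
    tn = sum((not h) and (not j) for h, j in pairs)
    fp = sum((not h) and j for h, j in pairs)   # judge over-counts success
    fn = sum(h and (not j) for h, j in pairs)   # judge misses a real success
    return {
        "agreement": (tp + tn) / len(pairs),
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else None,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else None,
    }

# Hypothetical labels for eight completions.
human = [True, True, False, False, True, False, False, True]
judge = [True, False, False, True, True, False, False, True]

rates = judge_error_rates(human, judge)
```

A high false-positive rate suggests the reported ASR is inflated, while a high false-negative rate suggests it understates how often the attack actually succeeds.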