measurement · benchmark

HarmBench

Center for AI Safety (Mazeika et al.) · 2024

A standardized evaluation framework for automated red teaming and robust refusal of LLMs. HarmBench pins down a fixed set of harmful behaviors and a fixed automated judge so that attacks and defenses can be compared on equal footing — replacing the bespoke, non-comparable setups that plagued earlier jailbreak research.

What it measures

The core metric is Attack Success Rate (ASR): the fraction of harmful behaviors for which an attack elicits a compliant, on-target completion from the model.
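As a rough illustration of that definition, ASR can be computed as a simple ratio once a judge has labeled each behavior. This is a minimal sketch, not HarmBench's own code: the function name `attack_success_rate` and the behavior IDs are hypothetical, and it assumes one boolean verdict per behavior.

```python
from typing import Mapping


def attack_success_rate(verdicts: Mapping[str, bool]) -> float:
    """Return the fraction of behaviors whose verdict is True.

    `verdicts` maps a behavior ID to the judge's decision: True if the
    attack elicited a compliant, on-target completion, False otherwise.
    (Hypothetical data shape, assumed for illustration.)
    """
    if not verdicts:
        raise ValueError("no behaviors were evaluated")
    return sum(verdicts.values()) / len(verdicts)


# Made-up verdicts for four placeholder behaviors.
example_verdicts = {
    "behavior_a": True,
    "behavior_b": False,
    "behavior_c": False,
    "behavior_d": True,
}

print(f"ASR = {attack_success_rate(example_verdicts):.2%}")  # ASR = 50.00%
```

A lower ASR means the model refused or deflected more of the behaviors, so attacks aim to raise this number and defenses aim to lower it.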

How red teamers use it

HarmBench is the common yardstick for ASR. You run your attack (or defense) through its three-step pipeline (test-case generation, completion generation, evaluation) and report numbers that are directly comparable to published baselines. New attacks cite their HarmBench ASR; new defenses (e.g. adversarial training, refusal hardening) report how far they push that ASR down. It ships a modular harness, so you can drop in a custom model or red-teaming method without rewriting the eval.
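The three-step shape of the pipeline can be sketched in a few lines. Everything here is a hypothetical stand-in rather than HarmBench's actual interface: the `Attack`, `Model`, and `Judge` aliases, the `run_pipeline` function, and the toy components are assumptions made only to show how a fixed behavior set and a fixed judge allow the attack or model to be swapped independently.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces (NOT HarmBench's real API).
Attack = Callable[[str], str]        # behavior -> test case (adversarial prompt)
Model = Callable[[str], str]         # prompt -> completion
Judge = Callable[[str, str], bool]   # (behavior, completion) -> attack succeeded?


@dataclass
class Result:
    behavior: str
    test_case: str
    completion: str
    success: bool


def run_pipeline(behaviors: List[str], attack: Attack, model: Model,
                 judge: Judge) -> List[Result]:
    # Step 1: test-case generation (the attack rewrites each behavior).
    test_cases = [attack(b) for b in behaviors]
    # Step 2: completion generation (the target model answers each test case).
    completions = [model(tc) for tc in test_cases]
    # Step 3: evaluation (the fixed judge labels each completion).
    return [
        Result(b, tc, c, judge(b, c))
        for b, tc, c in zip(behaviors, test_cases, completions)
    ]


# Toy stand-ins so the sketch runs end to end; real components would be an
# attack method, an LLM, and a classifier-based judge.
def toy_attack(behavior: str) -> str:
    return f"{behavior} [adversarial suffix]"


def toy_model(prompt: str) -> str:
    if "behavior-2" in prompt:
        return "I can't help with that."
    return f"Sure, here is a response to: {prompt}"


def toy_judge(behavior: str, completion: str) -> bool:
    return completion.startswith("Sure")


results = run_pipeline(["behavior-1", "behavior-2"], toy_attack, toy_model, toy_judge)
asr = sum(r.success for r in results) / len(results)
print(f"ASR = {asr:.0%}")  # ASR = 50%
```

Because each stage only depends on the previous stage's output, trying a different model or attack in this sketch means passing a different callable to `run_pipeline`; the behaviors and the judge stay untouched, which is what keeps the resulting numbers comparable.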

Strengths & limits