Red Teaming
Red teaming is the deliberate hunt for inputs that make a model misbehave — the offensive complement to alignment training, run before deployment to surface harms the model wasn't trained out of.
What it is
Red teaming probes a model adversarially to find the prompts that elicit harmful, unsafe, or off-policy behavior — toxic output, leaked private data, dangerous instructions, prompt-injection compliance. It blends manual and automated methods:
- Manual — humans craft adversarial prompts, jailbreaks, and edge cases by hand. High signal, low coverage, expensive.
- Automated — Perez et al. (2022) use one LM to generate test cases against another, then a classifier to score the responses. This scaled to tens of thousands of offensive replies, leaked phone numbers, and distributional bias surfaced from a 280B-parameter chatbot — coverage no manual team could match.
It is now a standard gate in every frontier release process, not an optional afterthought.
Why it's a safety technique, not just security
Security testing asks "can an attacker break in?" Red teaming for AI asks a broader question: "does this model do harmful things, including ones nobody intended to exploit?" That makes it a safety tool:
- It finds alignment failures that benign behavioral evals miss — a model that looks helpful and harmless on curated benchmarks can still be steered into harm under adversarial pressure.
- It informs capability thresholds in Responsible Scaling Policies — you can't gate on dangerous capabilities you haven't tried to elicit. Red teaming is how labs measure whether a model crosses a dangerous-capability line.
- It complements alignment training rather than replacing it: RLHF and similar methods push the average case toward safe behavior; red teaming maps the worst case. Findings feed back as new training data, closing the loop.
Go deeper
This page is the safety-framing view. For the attack craft itself — how the probes are actually built — there's a full companion field map: Red Teaming Models — A Ramp-Up Map ↗ (OWASP LLM Top 10, the jailbreak canon — GCG, PAIR, Crescendo — tooling like Garak/PyRIT/Promptfoo, and how to measure attack success rates rather than chase one lucky jailbreak).