AI SAFETY // APPROACHES
← back to the map
alignment approachesadversarial

Red Teaming

adversarial testing for safety

Red teaming is the deliberate hunt for inputs that make a model misbehave — the offensive complement to alignment training, run before deployment to surface harms the model wasn't trained out of.

What it is

Red teaming probes a model adversarially to find the prompts that elicit harmful, unsafe, or off-policy behavior — toxic output, leaked private data, dangerous instructions, prompt-injection compliance. It blends manual and automated methods:

It is now a standard gate in every frontier release process, not an optional afterthought.

Why it's a safety technique, not just security

Security testing asks "can an attacker break in?" Red teaming for AI asks a broader question: "does this model do harmful things, including ones nobody intended to exploit?" That makes it a safety tool:

Go deeper

This page is the safety-framing view. For the attack craft itself — how the probes are actually built — there's a full companion field map: Red Teaming Models — A Ramp-Up Map ↗ (OWASP LLM Top 10, the jailbreak canon — GCG, PAIR, Crescendo — tooling like Garak/PyRIT/Promptfoo, and how to measure attack success rates rather than chase one lucky jailbreak).