Red Teaming Models — A Ramp-Up Map

00 the mental-model shift

What your security instincts get right — and where they break

Your offensive background is a real head start: threat modeling, adversarial framing, abuse-case thinking, and writing a report a defender will actually act on. The gap is the target itself. Traditional security tests deterministic software; the same input gives the same output, and a vuln is binary. Models are probabilistic — the same prompt can pass nine times and fail the tenth — so findings are attack success rates, not yes/no.
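A small worked calculation shows why a single attempt proves little. It assumes, hypothetically, that each trial is independent and succeeds with a fixed per-trial rate; real models drift with sampling settings and context, so treat this as a sketch rather than a model of any particular system. The function names are illustrative.

```python
import math


def p_at_least_one_success(per_trial_rate: float, trials: int) -> float:
    """Chance that at least one of `trials` independent attempts succeeds."""
    return 1.0 - (1.0 - per_trial_rate) ** trials


def trials_for_detection(per_trial_rate: float, target: float = 0.95) -> int:
    """Smallest number of independent trials that reaches `target` detection probability."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - per_trial_rate))


if __name__ == "__main__":
    # A prompt that breaks the model one time in ten:
    print(round(p_at_least_one_success(0.10, 1), 3))   # 0.1   (one try usually misses it)
    print(round(p_at_least_one_success(0.10, 10), 3))  # 0.651 (ten tries often still miss it)
    print(trials_for_detection(0.10))                  # 29 trials for a 95% chance of seeing it once
```

Under these assumptions, a "clean" single run says very little about a 10% failure mode, which is why the unit of evidence is a rate over many trials.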

▲ Classic offensive security

target: Code, configs, infra, people
bug: Deterministic; reproduces every time
surface: Memory, auth, network, logic flaws
proof: One working exploit = done
fix: Patch the line of code

▼ Red teaming a model

target: Learned behavior under adversarial pressure
bug: Statistical; measured as a success rate
surface: Prompts, context, tools, training data, policy
proof: ASR across many trials + harm severity
fix: Retrain, RLHF, classifiers, guardrails, scoping

The new literacy you have to build: enough ML internals to reason about why a model breaks (scaling laws, tokenization, context windows, alignment training, refusal behavior), and fluency in a harm taxonomy — CBRN, cyber, CSAM boundaries, PII, persuasion. Unlike CVE work, half the job is policy judgment about what counts as a finding.

01 the attack surface

Learn the OWASP Top 10 for LLMs (2025) cold

This is the shared vocabulary every team, report, and tool maps to. The 2025 revision pulled in RAG, agents, and supply-chain risk. Memorize it the way you know the web Top 10; it's how you scope an engagement.

LLM01

Prompt Injection

Direct and indirect — untrusted content (a web page, a doc, a tool result) overrides instructions. The canonical, highest-frequency class.

LLM02

Sensitive Information Disclosure

Leaking PII, secrets, proprietary data, or training data through model outputs.

LLM03

Supply Chain

Compromised base models, poisoned datasets, malicious LoRA/PEFT adapters, untrusted weights.

LLM04

Data & Model Poisoning

Tampering with training/fine-tuning data to plant backdoors or skew behavior.

LLM05

Improper Output Handling

Treating model output as trusted — XSS, SSRF, SQLi, code-exec when output flows into downstream systems.

LLM06

Excessive Agency

An agent with too much permission, autonomy, or tool access does damage. The fastest-growing category in 2026.

LLM07

System Prompt Leakage

Extracting the hidden system prompt — often exposing routing logic, keys, or guardrail design.

LLM08

Vector & Embedding Weaknesses

Attacks on RAG: poisoned vector stores, embedding inversion, retrieval manipulation.

LLM09

Misinformation

Confident fabrication and unsafe over-reliance on outputs in consequential settings.

LLM10

Unbounded Consumption

Denial-of-service and "denial-of-wallet" — runaway cost, model extraction via mass querying.
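To make one entry concrete, here is a minimal sketch of the defensive side of LLM05 (Improper Output Handling): treat model output like any other untrusted input before it reaches a browser or an outbound fetch. The allow-listed host `docs.example.com` and both function names are hypothetical, and a real application would need context-specific encoding and a stricter URL policy than this.

```python
import html
import re

# Hypothetical allow-list: only this docs host may be fetched on the model's say-so.
ALLOWED_URL = re.compile(r"https://docs\.example\.com/[\w\-/]*")


def render_model_output(text: str) -> str:
    """Escape model output before placing it in HTML, as with any user-supplied string."""
    return "<p>" + html.escape(text, quote=True) + "</p>"


def is_safe_fetch_target(url: str) -> bool:
    """Reject any model-proposed URL that is not an exact allow-list match (SSRF guard)."""
    return ALLOWED_URL.fullmatch(url) is not None


if __name__ == "__main__":
    print(render_model_output("<script>alert(1)</script>"))
    print(is_safe_fetch_target("https://docs.example.com/guide/setup"))
    print(is_safe_fetch_target("http://169.254.169.254/latest/meta-data/"))
```

When testing, the red-team question is the inverse: find every sink where output like this reaches a renderer, shell, query, or HTTP client without an equivalent check.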

Next surface to track: the OWASP Top 10 for Agentic Applications (published Dec 2025). Agent hijacking, tool misuse, and multi-agent attacks are where the field — and the hiring — is moving.

02 the attack canon

Read the papers everyone cites

Jailbreak research has a small, knowable canon. Know these by name, what they exploit, and whether they need model internals (white-box) or only API access (black-box). Practitioners increasingly chain them.

DAN / role-play · 2023 · manual

Assign a persona ("Do Anything Now," AIM) that explicitly ignores safety rules. The origin of the genre — still a useful baseline.

GCG · white-box · Zou et al. 2023

Greedy Coordinate Gradient. Optimizes an adversarial token suffix using gradients to force unsafe completions. The foundational automated, transferable attack.

AutoDAN · white-box · Liu et al. 2024

Genetic-algorithm attack seeded from handcrafted DAN prompts — produces fluent, readable jailbreaks instead of gibberish suffixes.

PAIR · black-box · Chao et al. 2023

Prompt Automatic Iterative Refinement. An attacker LLM + judge LLM iteratively rewrite a prompt until the target complies. No model access needed — the template for automated red teaming.

TAP · black-box · Mehrotra et al. 2024

Tree of Attacks with Pruning. PAIR with tree search and branch pruning — broader, more efficient exploration.

Crescendo · black-box · Russinovich et al. 2024

Multi-turn escalation. Benign opening, then steer over 5–20 turns toward the target. Defeats per-message filters that judge turns in isolation — the must-know multi-turn method.

Many-shot · black-box · Anil et al. 2024

Flood a long context with dozens to hundreds of harmful Q&A examples to progressively override safety training. A direct consequence of long context windows.

PAP · black-box · 2024

Persuasive Adversarial Prompts. Apply human persuasion techniques (authority, reciprocity, framing) rather than technical tricks.
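The attacker-plus-judge loop described under PAIR can be sketched as plain control flow. This is not the paper's implementation: the three callables are hypothetical stand-ins for LLM calls, the 1-to-10 judge scale and the stopping threshold are assumptions, and the demo stubs use a harmless goal so the loop can run offline.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    prompt: str
    response: str
    score: int  # assumed judge scale: 1 = full refusal, 10 = full compliance


def iterative_refinement(
    goal: str,
    attacker: Callable[[str, list], str],
    target: Callable[[str], str],
    judge: Callable[[str, str, str], int],
    max_iters: int = 5,
    success_score: int = 10,
) -> tuple:
    """Rewrite the prompt until the judge scores the target's reply as a success."""
    history: list = []
    for _ in range(max_iters):
        prompt = attacker(goal, history)        # attacker sees every prior attempt
        response = target(prompt)               # black-box call: text in, text out
        score = judge(goal, prompt, response)   # judge decides pass/fail
        attempt = Attempt(prompt, response, score)
        history.append(attempt)
        if score >= success_score:
            return attempt, history
    return None, history


# Offline stubs (hypothetical): the "target" only complies on the third rewrite.
def demo_attacker(goal: str, history: list) -> str:
    return f"[rewrite {len(history) + 1}] {goal}"


def demo_target(prompt: str) -> str:
    return "OK: pineapple" if prompt.startswith("[rewrite 3]") else "I can't help with that."


def demo_judge(goal: str, prompt: str, response: str) -> int:
    return 10 if response.startswith("OK") else 1


if __name__ == "__main__":
    winner, history = iterative_refinement(
        "say the word pineapple", demo_attacker, demo_target, demo_judge
    )
    print(len(history), winner.score if winner else None)
```

Keeping the full history, not only the winning attempt, is a deliberate choice: the number of iterations to success is itself a measurement worth reporting.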

Also keep a running file on the perennial primitives that show up inside these: encoding/obfuscation (base64, leetspeak, low-resource languages), prefix injection / refusal suppression, payload splitting, and glitch tokens. The Awesome-Jailbreak-on-LLMs repo is the best continuously-updated index.
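Several of these primitives are simple text transforms, which is why tools chain them. The sketch below shows base64 encoding, a leetspeak substitution, and payload splitting as composable functions. The names and the substitution table are illustrative assumptions, not the API of any particular tool, and the example input is deliberately benign.

```python
import base64
from typing import Callable

Converter = Callable[[str], str]

# Illustrative substitution table; real probe sets vary.
LEET = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"})


def to_base64(text: str) -> str:
    """Encode text as base64 so keyword filters no longer see the plain string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def to_leetspeak(text: str) -> str:
    """Lowercase the text and swap common letters for look-alike digits."""
    return text.lower().translate(LEET)


def split_payload(text: str, parts: int = 2) -> list:
    """Split text into at most `parts` consecutive chunks (payload splitting)."""
    if not text:
        return []
    size = -(-len(text) // parts)  # ceiling division
    return [text[i:i + size] for i in range(0, len(text), size)]


def chain(*converters: Converter) -> Converter:
    """Compose converters left to right into a single converter."""
    def run(text: str) -> str:
        for convert in converters:
            text = convert(text)
        return text
    return run


if __name__ == "__main__":
    print(to_leetspeak("test probe"))
    print(chain(to_leetspeak, to_base64)("test probe"))
    print(split_payload("test probe", 2))
```

In a test harness, each converter becomes one axis of a sweep: the same probe is sent raw and through each chain, and ASR is compared across variants.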

03 measurement

A finding is a number, not an anecdote

The single biggest habit shift from pentesting: one lucky jailbreak isn't a result. You report attack success rate (ASR) across many trials, per harm category, with a defined judge (an LLM grader or classifier) deciding pass/fail. Define what "failure" means before you run anything: a customer-support bot and a code tool have completely different threat models.

AdvBench

520 harmful instructions; the original robustness baseline (shipped with GCG).

HarmBench

The standardized framework for automated red teaming and robust-refusal evaluation. The common yardstick.

JailbreakBench

Open robustness benchmark with behaviors, human labels, and automated judges — good for reproducible comparisons.

Round it out with DecodingTrust and TrustLLM for broader trustworthiness dimensions, and AgentDojo (ETH Zurich) for agent-hijacking test cases. The skill to build here is statistical literacy: confidence on ASR, controlling for prompt variance, and not over-claiming from a small sample.
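One way to put a confidence range on ASR is the Wilson score interval, which behaves better than the naive normal approximation at small sample sizes and at rates near 0 or 1. The sketch below assumes each trial has already been labeled by your judge as a success or failure; the category names and the record format are hypothetical.

```python
import math
from collections import defaultdict
from typing import Iterable


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def asr_by_category(results: Iterable) -> dict:
    """Aggregate (category, attack_succeeded) records into ASR with a 95% interval."""
    counts = defaultdict(lambda: [0, 0])  # category -> [successes, trials]
    for category, attack_succeeded in results:
        counts[category][0] += int(attack_succeeded)
        counts[category][1] += 1
    report = {}
    for category, (wins, n) in counts.items():
        low, high = wilson_interval(wins, n)
        report[category] = {"asr": wins / n, "n": n, "ci95": (low, high)}
    return report


if __name__ == "__main__":
    # Hypothetical judged results: 1 success in 10 trials vs 100 in 1000.
    small = [("cyber", i == 0) for i in range(10)]
    large = [("pii", i < 100) for i in range(1000)]
    for category, row in asr_by_category(small + large).items():
        low, high = row["ci95"]
        print(f"{category}: ASR={row['asr']:.2f} n={row['n']} 95% CI=({low:.3f}, {high:.3f})")
```

Both categories show the same 10% point estimate, but the ten-trial interval spans roughly 2% to 40%, which is the over-claiming risk in a small sample.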

04 frameworks & governance

The taxonomies that structure real programs

Hands-on skill gets you findings; frameworks get you hired and let you write reports leadership trusts. These are the four that show up in nearly every job description.

MITRE ATLAS

The ATT&CK-style adversarial-tactics knowledge base for ML systems. The shared language for threat modeling AI — learn it the way you know ATT&CK.

NIST AI RMF

The risk-management framework (+ Generative AI profile). The governance backbone for US programs and most enterprise risk language.

OWASP GenAI / Red Teaming Guide

Beyond the Top 10: OWASP's guide on how to actually structure and evaluate an AI red-team engagement.

EU AI Act

Mandates adversarial testing for high-risk systems; full compliance lands August 2026. The regulatory driver behind much of current demand.

Read how the labs do it in the open: Microsoft's Lessons from Red Teaming 100 Generative AI Products (Jan 2025), and Anthropic's and OpenAI's published red-team reports for each major model release. They're the best free templates for what a serious finding and write-up look like.

05 the toolchain

Get hands-on — no single tool covers the surface

Run all three offensive tools on a model this week; you learn the surface faster by attacking than by reading. Each has a real weakness, so the pros and cons matter here.

Garak · NVIDIA · open source

+ Broad static probe library; ideal as a CI vulnerability scanner on every model release.

− Static probes; weak on adaptive multi-turn and app-specific logic.

PyRIT · Microsoft · open source

+ Orchestrator/attacker-LLM model; strong at adaptive multi-turn (Crescendo, TAP) and converter chains.

− Two models per session gets expensive and slow at scale; steeper learning curve.

Promptfoo · open source · now OpenAI

+ 50+ plugins, OWASP/MITRE mapping, clean CI/CD regression workflow.

− Breadth over depth; off-the-shelf probes still need custom suites for your app.

Defender's side · blue-team layer

+ LLM Guard (ProtectAI), NeMo Guardrails (NVIDIA), Guardrails AI — learn what you're trying to defeat.

− Guardrails create false confidence; adaptive attacks routinely bypass them.

The honest take: automated tools find the generic stuff. The findings that get you paid and hired come from manual, creative, app-specific testing — custom probes against the real system prompt, tools, and data sources. Tools are your scanner; you're still the exploit dev.

06 the career

Where the work — and the money — actually is

Three doors, hardest to easiest to walk through. The field rewards demonstrated skill over credentials: CTF/arena rankings, public write-ups, and OSS contributions carry more weight than a resume line.

$130–300/hr

Contractor rates — generalist to specialty (chem/bio/cyber).

$200–200k+

Per bug-bounty finding. Wide variance; unreliable as primary income.

$40k–300k+

Gray Swan Arena prize pools per challenge; top finishers get invited to paid private networks.

DOOR 1 · HARDEST

In-house at a lab

Red Team Engineer roles at Anthropic, OpenAI, Google DeepMind, plus Microsoft's interdisciplinary AI Red Team. Small pools, high bar.

DOOR 2 · MOST COMMON

Specialty consultancy

Apollo Research, METR, Trail of Bits, Lakera, HiddenLayer, Mindgard. The realistic path for working contractors.

DOOR 3 · OPEN TO ALL

Bounties & arenas

Anthropic (now on HackerOne, no NDA), OpenAI ($100k max), xAI/Grok, Google, Microsoft, Mozilla 0din, Gray Swan Arena, HackAPrompt.

Your specific read — security expert + EM

transfers cleanly: Threat modeling, attack-tree thinking, and high-quality report writing are exactly what separates good red teamers from great ones. As an EM you can also build and run a red-team function and the toolchain pipeline; that's a senior/lead profile, not entry-level. Interdisciplinary teams now explicitly value adversarial creativity over pure credentials.

honest gaps: You need real ML literacy (not just API use), public proof of skill, and tolerance for a noisy, hype-heavy field where titles and "courses" are inconsistent. Bounties won't pay the mortgage. The fastest credible signal: a Gray Swan Arena placement plus two or three sharp public write-ups. That's a portfolio a hiring manager believes.

07 your ramp-up

A sequenced 90-day plan

Ordered because it's a real dependency chain: build the mental model, then attack, then measure, then publish.

Phase 1 · Mental model · weeks 1–2

Internalize the OWASP LLM Top 10 (2025) and skim the Agentic Top 10.
Scope every future engagement against this list.

Get just-enough ML internals: scaling laws, tokenization, context windows, RLHF/alignment, why refusals happen.
Enough to reason about why a model breaks, not to train one.

Read Microsoft's "Lessons from Red Teaming 100 GenAI Products" + one Anthropic/OpenAI red-team report.
Learn what a real finding and write-up look like.

Phase 2 · Attack by hand · weeks 3–5

Play Gandalf (Lakera) and HackAPrompt end-to-end; keep a notebook of what worked.
Builds intuition for prompt-injection primitives fast.

Read the canon: GCG, PAIR, TAP, Crescendo, many-shot. Write one paragraph of notes on each.
Know each by mechanism and white-box vs black-box.

Reproduce a multi-turn Crescendo attack by hand against an open model.
The must-know technique; do it manually before automating.

Phase 3 · Automate & measure · weeks 6–8

Run Garak, PyRIT, and Promptfoo against the same target; compare coverage.
Use uv for the Python envs.

Run HarmBench or JailbreakBench; report ASR per category with a defined judge.
Shift from anecdote to statistic.

Write one custom PyRIT converter or Garak probe for an app-specific threat.
Custom tooling is what the lead roles actually do.

Phase 4 · Build proof · weeks 9–12

Enter a Gray Swan Arena season; aim for a ranked placement.
The single strongest portable credential in the field.

Publish 2–3 write-ups on idvork.in: a reproduced attack, a tool comparison, a methodology.
Hiring managers believe public artifacts over resumes.

Submit to Anthropic's HackerOne program or OpenAI's bounty; chase a paid finding.
Closes the loop: a real, cited result.

references · current as of June 2026

Where this came from

01 OWASP GenAI Security Project: Top 10 for LLMs 2025
02 Awesome-Jailbreak-on-LLMs (running paper index)
03 Crescendo: Multi-Turn LLM Jailbreak (Russinovich et al.)
04 Garak: NVIDIA LLM vulnerability scanner
05 PyRIT: Microsoft AI Red Team toolkit
06 Promptfoo: LLM red-team guide
07 MITRE ATLAS: adversarial ML knowledge base
08 NIST AI Risk Management Framework
09 MS: Lessons from Red Teaming 100 GenAI Products
10 Gray Swan Arena: red-team competitions
11 AI bug-bounty programs 2026 (living reference)
12 Gandalf: Lakera prompt-injection game

It's not a pentest. It's adversarial elicitation.
