There's no shell to pop and no service to enumerate. You probe a probabilistic system in natural language until it does something it was trained not to — then you measure how often, characterize why, and hand the blue team something they can fix.
attacker@target:~$map the surface · learn the canon · measure the rate · ship the report
What your security instincts get right — and where they break
Your offensive background is a real head start: threat modeling, adversarial framing, abuse-case thinking, and writing a report a defender will actually act on. The gap is the target itself. Traditional security tests deterministic software; the same input gives the same output, and a vuln is binary. Models are probabilistic — the same prompt can pass nine times and fail the tenth — so findings are attack success rates, not yes/no.
▲ Classic offensive security
targetCode, configs, infra, people
bugDeterministic; reproduces every time
surfaceMemory, auth, network, logic flaws
proofOne working exploit = done
fixPatch the line of code
▼ Red teaming a model
targetLearned behavior under adversarial pressure
bugStatistical; measured as a success rate
surfacePrompts, context, tools, training data, policy
The new literacy you have to build: enough ML internals to reason about why a model breaks (scaling laws, tokenization, context windows, alignment training, refusal behavior), and fluency in a harm taxonomy — CBRN, cyber, CSAM boundaries, PII, persuasion. Unlike CVE work, half the job is policy judgment about what counts as a finding.
This is the shared vocabulary every team, report, and tool maps to. The 2025 revision pulled in RAG, agents, and supply-chain risk. Memorize it the way you know the web Top 10 — it's how you scope an engagement. Each risk links to a one-page explainer.
Denial-of-service and "denial-of-wallet" — runaway cost, model extraction via mass querying.
Next surface to track: the OWASP Top 10 for Agentic Applications (published Dec 2025). Agent hijacking, tool misuse, and multi-agent attacks are where the field — and the hiring — is moving.
02 the attack canon
Read the papers everyone cites
Jailbreak research has a small, knowable canon. Know these by name, what they exploit, and whether they need model internals (white-box) or only API access (black-box). Practitioners increasingly chain them. Each name links to a one-page explainer.
DAN / role-play2023 · manual
Assign a persona ("Do Anything Now," AIM) that explicitly ignores safety rules. The origin of the genre — still a useful baseline.
Greedy Coordinate Gradient. Optimizes an adversarial token suffix using gradients to force unsafe completions. The foundational automated, transferable attack.
Prompt Automatic Iterative Refinement. An attacker LLM + judge LLM iteratively rewrite a prompt until the target complies. No model access needed — the template for automated red teaming.
Multi-turn escalation. Benign opening, then steer over 5–20 turns toward the target. Defeats per-message filters that judge turns in isolation — the must-know multi-turn method.
Persuasive Adversarial Prompts. Apply human persuasion techniques (authority, reciprocity, framing) rather than technical tricks.
Also keep a running file on the perennial primitives that show up inside these: encoding/obfuscation (base64, leetspeak, low-resource languages), prefix injection / refusal suppression, payload splitting, and glitch tokens. The Awesome-Jailbreak-on-LLMs repo is the best continuously-updated index.
03 measurement
A finding is a number, not an anecdote
The single biggest habit shift from pentesting: one lucky jailbreak isn't a result. You report attack success rate (ASR) across many trials, per harm category, with a defined judge (an LLM grader or classifier) deciding pass/fail. Define what "failure" means before you run anything — a customer-support bot and a code tool have completely different threat models. Each benchmark links to a one-page explainer.
Open robustness benchmark with behaviors, human labels, and automated judges — good for reproducible comparisons.
Round it out with DecodingTrust and TrustLLM for broader trustworthiness dimensions, and AgentDojo (ETH Zurich) for agent-hijacking test cases. The skill to build here is statistical literacy: confidence on ASR, controlling for prompt variance, and not over-claiming from a small sample.
04 frameworks & governance
The taxonomies that structure real programs
Hands-on skill gets you findings; frameworks get you hired and let you write reports leadership trusts. These are the four that show up in nearly every job description.
Mandates adversarial testing for high-risk systems; full compliance lands August 2026. The regulatory driver behind much of current demand.
Read how the labs do it in the open: Microsoft's Lessons from Red Teaming 100 Generative AI Products (Jan 2025), and Anthropic's and OpenAI's published red-team reports for each major model release. They're the best free templates for what a serious finding and write-up look like.
05 the toolchain
Get hands-on — no single tool covers the surface
Run all three offensive tools on a model this week; you learn the surface faster by attacking than by reading. Each has a real weakness — pros and cons matter here. Each tool links to a one-page explainer.
The honest take: automated tools find the generic stuff. The findings that get you paid and hired come from manual, creative, app-specific testing — custom probes against the real system prompt, tools, and data sources. Tools are your scanner; you're still the exploit dev.
06 the career
Where the work — and the money — actually is
Three doors, hardest to easiest to walk through. The field rewards demonstrated skill over credentials: CTF/arena rankings, public write-ups, and OSS contributions carry more weight than a resume line.
$130–300/hr
Contractor rates — generalist to specialty (chem/bio/cyber).
$200–200k+
Per bug-bounty finding. Wide variance; unreliable as primary income.
$40k–300k+
Gray Swan Arena prize pools per challenge; top finishers get invited to paid private networks.
DOOR 1 · HARDEST
In-house at a lab
Red Team Engineer roles at Anthropic, OpenAI, Google DeepMind, plus Microsoft's interdisciplinary AI Red Team. Small pools, high bar.
transfers cleanly Threat modeling, attack-tree thinking, and high-quality report writing are exactly what separates good red teamers from great ones. As an EM you can also build and run a red-team function and the toolchain pipeline — that's a senior/lead profile, not entry-level. Interdisciplinary teams now explicitly value adversarial creativity over pure credentials.
honest gaps You need real ML literacy (not just API use), public proof of skill, and tolerance for a noisy, hype-heavy field where titles and "courses" are inconsistent. Bounties won't pay the mortgage. The fastest credible signal: a Gray Swan Arena placement plus two or three sharp public write-ups — that's a portfolio a hiring manager believes.
07 your ramp-up
A sequenced 90-day plan
Ordered because it's a real dependency chain: build the mental model, then attack, then measure, then publish. Check items off — your progress is saved on this device.
RAMP_STATUS
0%
Phase 1 — Mental model weeks 1–2
Internalize the OWASP LLM Top 10 (2025) and skim the Agentic Top 10Scope every future engagement against this list.
Get just-enough ML internals: scaling laws, tokenization, context windows, RLHF/alignment, why refusals happenEnough to reason about why a model breaks, not to train one.
Read Microsoft's "Lessons from Red Teaming 100 GenAI Products" + one Anthropic/OpenAI red-team reportLearn what a real finding and write-up look like.
Phase 2 — Attack by hand weeks 3–5
Play Gandalf (Lakera) and HackAPrompt end-to-end; keep a notebook of what workedBuilds intuition for prompt-injection primitives fast.
Read the canon: GCG, PAIR, TAP, Crescendo, many-shot — one paragraph of notes eachKnow each by mechanism and white-box vs black-box.
Reproduce a multi-turn Crescendo attack by hand against an open modelThe must-know technique; do it manually before automating.
Phase 3 — Automate & measure weeks 6–8
Run Garak, PyRIT, and Promptfoo against the same target; compare coverageUse uv for the Python envs.
Run HarmBench or JailbreakBench; report ASR per category with a defined judgeShift from anecdote to statistic.
Write one custom PyRIT plugin or Garak probe for an app-specific threatCustom tooling is what the lead roles actually do.
Phase 4 — Build proof weeks 9–12
Enter a Gray Swan Arena season; aim for a ranked placementThe single strongest portable credential in the field.
Publish 2–3 write-ups on idvork.in — a reproduced attack, a tool comparison, a methodologyHiring managers believe public artifacts over resumes.
Submit to Anthropic's HackerOne program or OpenAI's bounty; chase a paid findingCloses the loop: a real, cited result.