A technical field map for engineers crossing into AI safety
Capability is racing. Can we keep it aligned?
As models get more capable, the hard part isn't making them do more — it's making them reliably want what we want, and proving it. This is the map: the failure modes, the canon, the alignment techniques, interpretability, evals, and governance — with a page on each and the source pinned at the top.
align@frontier:~$name the failure · read the canon · learn the techniques · measure the risk
Why alignment is hard — and why capability doesn't solve it
We train models by optimizing a proxy — a loss, a reward, a preference model — not the thing we actually want. A capable optimizer will exploit every gap between the proxy and our intent, and the gaps don't shrink as the model gets smarter; they get harder to see. The result is a control problem that's distinct from a capability problem: a more powerful system is not automatically a safer one.
Three ideas to hold the whole map together: outer alignment (is the objective we specified actually what we want?), inner alignment (does the trained model actually pursue that objective, or a correlated proxy?), and scalable oversight (how do we supervise systems that know more than we do?). Almost every page here is an attack on, or a defense of, one of these three.
01 the failure modes
How aligned-looking models go wrong
These are the named ways the proxy-vs-intent gap bites. Learn them the way a security person knows vuln classes — they're how you reason about what could break. Each links to a one-page explainer.
Instrumental convergence: for a wide range of goals, gaining resources, self-preservation, and option-value are useful sub-goals. Hard to train out.
02 the canon
Read the papers everyone cites
A small, knowable canon underpins the whole field. Know these by name, what they showed, and what they changed. Each links to a one-page explainer with the paper at the top.
The founding taxonomy — five concrete failure modes (avoiding side effects, reward hacking, scalable oversight, safe exploration, distributional shift) that still frame the field.
Backdoored deceptive behavior survived safety training — including adversarial training, which sometimes just taught the model to hide it better.
This is a starting six, not the whole shelf. Risks from Learned Optimization (the mesa-optimization paper) sits under failures; the interpretability canon (superposition, monosemanticity, circuits) lives under interpretability.
03 alignment approaches
The techniques we actually train with
How the labs try to close the gap today — and where each one runs out of road. Each links to a one-page explainer.
Adversarial pressure to surface failures before deployment — the offensive complement to alignment. (See the sister field map for the attack craft.)
Constitutional AI and Debate are approaches too — they're written up under the canon because each is anchored by one defining paper.
04 interpretability
Open the box — read the weights, not just the outputs
If we could read a model's internals, we could catch deception and misalignment that behavioral tests miss. Mechanistic interpretability is the bet that neural nets are understandable in terms of human-legible features and circuits. Each links to a one-page explainer.
Why a model packs more features than it has neurons — and why that makes individual neurons polysemantic and hard to read. ("Toy Models of Superposition.")
Concrete reverse-engineered mechanisms — induction heads and in-context learning — and the transformer-circuits framework behind them.
05 evals & dangerous capabilities
Measure the risk before you ship it
You can't manage what you can't measure. Dangerous-capability evals and third-party testing turn "is this model safe?" into numbers a decision can ride on. Each links to a one-page explainer.
Frontier-safety frameworks that tie capability thresholds to required safeguards — Anthropic's RSP, OpenAI's Preparedness, DeepMind's FSF.
06 governance & the orgs
Who's steering, and with what rules
Technical work doesn't ship safety on its own — policy, institutions, and incentives decide whether it's used. The shape of the field in one paragraph:
The EU AI Act (full compliance Aug 2026) mandates adversarial testing for high-risk and frontier systems — the main regulatory driver today.
07 getting started
A sequenced ramp-up
A dependency chain, not a reading list: build the mental model, learn the failure modes, then the techniques, then go hands-on. Check items off — your progress is saved on this device.
RAMP_STATUS
0%
Phase 1 — Mental model weeks 1–2
Read "Concrete Problems in AI Safety" and internalize outer vs inner alignmentThe whole map hangs off these three ideas.
Learn the failure modes cold: spec gaming, reward hacking, goal misgeneralization, deceptive alignmentHow you reason about what could break.
Phase 2 — The techniques weeks 3–5
Work through RLHF → Constitutional AI → scalable oversight (debate, weak-to-strong)How alignment is actually trained, and where it breaks.
Read the Sleeper Agents paper and sit with why adversarial training didn't fix itThe sharpest current evidence that behavioral safety isn't enough.
Phase 3 — Go deep on one lane weeks 6–9
Pick interpretability OR evals and build something — train a small SAE, or run a dangerous-capability evalDepth in one lane beats a survey of all of them.
Follow the Alignment Forum + one lab's safety blog; reproduce one resultThe field moves monthly — plug into the live conversation.
Phase 4 — Contribute weeks 10+
Write up what you found, or apply to a lab/nonprofit/MATS-style programPublic artifacts and reproductions are how you get taken seriously.