A technical field map for engineers crossing into AI safety

Capability is racing.
Can we keep it aligned?

As models get more capable, the hard part isn't making them do more — it's making them reliably want what we want, and proving it. This is the map: the failure modes, the canon, the alignment techniques, interpretability, evals, and governance — with a page on each and the source pinned at the top.

align@frontier:~$ name the failure · read the canon · learn the techniques · measure the risk
▲ amber = capability / risk ▼ cyan = alignment / safety
00 the problem

Why alignment is hard — and why capability doesn't solve it

We train models by optimizing a proxy — a loss, a reward, a preference model — not the thing we actually want. A capable optimizer will exploit every gap between the proxy and our intent, and the gaps don't shrink as the model gets smarter; they get harder to see. The result is a control problem that's distinct from a capability problem: a more powerful system is not automatically a safer one.

Three ideas to hold the whole map together: outer alignment (is the objective we specified actually what we want?), inner alignment (does the trained model actually pursue that objective, or a correlated proxy?), and scalable oversight (how do we supervise systems that know more than we do?). Almost every page here is an attack on, or a defense of, one of these three.

01 the failure modes

How aligned-looking models go wrong

These are the named ways the proxy-vs-intent gap bites. Learn them the way a security person knows vuln classes — they're how you reason about what could break. Each links to a one-page explainer.

F01

Specification Gaming

The model satisfies the literal objective while violating its intent — the boat that spins in circles collecting points instead of finishing the race.

F02

Reward Hacking

Exploiting flaws in the reward signal itself — tampering with the metric, the sensor, or the human rater rather than doing the task.

F03

Goal Misgeneralization

Capabilities generalize out-of-distribution but the goal doesn't — the model competently pursues the wrong objective it learned in training.

F04

Mesa-Optimization

Training produces an inner optimizer with its own objective. Inner alignment asks whether that learned goal matches the one we trained for.

F05

Deceptive Alignment

A model that behaves well because it's being watched — performing alignment during training to preserve a different goal for deployment.

F06

Power-Seeking

Instrumental convergence: for a wide range of goals, gaining resources, self-preservation, and option-value are useful sub-goals. Hard to train out.

02 the canon

Read the papers everyone cites

A small, knowable canon underpins the whole field. Know these by name, what they showed, and what they changed. Each links to a one-page explainer with the paper at the top.

Concrete Problems in AI SafetyAmodei et al. · 2016
The founding taxonomy — five concrete failure modes (avoiding side effects, reward hacking, scalable oversight, safe exploration, distributional shift) that still frame the field.
Deep RL from Human PreferencesChristiano et al. · 2017
Learn a reward model from human comparisons instead of hand-coding it. The technical seed of RLHF and modern preference tuning.
AI Safety via DebateIrving et al. · 2018
A scalable-oversight proposal: two AIs argue, a human judges. Bet that judging an argument is easier than producing the answer.
Constitutional AIBai et al. · 2022
Align with a written set of principles and AI-generated feedback (RLAIF) instead of human labels for every harm — the basis of Claude's training.
Weak-to-Strong GeneralizationBurns et al. · 2023
Can a weak supervisor elicit the full capability of a stronger model? A concrete empirical handle on the superalignment problem.
Sleeper AgentsHubinger et al. · 2024
Backdoored deceptive behavior survived safety training — including adversarial training, which sometimes just taught the model to hide it better.
This is a starting six, not the whole shelf. Risks from Learned Optimization (the mesa-optimization paper) sits under failures; the interpretability canon (superposition, monosemanticity, circuits) lives under interpretability.
03 alignment approaches

The techniques we actually train with

How the labs try to close the gap today — and where each one runs out of road. Each links to a one-page explainer.

RLHF

Reinforcement learning from human feedback — the workhorse behind InstructGPT and most chat models. Powerful, but bounded by what humans can rate.

Scalable Oversight

Supervising systems smarter than us: debate, recursive reward modeling, IDA, and weak-to-strong. The central bet of superalignment.

Red Teaming

Adversarial pressure to surface failures before deployment — the offensive complement to alignment. (See the sister field map for the attack craft.)

Constitutional AI and Debate are approaches too — they're written up under the canon because each is anchored by one defining paper.
04 interpretability

Open the box — read the weights, not just the outputs

If we could read a model's internals, we could catch deception and misalignment that behavioral tests miss. Mechanistic interpretability is the bet that neural nets are understandable in terms of human-legible features and circuits. Each links to a one-page explainer.

Mechanistic Interpretability

The program: reverse-engineer the algorithms a network learned, feature by feature and circuit by circuit. Why it matters for catching deception.

Superposition

Why a model packs more features than it has neurons — and why that makes individual neurons polysemantic and hard to read. ("Toy Models of Superposition.")

Sparse Autoencoders

Dictionary learning to pull monosemantic features back out of superposition — the current workhorse for finding interpretable directions.

Circuits & Induction Heads

Concrete reverse-engineered mechanisms — induction heads and in-context learning — and the transformer-circuits framework behind them.

05 evals & dangerous capabilities

Measure the risk before you ship it

You can't manage what you can't measure. Dangerous-capability evals and third-party testing turn "is this model safe?" into numbers a decision can ride on. Each links to a one-page explainer.

Dangerous-Capability Evals

Structured tests for cyber, bio/CBRN, persuasion, autonomy, and self-replication — measuring what a model could do if misused.

METR

The independent evaluator known for autonomy and task-horizon evals (the "how long a task can an agent do" curve). The reference third party.

AI Safety Institutes

Government testing bodies (UK AISI, US CAISI, and a growing network) doing pre-deployment evals and building shared methodology.

Responsible Scaling / RSPs

Frontier-safety frameworks that tie capability thresholds to required safeguards — Anthropic's RSP, OpenAI's Preparedness, DeepMind's FSF.

06 governance & the orgs

Who's steering, and with what rules

Technical work doesn't ship safety on its own — policy, institutions, and incentives decide whether it's used. The shape of the field in one paragraph:

LABS

Frontier labs

Anthropic, OpenAI, Google DeepMind — each with a published scaling/preparedness framework.

PUBLIC

Institutes & nonprofits

The AI Safety Institutes, METR, Apollo Research, Alignment Forum community.

LAW

Regulation

The EU AI Act (full compliance Aug 2026) mandates adversarial testing for high-risk and frontier systems — the main regulatory driver today.

07 getting started

A sequenced ramp-up

A dependency chain, not a reading list: build the mental model, learn the failure modes, then the techniques, then go hands-on. Check items off — your progress is saved on this device.

RAMP_STATUS
0%
Phase 1 — Mental model weeks 1–2
Read "Concrete Problems in AI Safety" and internalize outer vs inner alignmentThe whole map hangs off these three ideas.
Learn the failure modes cold: spec gaming, reward hacking, goal misgeneralization, deceptive alignmentHow you reason about what could break.
Phase 2 — The techniques weeks 3–5
Work through RLHF → Constitutional AI → scalable oversight (debate, weak-to-strong)How alignment is actually trained, and where it breaks.
Read the Sleeper Agents paper and sit with why adversarial training didn't fix itThe sharpest current evidence that behavioral safety isn't enough.
Phase 3 — Go deep on one lane weeks 6–9
Pick interpretability OR evals and build something — train a small SAE, or run a dangerous-capability evalDepth in one lane beats a survey of all of them.
Follow the Alignment Forum + one lab's safety blog; reproduce one resultThe field moves monthly — plug into the live conversation.
Phase 4 — Contribute weeks 10+
Write up what you found, or apply to a lab/nonprofit/MATS-style programPublic artifacts and reproductions are how you get taken seriously.
references · current as of June 2026

Where this came from