AI SAFETY // CANON
← back to the map
the canonpaper · 2016

Concrete Problems in AI Safety

Amodei, Olah, Steinhardt, Christiano, Schulman, Mané · 2016

A research agenda that recasts "AI safety" as five tractable engineering problems about accidents — unintended, harmful behavior emerging from poorly specified objectives or poorly understood learning — in real machine-learning systems.

The five problems

The paper groups failures by where the misbehavior comes from — the wrong objective, an objective too costly to evaluate, or the learning process itself:

Why it mattered

It moved safety from speculative, sci-fi "superintelligence" framing into concrete ML failure modes a working researcher could study today with current systems. It supplied the field's shared vocabulary — "reward hacking," "side effects," "distributional shift," "scalable oversight" — and made the case that these problems are forward-looking but already empirically attackable, not contingent on future AGI.

Where it stands now

All five remain live, and they map cleanly onto today's frontier-LLM failure modes. Reward hacking and scalable oversight are now central to RLHF and the agenda around RLAIF, debate, and weak-to-strong generalization — humans can't reliably grade superhuman outputs. Distributional shift reappears as jailbreaks, prompt injection, and out-of-distribution refusal failures. Side effects and safe exploration are increasingly acute as models gain tool use and act as autonomous agents. The taxonomy aged well; the open questions just moved up the capability curve.