The canon · Paper · 2016

Concrete Problems in AI Safety

Amodei, Olah, Steinhardt, Christiano, Schulman, Mané · 2016

Paper: Concrete Problems in AI Safety (arXiv)

A research agenda that recasts "AI safety" as five tractable engineering problems about accidents in real machine-learning systems: unintended, harmful behavior that emerges from poorly specified objectives or a poorly understood learning process.

The five problems

The paper groups failures by where the misbehavior comes from — the wrong objective, an objective too costly to evaluate, or the learning process itself:

Avoiding negative side effects: don't disrupt the environment in pursuit of a narrow goal (the cleaning robot that knocks over a vase). Category: wrong objective, under-specified.
Avoiding reward hacking: don't game a proxy reward instead of achieving the intended outcome. Category: wrong objective, exploitable.
Scalable oversight: train effectively when the true objective is too expensive to evaluate on every action, making efficient use of limited human feedback. Category: objective too costly to evaluate.
Safe exploration: let an agent try new behaviors in order to learn without taking catastrophic or irreversible actions along the way. Category: learning process.
Robustness to distributional shift: behave well on inputs that differ from the training distribution, and recognize when performance is degrading. Category: learning process.
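
As one illustration of the first problem, the sketch below encodes the cleaning-robot example in Python. It is a toy under stated assumptions, not the paper's method: the `Plan` fields, the `STEP_COST` value, and the penalty `weight` are all hypothetical, and `impact_penalty` is only a crude stand-in for the impact-regularizer idea the paper discusses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A candidate behavior for a hypothetical cleaning robot."""
    name: str
    steps: int          # time steps needed to reach and clean the dirt
    dirt_cleaned: int   # units of dirt removed (the specified task)
    vases_broken: int   # side effect that the task reward never mentions


STEP_COST = 0.1  # assumed per-step cost; chosen only for illustration


def task_reward(plan: Plan) -> float:
    """Under-specified objective: pays for cleaning, charges only for time."""
    return plan.dirt_cleaned - STEP_COST * plan.steps


def impact_penalty(plan: Plan, weight: float) -> float:
    """Hypothetical impact term: charge for each non-task object changed."""
    return weight * plan.vases_broken


def best_plan(plans, weight: float = 0.0) -> Plan:
    """Pick the plan that maximizes task reward minus the impact penalty."""
    return max(plans, key=lambda p: task_reward(p) - impact_penalty(p, weight))


PLANS = [
    Plan("through_vase", steps=4, dirt_cleaned=1, vases_broken=1),
    Plan("around_vase", steps=7, dirt_cleaned=1, vases_broken=0),
]

if __name__ == "__main__":
    # With no penalty the shortcut through the vase scores higher.
    print(best_plan(PLANS).name)
    # With a penalty weight of 1.0 the longer, harmless route scores higher.
    print(best_plan(PLANS, weight=1.0).name)
```

With `weight=0.0` the shorter route wins even though it breaks the vase, because nothing in the objective mentions vases. With `weight=1.0` the detour wins. The toy leaves out the hard parts: deciding what counts as impact, and choosing a weight that does not simply make the agent refuse to act.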

Why it mattered

It moved safety from a speculative, sci-fi "superintelligence" framing toward concrete ML failure modes that a working researcher could study with current systems. It popularized much of the field's shared vocabulary ("reward hacking," "side effects," "distributional shift," "scalable oversight") and argued that these problems, though forward-looking, can already be attacked empirically rather than waiting on future AGI.

Where it stands now

All five remain live, and each has a recognizable counterpart among today's frontier-LLM failure modes. Reward hacking and scalable oversight are now central to RLHF and to the agenda around RLAIF, debate, and weak-to-strong generalization, all of which respond to the same difficulty: humans cannot reliably grade outputs that exceed their own ability to evaluate. Distributional shift reappears as jailbreaks, prompt injection, and refusal failures on out-of-distribution inputs. Side effects and safe exploration grow more acute as models gain tool use and act as autonomous agents. The taxonomy aged well; the open questions just moved up the capability curve.