AI SAFETY // CANON
paper · 2017

Deep RL from Human Preferences

Christiano, Leike, Brown, Martic, Legg, Amodei · 2017

Instead of hand-coding a reward function, learn one: fit a reward model to humans' pairwise comparisons of short behavior clips, then optimize a policy against that learned reward.

What it showed

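A deep RL agent can master tasks with no access to the true reward, supervised only by a human picking which of two ~1–2 second trajectory clips looks better. The comparisons train a reward model that stands in for the environment reward. With this signal the method solved Atari games and simulated robot locomotion tasks using human feedback on less than 1% of the agent's interactions with the environment, and learned novel behaviors such as a backflip from about an hour of human time.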

How it works

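The system runs a loop of three processes that feed one another: a policy that acts in the environment and is trained by RL against the predicted reward, a human who compares pairs of short clips drawn from the policy's trajectories, and a reward model fit to those comparisons by supervised learning.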
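The loop can be sketched in a toy setting. This is a minimal illustration under loud assumptions, not the paper's implementation: the "environment" is a three-action toy, the reward model is a lookup table rather than a neural network, the human is replaced by a simulated rater that reads a hidden true reward, and the policy update is an exact softmax gradient rather than a deep RL algorithm. All names (`run_loop`, `simulated_rater`, `TRUE_REWARD`, and so on) are hypothetical.

```python
import math
import random

# Hidden per-action reward. Only the simulated rater sees it (assumption:
# stands in for a human's judgment); the agent never reads it directly.
TRUE_REWARD = [0.0, 0.5, 1.0]
CLIP_LEN = 5  # number of actions per clip (toy stand-in for a short clip)


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def preference_prob(r_hat, clip_a, clip_b):
    # Predicted probability that clip A is preferred: a logistic function
    # of the difference in summed predicted reward (Bradley-Terry style).
    score_a = sum(r_hat[a] for a in clip_a)
    score_b = sum(r_hat[a] for a in clip_b)
    return sigmoid(score_a - score_b)


def simulated_rater(clip_a, clip_b):
    # Returns 1.0 if A is better, 0.0 if B is better, 0.5 for a tie.
    sa = sum(TRUE_REWARD[a] for a in clip_a)
    sb = sum(TRUE_REWARD[a] for a in clip_b)
    if sa == sb:
        return 0.5
    return 1.0 if sa > sb else 0.0


def run_loop(seed=0, iterations=500, reward_lr=0.1, policy_lr=0.2):
    rng = random.Random(seed)
    n = len(TRUE_REWARD)
    logits = [0.0] * n  # policy parameters
    r_hat = [0.0] * n   # learned reward table
    for _ in range(iterations):
        probs = softmax(logits)
        # 1. The policy produces two clips.
        clip_a = rng.choices(range(n), weights=probs, k=CLIP_LEN)
        clip_b = rng.choices(range(n), weights=probs, k=CLIP_LEN)
        # 2. The rater compares them.
        label = simulated_rater(clip_a, clip_b)
        # 3. Reward model: one gradient step on the cross-entropy loss.
        p = preference_prob(r_hat, clip_a, clip_b)
        for a in range(n):
            diff = clip_a.count(a) - clip_b.count(a)
            r_hat[a] -= reward_lr * (p - label) * diff
        # 4. Policy: ascend the expected *predicted* reward.
        baseline = sum(pi * r for pi, r in zip(probs, r_hat))
        for a in range(n):
            logits[a] += policy_lr * probs[a] * (r_hat[a] - baseline)
    return r_hat, softmax(logits)


r_hat, probs = run_loop()
print("learned reward table:", [round(r, 2) for r in r_hat])
print("policy probabilities:", [round(p, 3) for p in probs])
```

The policy update in step 4 only ever sees `r_hat`, never `TRUE_REWARD`, which mirrors the setup described above: the learned reward stands in for the environment reward.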

Comparisons are far more sample-efficient than they look: a few hundred to a few thousand binary judgments shape behavior that would otherwise need a carefully engineered, often-gameable reward function. Asking "which is better?" is also cheaper and more robust for humans than assigning absolute numeric rewards.
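Each binary judgment can be turned into a training signal with a Bradley-Terry-style preference predictor and a cross-entropy loss. The notation below is assumed for illustration and does not appear in the text above: $\hat{r}$ is the learned reward, $\sigma^1$ and $\sigma^2$ are the two clips with observations $o_t$ and actions $a_t$, $\mathcal{D}$ is the set of collected comparisons, and $\mu$ is the rater's label as a distribution over $\{1, 2\}$ (all mass on the preferred clip, or uniform when the rater marks them equal).

```latex
\begin{align}
\hat{P}\left[\sigma^1 \succ \sigma^2\right]
  &= \frac{\exp \sum_t \hat{r}(o_t^1, a_t^1)}
          {\exp \sum_t \hat{r}(o_t^1, a_t^1) + \exp \sum_t \hat{r}(o_t^2, a_t^2)} \\
\mathcal{L}(\hat{r})
  &= - \sum_{(\sigma^1, \sigma^2, \mu) \in \mathcal{D}}
     \Big( \mu(1) \log \hat{P}\left[\sigma^1 \succ \sigma^2\right]
         + \mu(2) \log \hat{P}\left[\sigma^2 \succ \sigma^1\right] \Big)
\end{align}
```

Only the difference in summed predicted reward between the two clips enters the first expression, which is one way to see why relative judgments suffice: the rater never has to commit to an absolute scale.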

Why it mattered

This is the technical seed of RLHF. The core recipe (a reward model learned from human preference comparisons, then policy optimization against it) is what later scaled to language models in InstructGPT and became the dominant approach to aligning modern LLMs with human intent.

Limits

The learned reward is only ever as good as the comparisons behind it. Humans must be able to judge the clips they see, so the method is bounded by what a rater can evaluate in a short window — and a policy will happily exploit gaps where the reward model misjudges (reward hacking). Tasks too complex, slow, or subtle for a human to assess from a clip fall outside its reach, which is precisely the gap that motivates scalable oversight.