the canon · paper · 2017

Deep RL from Human Preferences

Christiano, Leike, Brown, Martic, Legg, Amodei · 2017

Paper: Deep RL from Human Preferences (arXiv) ↗

Instead of hand-coding a reward function, learn one: fit a reward model to humans' pairwise comparisons of short behavior clips, then optimize a policy against that learned reward.

What it showed

A deep RL agent can master tasks with no access to the true reward, supervised only by a human picking which of two ~1–2 second trajectory clips looks better. The comparisons train a reward model that stands in for the environment reward. With this signal the method:

Approached, and on several tasks matched or beat, reward-trained baselines across Atari and MuJoCo (simulated robot) tasks while using feedback on under 1% of the agent's interactions.
Taught behaviors with no natural reward function at all — the canonical example being a simulated Hopper learning a backflip from roughly 900 human comparisons (about an hour of a non-expert's time).

How it works

The system runs a loop of four steps that are kept in sync:

Collect clips: the current policy acts; pairs of trajectory segments are sampled and shown to a human.
Human compares: the rater picks the better clip, or marks the pair as equal or incomparable. No scores, no demonstrations, just preferences.
Fit the reward model: a network r̂ is trained so that the clip the human preferred gets the higher summed predicted reward, via a Bradley–Terry (softmax) cross-entropy loss over the pair.
Optimize the policy: standard deep RL (A2C for Atari, TRPO for MuJoCo) maximizes the predicted reward, generating fresh, on-distribution clips for the next round.
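
The reward-model step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a hypothetical one-parameter linear reward r̂(obs) = w · obs over scalar observations, and the function names and toy data are made up for the example. A label of 0.5 stands for a pair the rater marked as equal; pairs marked incomparable are assumed to be dropped before training.

```python
import math

def preference_prob(return_a, return_b):
    """Bradley-Terry probability that clip A is preferred over clip B.

    Equivalent to a softmax over the two predicted returns, written as a
    sigmoid of their difference for numerical stability.
    """
    diff = return_a - return_b
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)

def preference_loss(return_a, return_b, mu_a):
    """Cross-entropy between the human label and the predicted preference.

    mu_a is the label mass on clip A: 1.0 if A was preferred, 0.0 if B was,
    0.5 if the rater marked the clips as equal.
    """
    p_a = preference_prob(return_a, return_b)
    eps = 1e-12  # guards against log(0)
    return -(mu_a * math.log(p_a + eps)
             + (1.0 - mu_a) * math.log(1.0 - p_a + eps))

def fit_linear_reward(comparisons, steps=500, lr=0.1):
    """Fit the toy reward r(obs) = w * obs by gradient descent on the loss.

    Each comparison is (clip_a, clip_b, mu_a), where a clip is a list of
    scalar observations and its predicted return is w * sum(clip).
    """
    w = 0.0
    for _ in range(steps):
        grad = 0.0
        for clip_a, clip_b, mu_a in comparisons:
            p_a = preference_prob(w * sum(clip_a), w * sum(clip_b))
            # d(loss)/dw = (p_a - mu_a) * (sum(clip_a) - sum(clip_b))
            grad += (p_a - mu_a) * (sum(clip_a) - sum(clip_b))
        w -= lr * grad / len(comparisons)
    return w

# Hypothetical labeled pairs: larger observations are what the rater likes.
comparisons = [
    ([0.1, 0.2], [0.9, 0.8], 0.0),  # rater preferred clip B
    ([0.7, 0.9], [0.2, 0.1], 1.0),  # rater preferred clip A
    ([0.5, 0.5], [0.4, 0.6], 0.5),  # rater marked the clips as equal
]
w = fit_linear_reward(comparisons)
```

After fitting, w is positive, so clips with larger observations receive higher predicted return, in line with the labels. A real system would replace the linear model with a neural network and hand the learned reward to the RL algorithm in the final step.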

Comparisons are far more sample-efficient than they look: a few hundred to a few thousand binary judgments shape behavior that would otherwise need a carefully engineered, often-gameable reward function. Asking "which is better?" is also cheaper and more robust for humans than assigning absolute numeric rewards.

Why it mattered

This is the technical seed of RLHF. The core recipe (a reward model learned from human preference comparisons, then policy optimization against it) is what later scaled to language models in InstructGPT and became the dominant approach to aligning modern LLMs with human intent.

InstructGPT — Training language models to follow instructions with human feedback (arXiv 2203.02155) ↗

Limits

The learned reward is only ever as good as the comparisons behind it. Humans must be able to judge the clips they see, so the method is bounded by what a rater can evaluate in a short window — and a policy will happily exploit gaps where the reward model misjudges (reward hacking). Tasks too complex, slow, or subtle for a human to assess from a clip fall outside its reach, which is precisely the gap that motivates scalable oversight.
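
A toy illustration of the reward-hacking failure mode, with made-up numbers: when the learned reward overrates one behavior, a policy that greedily maximizes the learned reward picks that behavior even though its true value is low. The behavior names and reward values below are hypothetical and do not come from the paper.

```python
# Hypothetical true and learned rewards for three candidate behaviors.
true_reward = {"walk": 1.0, "run": 2.0, "wiggle_in_place": 0.1}
# The learned model misjudges "wiggle_in_place" and rates it highest.
learned_reward = {"walk": 0.9, "run": 1.8, "wiggle_in_place": 2.5}

def greedy_choice(reward_table):
    """Return the behavior with the highest reward under the given table."""
    return max(reward_table, key=reward_table.get)

chosen = greedy_choice(learned_reward)  # what the optimized policy does
best = greedy_choice(true_reward)       # what we actually wanted
regret = true_reward[best] - true_reward[chosen]
```

Here the policy settles on the overrated behavior, and the regret measures how much true reward that choice gives up. Collecting fresh comparisons on the policy's current behavior, as the loop above does, is one way such errors get a chance to be corrected.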