RLHF
RLHF aligns a language model to human intent by learning a reward signal from human preference judgments, then optimizing the model against that learned reward — turning the unwriteable spec "be helpful and harmless" into something a policy can be trained on.
The pipeline
Three stages, run in sequence on a pretrained base model:
1. Supervised fine-tuning (SFT). Fine-tune the base model on a dataset of high-quality demonstrations (human-written responses to prompts) to get a reasonable starting policy.
2. Reward model (RM). Collect human preference comparisons (labelers rank several model outputs for the same prompt) and train a separate model to predict which output a human would prefer. The RM compresses messy human judgment into a single scalar reward.
3. RL optimization. Use reinforcement learning, typically PPO, to optimize the SFT policy to maximize the RM's predicted reward, with a KL penalty against the SFT model to keep outputs from drifting into reward-model blind spots.
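Stages 2 and 3 can be made concrete with a minimal sketch. It assumes the pairwise (Bradley-Terry style) loss commonly used to train reward models on preference comparisons, and a KL penalty applied as a per-sample log-probability ratio scaled by a coefficient `beta`. The function names, the default `beta=0.1`, and the example numbers are illustrative assumptions, not taken from any particular implementation.

```python
import math


def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    output well above the rejected one, and large when it gets the
    ordering wrong.
    """
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_shaped_reward(
    rm_score: float,
    logp_policy: float,
    logp_sft: float,
    beta: float = 0.1,  # hypothetical penalty strength
) -> float:
    """Reward passed to the RL step: RM score minus a KL-style penalty.

    (logp_policy - logp_sft) is a single-sample estimate of the KL
    divergence between the current policy and the SFT model.
    """
    return rm_score - beta * (logp_policy - logp_sft)


# No preference margin: the loss equals log(2), about 0.693.
print(pairwise_rm_loss(0.0, 0.0))

# Correct ordering is penalized less than the reversed ordering.
print(pairwise_rm_loss(2.0, 0.0), pairwise_rm_loss(0.0, 2.0))

# The policy assigns more probability than the SFT model to this output,
# so the shaped reward drops below the raw RM score (about 0.9 here).
print(kl_shaped_reward(1.0, logp_policy=-0.5, logp_sft=-1.5))
```

In this sketch a larger `beta` keeps the policy closer to the SFT model at the cost of leaving some reward-model score unclaimed.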
Why it works
"Be helpful" is far easier to recognize than to write down. RLHF exploits that asymmetry: humans need only compare outputs, and the reward model generalizes those comparisons to unseen prompts. The payoff is large. In the InstructGPT study, human labelers preferred outputs from a 1.3B-parameter RLHF-trained model over those from the 175B-parameter GPT-3 base model, even though the aligned model has roughly 100× fewer parameters. The same training also improved truthfulness and reduced toxic generations, with little regression on standard NLP benchmarks.
Where it runs out of road
RLHF is bounded by what humans can evaluate. Its known failure modes:
- Reward-model overoptimization (Goodharting). Push PPO hard enough and the policy exploits gaps in the RM, scoring high reward while degrading on true quality — the KL penalty mitigates but doesn't eliminate this.
- Sycophancy. Optimizing for human approval rewards answers that sound agreeable or confident over answers that are correct.
- Evaluation ceiling. On tasks too hard for labelers to judge, the preference signal becomes unreliable. RLHF also does nothing to surface deceptive alignment, where a model behaves well only while supervised.
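The overoptimization failure mode can be illustrated with a deliberately artificial toy, not a model of any real system. Everything here is an assumption made for illustration: `theta` stands for how far the policy has drifted from the SFT model, the proxy reward keeps paying for more drift (a reward-model blind spot), the true quality peaks at `theta = 1` and then falls, and the KL cost is taken to grow as `0.5 * theta**2`.

```python
def proxy_reward(theta: float) -> float:
    """Hypothetical RM score: keeps rising with drift (the blind spot)."""
    return theta


def true_quality(theta: float) -> float:
    """Hypothetical true quality: improves at first, peaks at theta = 1, then degrades."""
    return theta - 0.5 * theta ** 2


def kl_cost(theta: float) -> float:
    """Toy stand-in for KL(policy || SFT), growing quadratically with drift."""
    return 0.5 * theta ** 2


def best_theta(beta: float, grid: list[float]) -> float:
    """Drift that maximizes the KL-penalized proxy objective over a grid."""
    return max(grid, key=lambda t: proxy_reward(t) - beta * kl_cost(t))


grid = [i / 100 for i in range(601)]  # candidate drifts in [0, 6]

# Weaker KL penalty = harder optimization against the proxy.
for beta in (1.0, 0.5, 0.2):
    theta = best_theta(beta, grid)
    print(f"beta={beta}: drift={theta:.2f}, "
          f"proxy={proxy_reward(theta):.2f}, true={true_quality(theta):.2f}")
```

As `beta` shrinks in this toy, the proxy reward at the optimum keeps climbing while the true quality falls, which is the pattern the first bullet describes. With these particular functions a large enough `beta` happens to stop at the true optimum; real reward models offer no such guarantee.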
These limits motivate scalable oversight and approaches that reduce the human-labeling bottleneck, such as Constitutional AI / RLAIF. The technique's roots trace to Deep RL from Human Preferences (Christiano et al., 2017).