
RLHF

Reinforcement Learning from Human Feedback

RLHF aligns a language model to human intent by learning a reward signal from human preference judgments, then optimizing the model against that learned reward. This turns the unwritable spec "be helpful and harmless" into something a policy can be trained on.

The pipeline

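Three stages, run in sequence on a pretrained base model:

1. Supervised fine-tuning (SFT): fine-tune the base model on human-written demonstrations of the desired behavior.
2. Reward modeling: collect human comparisons between model outputs and train a reward model to predict which output is preferred.
3. Reinforcement learning: optimize the fine-tuned policy against the reward model, typically with PPO.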
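
The final step, optimizing the policy against the learned reward, is commonly implemented with a penalty that keeps the policy close to its starting point. The sketch below assumes that formulation: the reward-model score minus beta times a sampled estimate of the KL divergence from a frozen reference model. The function name `kl_shaped_reward`, its parameters, and the default `beta` are hypothetical and chosen only for illustration; the page itself does not specify them.

```python
from typing import Sequence


def kl_shaped_reward(
    rm_score: float,
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    beta: float = 0.1,  # assumed penalty strength, not a recommended value
) -> float:
    """Reward-model score minus a KL penalty toward the reference model.

    Both log-prob sequences are per-token log-probabilities of the SAME
    sampled response, scored by the current policy and by the frozen
    reference model respectively.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-prob sequences must cover the same tokens")

    # Single-sample estimate of KL(policy || reference) on this response:
    # the sum over tokens of log pi(token) - log pi_ref(token).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))

    return rm_score - beta * kl_estimate


# A response the policy now rates as more likely than the reference did
# is penalized in proportion to beta.
shaped = kl_shaped_reward(1.0, [-0.5, -0.5], [-1.0, -1.0], beta=0.1)
```

With identical log-probabilities the penalty vanishes and the shaped reward equals the reward-model score; a larger `beta` trades reward for staying closer to the reference model.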

Why it works

"Be helpful" is far easier to recognize than to write down. RLHF exploits that asymmetry: humans need only compare outputs, and the reward model generalizes those comparisons to unseen prompts. The payoff is large. InstructGPT showed that human labelers preferred outputs from a 1.3B-parameter aligned model over those of the 175B-parameter GPT-3 base model, despite the aligned model having roughly 100× fewer parameters. The aligned model was also more truthful and produced fewer toxic generations, with little regression on standard NLP benchmarks.

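One common way to turn pairwise comparisons into a trainable signal is the Bradley-Terry loss: the reward model is pushed to score the human-preferred output above the rejected one. The sketch below assumes that formulation and uses plain floats in place of reward-model outputs; the function names are hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen output beats the rejected one.

    Equals -log(sigmoid(reward_chosen - reward_rejected)), computed in a
    numerically stable way.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite log(1 + e^{-m}) as -m + log(1 + e^{m}).
    return -margin + math.log1p(math.exp(margin))


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Modeled probability that a labeler prefers output A over output B."""
    return math.exp(-pairwise_preference_loss(reward_a, reward_b))


# The loss is small when the reward model agrees with the labeler
# and large when it ranks the pair the wrong way round.
agree = pairwise_preference_loss(2.0, 0.0)
disagree = pairwise_preference_loss(0.0, 2.0)
```

Only the difference between the two scores matters, which is why labelers never need to assign absolute ratings: comparisons alone determine the loss.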
Where it runs out of road

RLHF is bounded by what humans can evaluate. Its known failure modes:

These limits motivate scalable oversight and approaches that reduce the human-labeling bottleneck, such as Constitutional AI / RLAIF. The technique's roots trace to Deep RL from Human Preferences (Christiano et al., 2017).