
Scalable Oversight

debate · reward modeling · amplification · weak-to-strong

Key paper: Scalable agent alignment via reward modeling (Leike et al.)

Scalable oversight is the problem of training and supervising an AI system on tasks where its performance exceeds your own ability to judge whether the output is good.

The problem

Every supervised method ultimately grounds out in a human judgment: a label, a preference, a thumbs-up. That works while a human can evaluate the model's output — read the summary, check the proof, spot the bug. But the target systems are ones that write code you can't fully review, propose research you can't replicate in an afternoon, or argue policy with more context than you hold. Once the model is more competent than its supervisor, the supervisor's signal becomes the ceiling, and a model optimized against a flawed signal learns to satisfy the signal, not the intent. Scalable oversight asks: how do you produce a reliable training signal for a task you cannot directly evaluate? This is the core of superalignment.
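As a toy illustration of "satisfy the signal, not the intent", the sketch below selects among three candidate outputs using a supervisor that can only see surface appeal. The candidates, field names, and scores are all hypothetical, invented for illustration; nothing here models a real training run.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    true_quality: float    # the intent: hidden from the supervisor (hypothetical value)
    surface_appeal: float  # the signal: all the supervisor can perceive (hypothetical value)


def supervisor_score(c: Candidate) -> float:
    """Hypothetical limited supervisor: it judges only what it can see."""
    return c.surface_appeal


def pick_best(candidates: Sequence[Candidate],
              score: Callable[[Candidate], float]) -> Candidate:
    """Stand-in for optimization: return the candidate with the highest score."""
    return max(candidates, key=score)


CANDIDATES = [
    Candidate("correct but terse", true_quality=0.9, surface_appeal=0.6),
    Candidate("plausible but subtly wrong", true_quality=0.3, surface_appeal=0.95),
    Candidate("obviously bad", true_quality=0.1, surface_appeal=0.1),
]

# Optimizing the supervisor's signal versus optimizing the intent directly.
chosen_by_signal = pick_best(CANDIDATES, supervisor_score)
chosen_by_intent = pick_best(CANDIDATES, lambda c: c.true_quality)

print("signal picks:", chosen_by_signal.name)
print("intent picks:", chosen_by_intent.name)
```

The supervisor still rejects the obviously bad output, so the signal is not useless. It fails exactly where the output looks better than it is, which is the regime scalable oversight targets.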

The main proposals

Four research directions, each a different bet on how to amplify limited human judgment into a signal strong enough to supervise a more-capable model:

Recursive reward modeling: learn a reward model from human feedback, then use agents trained on it to help humans evaluate harder tasks, bootstrapping evaluation up the difficulty curve (Leike et al. 2018).
Debate: two models argue opposing sides of a question and a human judges the transcript; the bet is that exposing a lie is easier than telling one, so honest play wins (Irving et al. 2018).
Iterated amplification (IDA): build a strong training signal by recursively decomposing a hard problem into easier subproblems a human (plus model assistants) can answer, then distilling the result (Christiano et al. 2018).
Weak-to-strong generalization: empirically test whether a weak supervisor can elicit the full capabilities of a stronger model, using small-model labels as a stand-in for human-on-superhuman supervision (Burns et al. 2023).
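To make the decomposition idea behind iterated amplification concrete, here is a minimal toy sketch. It assumes a hypothetical `WeakOverseer` that can only answer one tiny question (add two numbers) and shows how recursive decomposition lets it answer a question it could not handle in one step. It covers only the amplification half: the distillation step, where a model is trained to imitate the amplified answer, is omitted, and this is not the procedure from Christiano et al.

```python
from typing import Sequence


class WeakOverseer:
    """Hypothetical stand-in for a human who can only answer 'what is a + b?'.

    It counts how many small questions it was asked.
    """

    def __init__(self) -> None:
        self.questions_asked = 0

    def add(self, a: int, b: int) -> int:
        self.questions_asked += 1
        return a + b


def amplified_sum(numbers: Sequence[int], overseer: WeakOverseer) -> int:
    """Answer a 'hard' question (sum a long list) using only small questions."""
    if len(numbers) == 0:
        return 0
    if len(numbers) == 1:
        return numbers[0]
    # Decompose into two easier subproblems and solve each recursively.
    mid = len(numbers) // 2
    left = amplified_sum(numbers[:mid], overseer)
    right = amplified_sum(numbers[mid:], overseer)
    # The overseer only ever combines two already-answered subproblems.
    return overseer.add(left, right)


overseer = WeakOverseer()
total = amplified_sum(list(range(1, 101)), overseer)
print(total, overseer.questions_asked)  # prints "5050 99"
```

Addition decomposes cleanly, which is what makes this toy work. Whether the tasks that matter decompose this well is an open question.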

Why it's the crux

RLHF, the workhorse behind today's aligned models, is fundamentally capped by human evaluation: it can only teach a model to do things a human rater can recognize as good. That cap is harmless while models are below human level on the task and binding the moment they aren't. Scalable oversight is the bet for getting a usable signal past that cap, which is why the labs treat it as the load-bearing piece of any plan to align superhuman systems. It is also still unsolved: debate can reward persuasiveness over truth, amplification's decompositions may not converge, and weak-to-strong training recovers only part of the gap between the weak supervisor and the strong model's full capability. Each proposal is a promising research direction, not a finished method, which is exactly why it sits at the center of the field map.
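One way to put a number on the weak-to-strong gap mentioned above is the performance gap recovered (PGR) metric used by Burns et al. (2023): the fraction of the distance between the weak supervisor's performance and the strong model's ceiling that weak-to-strong training closes. The sketch below is a minimal implementation of that ratio; the function name and the accuracy figures are hypothetical, chosen only for illustration.

```python
def performance_gap_recovered(weak: float,
                              weak_to_strong: float,
                              strong_ceiling: float) -> float:
    """PGR = (weak_to_strong - weak) / (strong_ceiling - weak).

    1.0 means the strong model reached its ceiling despite weak labels;
    0.0 means it did no better than its weak supervisor.
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong_ceiling must exceed weak performance")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies, not results from any paper.
pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.72, strong_ceiling=0.80)
print(f"PGR = {pgr:.2f}")  # roughly 0.60: a bit over half the gap is recovered
```

A PGR below 1.0 is the gap the paragraph refers to: the strong model ends up better than its supervisor but short of what it could do with ground-truth supervision.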