AI SAFETY // APPROACHES
alignment approaches · superalignment

Scalable Oversight

debate · reward modeling · amplification · weak-to-strong

Scalable oversight is the problem of training and supervising an AI system on tasks where its performance already exceeds your ability to judge whether its output is good.

The problem

Every supervised method ultimately grounds out in a human judgment: a label, a preference, a thumbs-up. That works while a human can evaluate the model's output — read the summary, check the proof, spot the bug. But the target systems are ones that write code you can't fully review, propose research you can't replicate in an afternoon, or argue policy with more context than you hold. Once the model is more competent than its supervisor, the supervisor's signal becomes the ceiling, and a model optimized against a flawed signal learns to satisfy the signal, not the intent. Scalable oversight asks: how do you produce a reliable training signal for a task you cannot directly evaluate? This is the core of superalignment.
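The failure mode described above can be made concrete with a toy sketch. Everything in it is a hypothetical illustration: the `Candidate` type, the `looks_good` and `is_good` scores, and the three example outputs are invented for this purpose and are not drawn from any real training setup. The point is only that selecting on the supervisor's signal and selecting on the intent can pick different outputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    looks_good: float  # hypothetical: what the supervisor is able to check
    is_good: float     # hypothetical: true quality, invisible to the supervisor


def supervisor_score(c: Candidate) -> float:
    """The only training signal available: the supervisor's judgment."""
    return c.looks_good


def true_quality(c: Candidate) -> float:
    """The intent we actually care about, which nobody can measure directly."""
    return c.is_good


def pick_best(candidates, score):
    """Stand-in for optimization: return the candidate that maximizes `score`."""
    return max(candidates, key=score)


# Made-up outputs a capable model might produce for one task.
CANDIDATES = [
    Candidate("correct but hard to verify", looks_good=0.6, is_good=0.9),
    Candidate("plausible but subtly wrong", looks_good=0.9, is_good=0.2),
    Candidate("obviously bad", looks_good=0.1, is_good=0.1),
]

chosen = pick_best(CANDIDATES, supervisor_score)  # what training rewards
intended = pick_best(CANDIDATES, true_quality)    # what we wanted

print("optimizing the signal picks:", chosen.name)
print("optimizing the intent picks:", intended.name)
```

With these invented numbers, optimizing the supervisor's score selects the plausible-but-wrong output, because the supervisor cannot see the property that distinguishes it from the correct one. A scalable-oversight method would aim to make `supervisor_score` track `true_quality` more closely on exactly such cases.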

The main proposals

Four research directions (debate, reward modeling, amplification, and weak-to-strong generalization) each make a different bet on how to amplify limited human judgment into a signal strong enough to supervise a more capable model.

Why it's the crux

RLHF — the workhorse behind today's aligned models — is fundamentally capped by human evaluation: it can only teach a model to do things a human rater can recognize as good. That cap is fine while models are below human level on the task, and binding the moment they aren't. Scalable oversight is the bet for getting a usable signal past that cap, which is why the labs treat it as the load-bearing piece of any plan to align superhuman systems. It is also still unsolved: debate can reward persuasiveness over truth, amplification's decompositions may not converge, and weak-to-strong leaves a real performance gap. Each proposal is a promising research direction, not a finished method — which is exactly why it sits at the center of the field map.
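The "performance gap" that weak-to-strong leaves is often quantified in that line of work as performance gap recovered (PGR): the fraction of the distance between the weak supervisor and the strong model's ceiling that weak supervision manages to close. The sketch below assumes all three numbers are measured on the same task and metric. The function name, parameter names, and example accuracies are hypothetical and are not results from any paper.

```python
def performance_gap_recovered(
    weak: float,
    weak_to_strong: float,
    strong_ceiling: float,
) -> float:
    """Fraction of the weak-to-ceiling gap recovered by weak supervision.

    weak:            performance of the weak supervisor on the task
    weak_to_strong:  performance of the strong model trained on weak labels
    strong_ceiling:  performance of the strong model trained on ground truth

    Returns 0.0 if the strong model only matches its supervisor and 1.0 if
    weak supervision fully elicits the strong model's capability.
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong_ceiling must exceed weak for PGR to be defined")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies, for illustration only.
pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.75, strong_ceiling=0.90)
print(f"PGR = {pgr:.2f}")
```

With these made-up numbers, about half the gap is recovered. A PGR well below 1 is what "leaves a real performance gap" means in practice: the strong model ends up better than its supervisor but short of what it could do with reliable labels.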