AI SAFETY // CANON

AI Safety via Debate

Irving, Christiano, Amodei · 2018

Train two agents to argue opposite sides of a question and let a human judge pick the winner — betting that checking an argument is easier than producing the right answer, so a weaker judge can still supervise a stronger system.

The idea

Two AI agents play a zero-sum game: they take turns making statements about a question or proposed action, and a human judge declares which one was more truthful and useful. The agents are trained by self-play to win that game. The paper's load-bearing intuition is complexity-theoretic: under optimal play, debate with a polynomial-time judge can decide any question in PSPACE, whereas the same judge checking a single unopposed answer (NP-style "show me the answer") can handle only questions in NP. Adversarial pressure is what is supposed to do the work: a dishonest agent's best move is exploitable by an honest opponent who can point the judge at the flaw, so the judge only has to evaluate the local step where the disagreement bottoms out, not reconstruct the whole answer.
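The "judge only checks where the disagreement bottoms out" idea can be sketched with a toy bisection debate. This is an illustration only, not the paper's protocol or one of its experiments: the question is "what is the sum of this list?", the judge is assumed able to read just one element, and the names run_debate and truthful_split are hypothetical. The claimant must break each claim into a left-half claim (the right half is then forced), and the opponent, assumed here to know the data, challenges a half whose claim is false.

```python
from typing import Callable, Dict, List

# A claimant strategy maps (lo, hi, claim) to its claimed sum for the left half.
Claimant = Callable[[int, int, int], int]


def truthful_split(data: List[int]) -> Claimant:
    """Strategy that always reports the true sum of the left half."""
    return lambda lo, hi, claim: sum(data[lo:(lo + hi) // 2])


def run_debate(data: List[int], claim: int, claimant: Claimant) -> Dict[str, object]:
    """Debate the claim `sum(data) == claim`; the judge reads one element."""
    assert data, "toy debate needs a non-empty list"
    lo, hi, rounds = 0, len(data), 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        left_claim = claimant(lo, hi, claim)
        right_claim = claim - left_claim  # sub-claims must add up to the parent claim
        # Opponent challenges the left half if its claim is false. Otherwise it
        # challenges the right half, which must be false if the parent claim was.
        if sum(data[lo:mid]) != left_claim:
            hi, claim = mid, left_claim
        else:
            lo, claim = mid, right_claim
        rounds += 1
    # The judge's entire job: compare one element against the surviving claim.
    return {"claimant_wins": data[lo] == claim, "rounds": rounds, "judge_reads": 1}


data = list(range(1, 1025))
honest_result = run_debate(data, sum(data), truthful_split(data))
lying_result = run_debate(data, sum(data) + 1, truthful_split(data))
halving_liar = run_debate(data, sum(data) + 1, lambda lo, hi, claim: claim // 2)
```

In this toy, a false top-level claim always leaves at least one false sub-claim, so the opponent can chase the error down to a single element in about log2(n) rounds, whichever way the liar splits it. It shows only the bisection intuition; the PSPACE result concerns optimal play in a much more general game.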

Why it's a scalable-oversight proposal

Scalable oversight asks: how do you train a system that knows more than its supervisor? Debate is one answer — the judge never has to generate the correct answer, only adjudicate a contested chain produced by experts incentivized to expose each other.

The open problems

The mechanism only works if the judge is reliably swayed toward truth, and that assumption is fragile.

Why it's in the canon

Debate framed scalable oversight as a game-theoretic problem and gave it a crisp complexity-theory backbone, sitting alongside iterated distillation and amplification (IDA) and recursive reward modeling as a foundational proposal for supervising superhuman systems. It remains actively researched: both its failure modes (obfuscated arguments) and its empirical viability (LLM-judge debate experiments) are live questions that years of follow-up work are still chasing.