
Weak-to-Strong Generalization

Burns et al. (OpenAI) · 2023

An empirical study of whether a weak supervisor can elicit the latent capabilities of a stronger model — a tractable stand-in for the future problem of humans overseeing superhuman AI.

The setup

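The core experiment is a deliberate capability inversion. Instead of training the strong model on ground-truth labels, the authors first fine-tune a small model on ground truth to act as the weak supervisor, then fine-tune a larger, more capable model on the labels that weak supervisor produces. The resulting weakly-supervised strong model is compared against two baselines: the weak supervisor itself, and the same strong model fine-tuned directly on ground truth (the strong ceiling).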

They evaluate across NLP classification, chess puzzles, and reward modeling, scoring with the performance gap recovered (PGR): the fraction of the gap between weak-supervisor and strong-ceiling (ground-truth-trained) performance that the weakly-supervised strong model actually closes. 0% means it learned nothing beyond the weak teacher; 100% means it matched a strong model trained on real labels.
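The PGR definition above is simple arithmetic, and a small sketch may make it concrete. The function name, argument names, and the example accuracies below are illustrative assumptions, not values or code from the paper.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap closed by the weakly-supervised strong model.

    weak            -- performance of the weak supervisor
    weak_to_strong  -- performance of the strong model trained on weak labels
    strong_ceiling  -- performance of the strong model trained on ground truth
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        # PGR is undefined when the ceiling does not exceed the weak supervisor.
        raise ValueError("strong ceiling must exceed weak supervisor performance")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies, chosen only to illustrate the arithmetic.
example_pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.72, strong_ceiling=0.80)
print(f"PGR = {example_pgr:.0%}")
```

With these made-up numbers the strong student closes 0.12 of a 0.20 gap, so a bit more than half of the gap is recovered. A value below 0 would mean the student did worse than its weak teacher, and a value above 1 would mean it beat the ground-truth-trained ceiling.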

Why it's an alignment analogy

The relation weak supervisor : strong model is proposed as an analogy for humans : future superhuman models. When AI exceeds human ability on a task, our supervision signals become the "weak labels" — error-prone, incomplete, sometimes systematically wrong. The question of whether a strong model can still do the right thing despite imperfect human oversight is the central problem of scalable oversight and superalignment. This paper turns that abstract worry into a measurable benchmark you can run today.

The findings

The results are encouraging but explicitly partial — the authors are careful not to overclaim: