Weak-to-Strong Generalization

Burns et al. (OpenAI) · 2023

Paper: Weak-to-Strong Generalization (arXiv)

An empirical study of whether a weak supervisor can elicit the latent capabilities of a stronger model — a tractable stand-in for the future problem of humans overseeing superhuman AI.

The setup

The core experiment is a deliberate capability inversion. Instead of training the strong model directly on ground-truth labels, the authors:

Fine-tune a small, weak model (e.g. GPT-2-scale) on ground-truth labels for the task, to act as the supervisor.
Generate labels from that weak supervisor — labels that contain its mistakes.
Fine-tune a much stronger model (up to GPT-4-scale) only on those flawed weak labels.
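
The three steps above can be sketched as a small pipeline. This is a minimal illustration, not the authors' code: `weak_to_strong`, `Trainer`, and `Model` are hypothetical names, the trainers are passed in as plain callables, and labeling a separate set of transfer inputs (rather than the weak model's own training set) is an assumption of this sketch.

```python
from typing import Callable, List, Sequence, Tuple

Example = Sequence[float]
Model = Callable[[Example], int]  # maps an input to a predicted class
Trainer = Callable[[List[Example], List[int]], Model]  # fits a model to (inputs, labels)


def weak_to_strong(
    train_weak: Trainer,
    train_strong: Trainer,
    weak_train: Tuple[List[Example], List[int]],
    transfer_inputs: List[Example],
) -> Tuple[Model, Model, List[int]]:
    """Hypothetical helper: run the weak-to-strong recipe end to end."""
    # Step 1: fit the weak supervisor on ground-truth labels.
    weak_inputs, weak_targets = weak_train
    weak_model = train_weak(weak_inputs, weak_targets)

    # Step 2: label the transfer inputs with the weak supervisor.
    # These labels inherit whatever mistakes the weak model makes.
    weak_labels = [weak_model(x) for x in transfer_inputs]

    # Step 3: fit the strong student on the weak labels only;
    # it never sees ground truth for the transfer inputs.
    strong_student = train_strong(transfer_inputs, weak_labels)

    return weak_model, strong_student, weak_labels
```

Keeping the trainers as arguments makes the inversion explicit: the only thing that changes between the weakly supervised run and a ground-truth baseline is which labels are handed to `train_strong`.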

They evaluate on NLP classification, chess puzzles, and reward modeling. The headline metric is performance gap recovered (PGR): the fraction of the gap between the weak supervisor's performance and the strong ceiling (the strong model trained on ground truth) that the weakly supervised strong model closes. A PGR of 0% means the strong model did no better than its weak supervisor; 100% means it matched a strong model trained on ground-truth labels.
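
As a worked example, PGR can be computed directly from three scores. The function name and the accuracy numbers below are made up for illustration and are not results from the paper.

```python
def performance_gap_recovered(
    weak: float, weak_to_strong: float, strong_ceiling: float
) -> float:
    """Fraction of the weak-to-ceiling gap closed by the weakly supervised model."""
    gap = strong_ceiling - weak
    if gap <= 0:
        # PGR is undefined when the "strong" ceiling does not beat the weak supervisor.
        raise ValueError("strong_ceiling must exceed weak performance")
    return (weak_to_strong - weak) / gap


# Illustrative accuracies only (assumed, not taken from the paper):
# weak supervisor 60%, weakly supervised strong model 72%, strong ceiling 80%.
pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.72, strong_ceiling=0.80)
print(f"PGR = {pgr:.0%}")  # prints "PGR = 60%"
```

Note that PGR can fall below 0% if the strong student does worse than its supervisor, and can exceed 100% if it beats the ceiling; the function does not clip either case.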

Why it's an alignment analogy

The relation weak supervisor : strong model is proposed as an analogy for humans : future superhuman models. When AI exceeds human ability on a task, our supervision signals become the "weak labels" — error-prone, incomplete, sometimes systematically wrong. The question of whether a strong model can still do the right thing despite imperfect human oversight is the central problem of scalable oversight and superalignment. This paper turns that abstract worry into a measurable benchmark you can run today.

The findings

The results are encouraging but explicitly partial — the authors are careful not to overclaim:

Strong models generalize beyond their weak supervisor's errors: weakly-supervised GPT-4-scale models often beat the weak teacher, recovering a substantial fraction of the gap on NLP tasks rather than just imitating the mistakes.
Recovery is far from complete, especially for reward modeling, and naive fine-tuning leaves a lot of latent capability unrecovered.
Simple auxiliary methods help — e.g. an auxiliary confidence loss that encourages the strong model to make confident predictions even when they disagree with the weak labels — sometimes lifting PGR substantially.
This is a research testbed, not a solution. The analogy has known disanalogies, the gains are method- and task-dependent, and weak-to-strong generalization is presented as a problem to make progress on, not one that is solved.
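
To make the auxiliary confidence loss mentioned above more concrete, here is one way such a loss could look for binary classification. It is a sketch under stated assumptions, not the paper's exact formulation: the function name, the fixed `alpha` weight, and the fixed 0.5 `threshold` are all choices made for this illustration. The idea is to mix the usual cross-entropy against the weak label with a cross-entropy against the strong model's own hardened prediction.

```python
import math
from typing import Sequence


def _bce(p: float, target: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy for one predicted probability and one 0/1 target."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def confidence_loss(
    strong_probs: Sequence[float],
    weak_labels: Sequence[float],
    alpha: float = 0.5,       # assumed weight on the self-confidence term
    threshold: float = 0.5,   # assumed cutoff for hardening predictions
) -> float:
    """Mean of (1 - alpha) * CE(strong, weak label) + alpha * CE(strong, own hard label)."""
    if not strong_probs or len(strong_probs) != len(weak_labels):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, y_weak in zip(strong_probs, weak_labels):
        # The strong model's own prediction, rounded to a hard 0/1 label.
        # In a real training loop this target would be treated as a constant.
        hardened = 1.0 if p > threshold else 0.0
        total += (1.0 - alpha) * _bce(p, y_weak) + alpha * _bce(p, hardened)
    return total / len(strong_probs)
```

With `alpha=0` this reduces to ordinary training on the weak labels. As `alpha` grows, a confident prediction that contradicts the weak label is penalized less, which is the behavior the bullet above describes.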