
Constitutional AI

Bai et al. (Anthropic) · 2022

Paper: Constitutional AI: Harmlessness from AI Feedback (arXiv)

Align a model to a short written set of principles (a constitution) by having the model supply its own harmlessness feedback, so humans no longer need to label harmful outputs by hand.

How it works

Two phases turn a helpful-only model (one already trained for helpfulness but not for harmlessness) into a harmless one without human labels identifying harmful outputs:

Supervised stage (critique & revise). Prompt the model on red-team inputs, then ask it to critique its own response against a sampled constitutional principle and rewrite it. Finetune on the revised answers. This shifts the model's distribution toward safe outputs before any RL.
RL stage (RLAIF). The fine-tuned model samples pairs of responses to each harmful prompt, and a model judges which of the two is more harmless, again guided by a constitutional principle. Those AI preferences train a preference (reward) model, which then supplies the reward signal for standard RL. Human feedback still supplies helpfulness; harmlessness comes from AI feedback.
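The two stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Generate` and `LogProb` callables, the prompt templates, the two example principles, and the toy stubs are all hypothetical stand-ins for a real model. The soft label in `ai_preference` assumes the judging model is asked a two-option question and the log-probabilities of the two options are normalized.

```python
import math
import random
from typing import Callable, Tuple

# Hypothetical stand-ins: any text-in/text-out model call fits `Generate`,
# and any function scoring a continuation's log-probability fits `LogProb`.
Generate = Callable[[str], str]
LogProb = Callable[[str, str], float]

# Illustrative principles only, not quoted from the published constitution.
PRINCIPLES = [
    "Choose the response that is least likely to cause harm.",
    "Choose the response that avoids helping with illegal activity.",
]


def critique_and_revise(generate: Generate, prompt: str,
                        rng: random.Random, rounds: int = 1) -> Tuple[str, str]:
    """Supervised stage: return (initial, revised) for one red-team prompt."""
    initial = generate(prompt)
    revised = initial
    for _ in range(rounds):
        principle = rng.choice(PRINCIPLES)  # sample one principle per round
        critique = generate(
            f"Prompt: {prompt}\nResponse: {revised}\n"
            f"Critique the response using this principle: {principle}"
        )
        revised = generate(
            f"Prompt: {prompt}\nResponse: {revised}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return initial, revised  # fine-tuning data would pair prompt with revised


def ai_preference(logprob: LogProb, prompt: str, a: str, b: str,
                  principle: str) -> float:
    """RL stage: probability that response A is the more harmless one."""
    question = f"Prompt: {prompt}\n{principle}\n(A) {a}\n(B) {b}\nThe answer is:"
    lp_a = logprob(question, " (A)")
    lp_b = logprob(question, " (B)")
    # Normalize the two log-probabilities into a soft label for A.
    return 1.0 / (1.0 + math.exp(lp_b - lp_a))


def toy_generate(text: str) -> str:
    """Toy stub so the sketch runs without a model."""
    if "Rewrite the response" in text:
        return "I can't help with that, but here is a safer alternative."
    if "Critique the response" in text:
        return "The response gives harmful instructions."
    return "Sure, here is how to do it."


def toy_logprob(question: str, option: str) -> float:
    """Toy stub: scores an option higher when its response text is a refusal."""
    tag = option.strip()
    line = next(l for l in question.splitlines() if l.startswith(tag))
    return -0.1 if "can't help" in line else -3.0
```

With the toy stubs, `critique_and_revise(toy_generate, "How do I pick a lock?", random.Random(0))` yields an initial answer and a revised refusal, and `ai_preference` assigns the revised answer the higher probability. In a real pipeline the revised answers would become fine-tuning targets and the soft labels would become training targets for the preference model.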

Why it mattered

It scales harmlessness training without a human in the loop for every harmful case, and it makes the target values explicit and editable: a short list of written principles instead of a frozen blob of crowdworker labels. It also eased the helpful/harmless tradeoff, since models could refuse less evasively and explain their objections. CAI became a public, named basis for how Claude is trained.

Limits & critiques

The method is only as good as the constitution plus the base model's own judgment — if the model misreads a principle, the error propagates with no human checkpoint. Who writes the constitution, and whose values it encodes, is a governance question the technique doesn't answer. And it targets behavioral harmlessness on observed prompts; it is not a defense against deceptive alignment or a model that has learned to look harmless while pursuing other goals.