Constitutional AI
Align a model to a short written set of principles — a constitution — by having the model supply its own harmlessness feedback, so humans no longer need to label every harmful prompt by hand.
How it works
Two phases turn a helpful-but-unaligned model into a harmless one using almost no human harm labels:
- Supervised stage (critique & revise). Prompt the model on red-team inputs, then ask it to critique its own response against a sampled constitutional principle and rewrite it. Finetune on the revised answers. This shifts the model's distribution toward safe outputs before any RL.
- RL stage (RLAIF). The model generates pairs of responses and labels which is more harmless — again guided by the constitution. Those AI preferences train a preference (reward) model, which then drives standard RL. Human feedback still supplies helpfulness; harmlessness comes from AI feedback.
Why it mattered
It scales harmlessness without a human in the loop for every harmful case, and it makes the target values explicit and editable — a few dozen written principles instead of a frozen blob of crowdworker labels. It also reduced the helpful/harmless tradeoff: models could refuse less evasively and explain their objections. CAI became a public, named basis for how Claude is trained.
Limits & critiques
The method is only as good as the constitution plus the base model's own judgment — if the model misreads a principle, the error propagates with no human checkpoint. Who writes the constitution, and whose values it encodes, is a governance question the technique doesn't answer. And it targets behavioral harmlessness on observed prompts; it is not a defense against deceptive alignment or a model that has learned to look harmless while pursuing other goals.