AI SAFETY // CANON
← back to the map
the canonpaper · 2022

Constitutional AI

Bai et al. (Anthropic) · 2022

Align a model to a short written set of principles — a constitution — by having the model supply its own harmlessness feedback, so humans no longer need to label every harmful prompt by hand.

How it works

Two phases turn a helpful-but-unaligned model into a harmless one using almost no human harm labels:

Why it mattered

It scales harmlessness without a human in the loop for every harmful case, and it makes the target values explicit and editable — a few dozen written principles instead of a frozen blob of crowdworker labels. It also reduced the helpful/harmless tradeoff: models could refuse less evasively and explain their objections. CAI became a public, named basis for how Claude is trained.

Limits & critiques

The method is only as good as the constitution plus the base model's own judgment — if the model misreads a principle, the error propagates with no human checkpoint. Who writes the constitution, and whose values it encodes, is a governance question the technique doesn't answer. And it targets behavioral harmlessness on observed prompts; it is not a defense against deceptive alignment or a model that has learned to look harmless while pursuing other goals.