
Circuits & Induction Heads

Elhage, Olsson et al. (Anthropic) · 2021–22

A reverse-engineering toolkit that treats a transformer not as a black box but as a sum of small, human-legible algorithms you can read off the weights — and a concrete one it found, the induction head.

The framework

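The core move is to view the model as a residual stream: a shared communication bus that every layer reads from and writes to. Attention heads and MLPs don't transform the stream in place; they add their outputs into it, so the network's output can be written as a sum of end-to-end paths from tokens to logits (most cleanly in the small attention-only models the paper analyzes). Each attention head factors into two near-independent pieces: a QK circuit, which determines which tokens the head attends to, and an OV circuit, which determines what information is moved from an attended token into the stream.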
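To make the additive picture concrete, here is a minimal NumPy sketch of a one-layer, attention-only toy model. Everything in it is an assumption for illustration: the sizes, the random weights, and the helper names (causal_softmax, attention_pattern, head_output) are hypothetical and not taken from the paper's code. It separates each head into the usual QK part (where to attend) and OV part (what to move), then checks that the logits equal the direct embed-to-unembed path plus one path per head.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, chosen only for illustration.
n_tokens, d_model, d_head, n_heads, vocab = 5, 8, 4, 2, 11

tokens = rng.integers(0, vocab, size=n_tokens)
W_E = rng.normal(size=(vocab, d_model))   # token embedding
W_U = rng.normal(size=(d_model, vocab))   # unembedding to logits

# Random per-head weights; a trained model would have learned these.
shapes = {"W_Q": (d_model, d_head), "W_K": (d_model, d_head),
          "W_V": (d_model, d_head), "W_O": (d_head, d_model)}
heads = [{name: 0.3 * rng.normal(size=shape) for name, shape in shapes.items()}
         for _ in range(n_heads)]


def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions <= i."""
    mask = np.tril(np.ones(scores.shape, dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def attention_pattern(x, head):
    """QK circuit: decides WHERE each position attends."""
    W_QK = head["W_Q"] @ head["W_K"].T          # (d_model, d_model)
    return causal_softmax(x @ W_QK @ x.T / np.sqrt(d_head))


def head_output(x, head):
    """OV circuit: decides WHAT is moved from the attended positions."""
    W_OV = head["W_V"] @ head["W_O"]            # (d_model, d_model)
    return attention_pattern(x, head) @ x @ W_OV


# Forward pass: every head reads the same stream and ADDS its output to it.
x0 = W_E[tokens]                                # initial residual stream
resid = x0.copy()
for head in heads:
    resid = resid + head_output(x0, head)
logits = resid @ W_U

# The same logits, written as a sum of end-to-end paths.
direct_path = x0 @ W_U
head_paths = [head_output(x0, head) @ W_U for head in heads]
assert np.allclose(logits, direct_path + sum(head_paths))
```

Because the heads only add into the stream, each term in the final sum can be inspected on its own. This sketch leaves out MLPs, layer norm, and positional information, all of which complicate the clean decomposition.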

Induction heads

The framework's flagship discovery, and the simplest non-trivial circuit it isolates. An induction head implements a two-step algorithm: find the previous place the current token appeared, then copy whatever came right after it. Formally, given a sequence matching [A][B]…[A], it raises the probability of [B] next. The circuit needs two heads in different layers, so it cannot form in a one-layer model and first appears in 2-layer models: one head writes "the token before me was X," and a second head uses that to attend back and copy forward.

The follow-up work, In-context Learning and Induction Heads (Olsson et al., 2022), argues these heads are a primary mechanism behind in-context learning: they form in a sharp training phase change that coincides with the model suddenly getting good at few-shot tasks, and ablating them degrades that ability.
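The [A][B]…[A] → [B] rule can be written out symbolically. The sketch below is a hand-written caricature of the behavior, not the learned circuit: a real induction head uses soft attention over vectors and only raises the probability of [B], whereas this version does exact matching on tokens and returns a single answer. The function names previous_token_head and induction_head are hypothetical labels for the two steps.

```python
def previous_token_head(tokens):
    """Step 1: each position records which token came immediately before it."""
    return [None] + list(tokens[:-1])


def induction_head(tokens):
    """Step 2: from the last position, look back for the most recent position
    whose recorded previous token equals the current token, and copy the token
    found there. Returns None if the current token has not appeared before."""
    if not tokens:
        return None
    prev = previous_token_head(tokens)
    current = tokens[-1]
    for pos in range(len(tokens) - 1, 0, -1):
        if prev[pos] == current:
            return tokens[pos]
    return None


print(induction_head(["The", "cat", "sat", ".", "The"]))  # → cat
print(induction_head(["a", "b", "c"]))                    # → None
```

The split into two functions mirrors why the circuit needs two layers: the match in step 2 is made against information that step 1 had to write into the stream first.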

Why it matters

This is the existence proof for mechanistic interpretability: it showed that real, human-understandable algorithms can be recovered from raw learned weights — not just correlated with behavior, but identified and ablation-tested. It grounds the claim that models can in principle be audited at the level of "what computation is this doing," which is what safety needs from interpretability. The caveat is scale: these circuits were mapped in tiny (1–2 layer, toy) models. Reading a frontier model the same way — with superposition, polysemantic neurons, and millions of interacting features — remains the open hard problem.