interpretabilitythe program

Mechanistic Interpretability

reverse-engineering the weights

SourceOlah et al., “Zoom In: An Introduction to Circuits” (Distill, 2020) ↗

Mechanistic interpretability is the project of treating a trained network's weights as a compiled program and decompiling it — recovering the human-understandable algorithms it learned, rather than just observing what it outputs.

The goal

Behavioral testing tells you what a model does on the inputs you tried; it says nothing about the mechanism, so it can't rule out a model that behaves on your evals and misbehaves elsewhere. Mechanistic interpretability aims for the inside view: identify the actual internal computation — which units represent what, and how they combine — so claims about a model rest on its wiring, not on a finite sample of its behavior.

Core concepts

The Zoom In agenda rests on three claims that, if true, make the network legible:

Features — meaningful directions in activation space (a curve detector, a "dog ear," a sentiment direction). The bet is that these are real units of computation, not analyst projections. In practice features are often tangled by superposition, where a layer packs more features than it has neurons.
Circuits — computational subgraphs that connect features through the weights, implementing a legible algorithm (e.g. assembling curves into a circle). See circuits for the worked examples.
Universality — analogous features and circuits recur across different models and tasks, so findings generalize rather than being one-offs.

Why it matters for safety

If hidden goals or deceptive alignment leave a mechanistic fingerprint, circuit-level analysis could surface it even when behavioral evals are clean — an inside-view check on whether a model's stated reasoning matches its actual computation. That is the load-bearing safety case: a lie detector that reads the wiring, not the answers.

The honest status

This is promising but immature. Most rigorous results are on small models or narrow circuits; no one can currently decompile a frontier model end-to-end, and coverage of any large model is partial and labor-intensive. Superposition and scale remain open obstacles. Treat mechanistic interpretability as a serious bet on a hard problem — a direction with real early wins, not a deployed audit you can rely on today.