Mechanistic Interpretability
Mechanistic interpretability is the project of treating a trained network's weights as a compiled program and decompiling it — recovering the human-understandable algorithms it learned, rather than just observing what it outputs.
The goal
Behavioral testing tells you what a model does on the inputs you tried; it says nothing about the mechanism, so it can't rule out a model that behaves on your evals and misbehaves elsewhere. Mechanistic interpretability aims for the inside view: identify the actual internal computation — which units represent what, and how they combine — so claims about a model rest on its wiring, not on a finite sample of its behavior.
Core concepts
The Zoom In agenda rests on three claims that, if true, make the network legible:
- Features — meaningful directions in activation space (a curve detector, a "dog ear," a sentiment direction). The bet is that these are real units of computation, not analyst projections. In practice features are often tangled by superposition, where a layer packs more features than it has neurons.
- Circuits — computational subgraphs that connect features through the weights, implementing a legible algorithm (e.g. assembling curves into a circle). See circuits for the worked examples.
- Universality — analogous features and circuits recur across different models and tasks, so findings generalize rather than being one-offs.
Why it matters for safety
If hidden goals or deceptive alignment leave a mechanistic fingerprint, circuit-level analysis could surface it even when behavioral evals are clean — an inside-view check on whether a model's stated reasoning matches its actual computation. That is the load-bearing safety case: a lie detector that reads the wiring, not the answers.
The honest status
This is promising but immature. Most rigorous results are on small models or narrow circuits; no one can currently decompile a frontier model end-to-end, and coverage of any large model is partial and labor-intensive. Superposition and scale remain open obstacles. Treat mechanistic interpretability as a serious bet on a hard problem — a direction with real early wins, not a deployed audit you can rely on today.