
Mechanistic Interpretability

reverse-engineering the weights

Mechanistic interpretability is the project of treating a trained network's weights as a compiled program and decompiling it — recovering the human-understandable algorithms it learned, rather than just observing what it outputs.

The goal

Behavioral testing tells you what a model does on the inputs you tried; it says nothing about the mechanism, so it can't rule out a model that behaves on your evals and misbehaves elsewhere. Mechanistic interpretability aims for the inside view: identify the actual internal computation — which units represent what, and how they combine — so claims about a model rest on its wiring, not on a finite sample of its behavior.
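A minimal sketch of that gap, using two plain Python functions as stand-ins for models. Everything here is hypothetical (the names `honest_model`, `shortcut_model`, `EVAL_INPUTS`, and the parity task are invented for illustration), but it shows how a finite eval can pass two different mechanisms:

```python
# Hypothetical toy "models": plain functions, not real networks.
EVAL_INPUTS = [0, 1, 2, 3, 4, 5, 6, 7]  # the only inputs the evaluator tried


def honest_model(x: int) -> int:
    """Computes parity by the intended mechanism on every input."""
    return x % 2


def shortcut_model(x: int) -> int:
    """Matches parity only on the range the evaluator happened to test."""
    if x < 8:
        return x % 2
    return 0  # behavior diverges outside the tested range


def passes_behavioral_eval(model) -> bool:
    """Behavioral test: compare outputs to the target on the eval inputs only."""
    return all(model(x) == x % 2 for x in EVAL_INPUTS)


if __name__ == "__main__":
    print(passes_behavioral_eval(honest_model))    # both models pass the eval
    print(passes_behavioral_eval(shortcut_model))
    print(honest_model(9), shortcut_model(9))      # but they disagree off the eval set
```

The eval cannot tell the two apart; only reading the function bodies (the analogue of the wiring) reveals the branch on `x < 8`.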

Core concepts

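The Zoom In agenda rests on three claims that, if true, make the network legible: features are the basic units of a network and correspond to directions in activation space; features are connected by weights to form circuits; and analogous features and circuits recur across different models and tasks (universality).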

Why it matters for safety

If hidden goals or deceptive alignment leave a mechanistic fingerprint, circuit-level analysis could surface it even when behavioral evals are clean — an inside-view check on whether a model's stated reasoning matches its actual computation. That is the load-bearing safety case: a lie detector that reads the wiring, not the answers.
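As a toy illustration of what "reading the wiring" could look like, the sketch below hand-writes a two-unit ReLU network in which one hidden unit never fires on the eval set yet has a large outgoing weight. The weights, the eval set, and the helper `dormant_but_influential` are all assumptions made up for this example; real circuit analysis on trained models is far harder than this.

```python
# Hand-written toy weights (hypothetical; not taken from any trained model).
# Input x[0] is the task signal, x[1] is a rare "trigger" feature.
W1 = [[1.0, 0.0],   # hidden unit 0 reads only the task signal
      [0.0, 1.0]]   # hidden unit 1 reads only the trigger
B1 = [0.0, -0.5]    # unit 1 stays silent unless the trigger exceeds 0.5
W2 = [1.0, -10.0]   # unit 1 strongly suppresses the output when it fires

# Eval inputs never set the trigger feature.
EVAL_SET = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]]


def hidden(x):
    """ReLU hidden activations for input x."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W1, B1)]


def output(x):
    """Scalar output: weighted sum of hidden activations."""
    return sum(w * h for w, h in zip(W2, hidden(x)))


def dormant_but_influential(eval_inputs, min_weight=1.0):
    """Indices of hidden units that never fire on the eval inputs
    but have an outgoing weight of magnitude at least min_weight."""
    peak = [max(hidden(x)[i] for x in eval_inputs) for i in range(len(W2))]
    return [i for i in range(len(W2))
            if peak[i] == 0.0 and abs(W2[i]) >= min_weight]


if __name__ == "__main__":
    print([output(x) for x in EVAL_SET])      # outputs on the eval set
    print(output([1.0, 1.0]))                 # output when the trigger is present
    print(dormant_but_influential(EVAL_SET))  # units flagged by the weight inspection
```

On the eval set the output simply tracks the task signal, so behavioral checks look clean; the weight inspection flags the silent unit without ever needing an input that sets it off.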

The honest status

This is promising but immature. Most rigorous results are on small models or narrow circuits; no one can currently decompile a frontier model end-to-end, and coverage of any large model is partial and labor-intensive. Superposition and scale remain open obstacles. Treat mechanistic interpretability as a serious bet on a hard problem — a direction with real early wins, not a deployed audit you can rely on today.