interpretability · dictionary learning

Sparse Autoencoders

Anthropic · Towards / Scaling Monosemanticity · 2023–24

Source: Towards Monosemanticity (Anthropic)

A sparse autoencoder (SAE) is a dictionary-learning method that decomposes a model's dense activations into a much larger set of sparse, mostly monosemantic features — pulling concepts back out of superposition.

The idea

Neurons are often polysemantic: a single neuron fires for many unrelated concepts because the model represents more features than it has dimensions, a phenomenon called superposition. An SAE attacks this directly:

Train an autoencoder on a layer's activations (e.g. the residual stream) with an overcomplete hidden layer — far more units than input dimensions.
Enforce sparsity, typically with an L1 penalty on the hidden activations, so only a handful of hidden units fire per input. Each one is pushed to specialize.
Reconstruct the original activation from that sparse code. The learned dictionary directions are the features.
The 2023 paper trained on ~8B MLP activations from a one-layer transformer, learning thousands of features; human evaluation judged a large majority to be cleanly interpretable single concepts (for example Arabic script, DNA sequences, or specific tokens).
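The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration of one common SAE formulation (ReLU encoder, linear decoder with unit-norm dictionary rows, squared-error plus L1 loss), not the papers' exact setup. The sizes, the `l1_coeff` value, and the randomly initialized weights are all assumptions for the demo; a real SAE would learn the weights by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_hidden = 16, 64   # overcomplete: 4x more hidden units than inputs (toy sizes)
l1_coeff = 1e-3              # hypothetical sparsity weight

# Hypothetical, untrained parameters; training would fit these to real activations.
W_enc = rng.normal(0.0, 0.1, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(0.0, 0.1, size=(d_hidden, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm dictionary rows
b_dec = np.zeros(d_model)

def encode(x):
    """Map activations to non-negative feature activations (ReLU)."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

def decode(f):
    """Reconstruct activations as a weighted sum of dictionary directions."""
    return f @ W_dec + b_dec

def sae_loss(x):
    """Return (total loss, reconstruction term, L1 term) for a batch."""
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))   # reconstruction error
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))       # L1 penalty on the code
    return recon + l1_coeff * sparsity, recon, sparsity

x = rng.normal(size=(8, d_model))   # stand-in for a batch of layer activations
total, recon, sparsity = sae_loss(x)
```

Because the weights here are random, the code is not yet sparse; it is minimizing `sae_loss` over many activations that drives most entries of `encode(x)` to zero. Each row of `W_dec` is one dictionary direction, that is, one candidate feature.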

What it found

Learned features map to human-legible concepts, and you can verify them: a feature activates on the concept's instances and on abstract discussion of it. Crucially, features are causal handles — clamping one up or down ("feature steering") changes generation in the predicted direction.
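Feature steering can be illustrated with a small NumPy sketch. It assumes a decoder matrix `W_dec` with unit-norm rows and a vector of feature activations `f` from the encoder; both are random stand-ins here, and the function name `steer` is hypothetical. The idea shown is to clamp one feature to a target value by adding the difference, scaled along that feature's decoder direction, to the original activation. How the edited activation is fed back into the model's forward pass is outside this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64  # toy sizes (assumption)

# Stand-in dictionary: one unit-norm direction per feature.
W_dec = rng.normal(size=(d_hidden, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

def steer(x, f, feature_idx, target):
    """Clamp one feature to `target` and return the edited activation.

    Adds (target - current activation) times the feature's decoder
    direction to x, leaving every other component of x unchanged.
    """
    delta = target - f[..., feature_idx]
    return x + delta[..., None] * W_dec[feature_idx]

x = rng.normal(size=(4, d_model))            # stand-in activations
f = np.abs(rng.normal(size=(4, d_hidden)))   # stand-in for encoder output
x_steered = steer(x, f, feature_idx=7, target=10.0)
```

Editing `x` directly, rather than replacing it with the SAE's reconstruction, is a design choice in this sketch: it keeps whatever signal the dictionary failed to capture.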

Scaling Monosemanticity (2024) showed that the approach scales to a production model, Claude 3 Sonnet, by training SAEs with up to ~34M features on a middle-layer residual stream. It surfaced multilingual and multimodal features and, notably, safety-relevant ones: deception, power-seeking, sycophancy, bias, and unsafe code. The well-known "Golden Gate Bridge" feature came from this work.

Why it matters & limits

SAEs are the current workhorse for turning opaque activations into a vocabulary of named, steerable directions: a scalable bridge from "a vector fired" to "the model is thinking about X." But be precise about the limits:

Coverage is partial. A dictionary captures the features it was given capacity and data to find; rare or fine-grained concepts get missed or split across multiple units.
Features are imperfect. Reconstruction error means some signal is dropped, and "monosemantic" is a strong tendency, not a guarantee.
Not yet a full audit. Finding a deception feature is not the same as proving you've found all deception-related computation, or that steering it is robust. SAEs locate directions; they don't certify behavior.
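Some of these limits can be quantified. The sketch below computes three diagnostics that are commonly reported for SAEs: the fraction of variance left unexplained by the reconstruction, the mean number of active features per input (L0), and the fraction of features that never fire on the batch. The function name, the dictionary keys, and the toy arrays are assumptions for illustration, not values from either paper.

```python
import numpy as np

def sae_diagnostics(x, x_hat, f, eps=1e-12):
    """Summarize reconstruction quality and sparsity for one batch.

    x      : (batch, d_model) original activations
    x_hat  : (batch, d_model) SAE reconstructions
    f      : (batch, d_hidden) feature activations
    """
    resid = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return {
        "fvu": resid / (total + eps),                    # fraction of variance unexplained
        "mean_l0": np.mean(np.sum(f > 0, axis=-1)),      # active features per input
        "dead_fraction": np.mean(np.all(f <= 0, axis=0)) # features that never fire
    }

# Toy data: the reconstruction is exact except for one entry.
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
x_hat = x.copy()
x_hat[0, 0] = 0.5
f = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]])
stats = sae_diagnostics(x, x_hat, f)
```

Good numbers on these metrics say that the dictionary reconstructs activations well and sparsely; they do not say that every relevant computation has been found.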