Sparse Autoencoders
A sparse autoencoder (SAE) is a dictionary-learning method that decomposes a model's dense activations into a much larger set of sparse, mostly monosemantic features — pulling concepts back out of superposition.
The idea
Individual neurons are often polysemantic: a single activation dimension fires for several unrelated concepts, because the model represents more features than it has dimensions (superposition). An SAE attacks this directly:
- Train an autoencoder on a layer's activations (e.g. the residual stream) with an overcomplete hidden layer — far more units than input dimensions.
- Enforce sparsity, typically with an L1 penalty on the hidden activations, so that only a handful of hidden units fire per input. Each unit is pushed to specialize.
- Reconstruct the original activation from that sparse code. The learned dictionary directions are the features.
As a concrete result, Towards Monosemanticity (2023) trained SAEs on ~8B activations from a one-layer transformer and learned thousands of features; human raters judged a large majority to be cleanly interpretable single concepts (Arabic script, DNA sequences, specific tokens).
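The recipe above can be sketched in a few lines of NumPy. This is a minimal illustration, not the setup from any particular paper: the sizes, the `l1_coeff` value, the random initialization, and the names `encode`, `decode`, and `sae_loss` are all assumptions chosen for the example, and the parameters are left untrained.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16      # width of the activation being decomposed (assumed)
d_hidden = 64     # overcomplete dictionary: 4x more units than inputs (assumed)
l1_coeff = 1e-3   # weight of the sparsity penalty (assumed; normally tuned)

# Randomly initialized parameters; a real SAE would learn these by gradient descent.
W_enc = rng.normal(0.0, 0.1, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(0.0, 0.1, size=(d_hidden, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    """Map activations (batch, d_model) to non-negative feature activations."""
    # Subtracting the decoder bias first is one common convention, not a requirement.
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

def decode(f):
    """Rebuild activations as a weighted sum of dictionary directions (rows of W_dec)."""
    return f @ W_dec + b_dec

def sae_loss(x):
    """Mean squared reconstruction error plus an L1 penalty on the feature activations."""
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity, f, x_hat

x = rng.normal(size=(8, d_model))   # stand-in for a batch of real activations
loss, f, x_hat = sae_loss(x)
```

Training would minimize `loss` over many batches of real activations. Raising `l1_coeff` trades reconstruction accuracy for fewer active features per input.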
What it found
Learned features map to human-legible concepts, and you can verify them: a feature activates on the concept's instances and on abstract discussion of it. Crucially, features are causal handles — clamping one up or down ("feature steering") changes generation in the predicted direction.
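One way to picture feature steering: encode an activation, overwrite a single feature's value, decode, and add back the part of the activation the SAE failed to reconstruct. The sketch below uses a random, untrained dictionary purely to show the mechanics. `steer`, `feature_idx`, and `clamp_value` are hypothetical names, and in a real experiment the edited activation would be written back into the model's forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64   # illustrative sizes (assumed)

# Untrained stand-in parameters; a real run would load a trained SAE.
W_enc = rng.normal(0.0, 0.1, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(0.0, 0.1, size=(d_hidden, d_model))
b_dec = np.zeros(d_model)

def encode(x):
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(f):
    return f @ W_dec + b_dec

def steer(x, feature_idx, clamp_value):
    """Clamp one feature to a fixed value and return the edited activation."""
    f = encode(x)
    error = x - decode(f)                  # signal the SAE does not capture
    f_steered = f.copy()
    f_steered[..., feature_idx] = clamp_value
    return decode(f_steered) + error       # keep the unexplained part intact

x = rng.normal(size=(4, d_model))          # stand-in for real activations
x_steered = steer(x, feature_idx=3, clamp_value=5.0)

# The edit shifts each activation along feature 3's decoder direction only.
delta = x_steered - x
expected = (5.0 - encode(x)[:, 3])[:, None] * W_dec[3]
```

Adding the reconstruction error back means the intervention changes only the chosen feature's contribution, so any change in behavior is not simply an artifact of imperfect reconstruction.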
Scaling Monosemanticity (2024) showed that this works on a production model, Claude 3 Sonnet, by training SAEs with up to ~34M features on a middle-layer residual stream. It surfaced multilingual, multimodal features and, notably, safety-relevant ones: deception, power-seeking, sycophancy, bias, and unsafe code. The well-known "Golden Gate Bridge" feature came from this work.
Why it matters & limits
SAEs are the current workhorse for turning opaque activations into a vocabulary of named, steerable directions — the first scalable bridge from "a vector fired" to "the model is thinking about X." But be precise about the limits:
- Coverage is partial. A dictionary captures the features it was given capacity and data to find; rare or fine-grained concepts get missed or split across multiple units.
- Features are imperfect. Reconstruction error means some of the original signal is dropped, and "monosemantic" is a strong tendency, not a guarantee.
- Not yet a full audit. Finding a deception feature is not the same as proving you've found all deception-related computation, or that steering it is robust. SAEs locate directions; they don't certify behavior.
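Two quantities often used to put numbers on these limits are the fraction of variance the reconstruction leaves unexplained and the average number of active features per input (L0). A minimal sketch, assuming you already have arrays of original activations `x`, reconstructions `x_hat`, and feature activations `f`; the function names and the toy arrays are made up for illustration.

```python
import numpy as np

def fraction_variance_unexplained(x, x_hat):
    """Squared reconstruction error relative to the variance of the activations."""
    residual = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return residual / total

def mean_l0(f, eps=0.0):
    """Average number of features with activation above eps, per input."""
    return np.mean(np.sum(f > eps, axis=-1))

# Toy data: 4 inputs, 2 activation dimensions, 3 dictionary features.
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
x_hat = np.array([[0.9, 0.0], [0.0, 0.9], [1.0, 1.0], [0.1, 0.0]])
f = np.array([[0.9, 0.0, 0.0],
              [0.0, 0.9, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 0.1]])

fvu = fraction_variance_unexplained(x, x_hat)
l0 = mean_l0(f)
```

A low unexplained fraction with a low L0 suggests the dictionary is both faithful and sparse, but neither number says anything about whether the features are complete or whether steering them is robust.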