AI SAFETY // INTERPRETABILITY
interpretability · dictionary learning

Sparse Autoencoders

Anthropic · Towards / Scaling Monosemanticity · 2023–24

A sparse autoencoder (SAE) is a dictionary-learning method that decomposes a model's dense activations into a much larger set of sparse, mostly monosemantic features — pulling concepts back out of superposition.

The idea

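Neurons are polysemantic: a single activation dimension fires for many unrelated concepts because the model packs more features than it has dimensions (superposition). An SAE attacks this directly: it re-expresses each activation vector as a sparse combination of directions drawn from a learned dictionary that is much larger than the activation's own dimension, so that each direction can stand for a single concept.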
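A minimal sketch of one common SAE formulation may make this concrete: a ReLU encoder into a wider feature space, a linear decoder whose rows act as the dictionary, and a loss that trades reconstruction error against an L1 sparsity penalty. The sizes, the `l1_coeff` value, the random weights, and the function names are illustrative assumptions, not details taken from the papers; a real SAE learns its weights by gradient descent on activations collected from the model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the dictionary (n_features) is wider than the activation (d_model).
d_model, n_features = 8, 32
l1_coeff = 1e-3  # hypothetical sparsity weight

# Randomly initialised parameters; training would update these.
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm dictionary directions
b_dec = np.zeros(d_model)


def encode(x):
    """Map activations (batch, d_model) to non-negative features (batch, n_features)."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)


def decode(f):
    """Rebuild activations as a weighted sum of dictionary directions."""
    return f @ W_dec + b_dec


def sae_loss(x):
    """Reconstruction error plus an L1 penalty that pushes most features to zero."""
    f = encode(x)
    x_hat = decode(f)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1_coeff * sparsity, f, x_hat


x = rng.normal(size=(4, d_model))  # stand-in for a batch of model activations
loss, f, x_hat = sae_loss(x)
```

With untrained weights the features here are neither sparse nor meaningful; the sketch only shows the shape of the computation that training then optimises.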

What it found

Learned features map to human-legible concepts, and you can verify them: a feature activates on the concept's instances and on abstract discussion of it. Crucially, features are causal handles — clamping one up or down ("feature steering") changes generation in the predicted direction.
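Feature steering can be sketched as a small edit to the activation: encode it, overwrite one feature's value, decode, and add back whatever the SAE failed to reconstruct so that only the chosen feature's contribution changes. The weights, sizes, feature index, and the `clamp_feature` helper below are hypothetical stand-ins for illustration, not the papers' code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32  # hypothetical sizes

# Stand-in SAE parameters; a real run would load trained weights.
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_dec = np.zeros(d_model)


def encode(x):
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)


def decode(f):
    return f @ W_dec + b_dec


def clamp_feature(x, idx, value):
    """Return activations with feature `idx` forced to `value`, all else preserved."""
    f = encode(x)
    error = x - decode(f)          # part of x the SAE does not reconstruct
    f_clamped = f.copy()
    f_clamped[:, idx] = value      # clamp the chosen feature up or down
    return decode(f_clamped) + error


x = rng.normal(size=(4, d_model))  # stand-in for a batch of model activations
x_steered = clamp_feature(x, idx=5, value=10.0)
```

Algebraically this shifts each activation along the decoder direction `W_dec[idx]` by the gap between the clamped value and the feature's original value. In an actual experiment the steered activation would be written back into the model's forward pass at the layer the SAE was trained on.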

Scaling Monosemanticity (2024) showed this works on a production model, Claude 3 Sonnet, training SAEs with up to ~34M features on the middle-layer residual stream. It surfaced multilingual, multimodal features and, notably, safety-relevant ones: deception, power-seeking, sycophancy, bias, and unsafe code. The well-known "Golden Gate Bridge" feature came from this work.

Why it matters & limits

SAEs are the current workhorse for turning opaque activations into a vocabulary of named, steerable directions — the first scalable bridge from "a vector fired" to "the model is thinking about X." But be precise about the limits: