interpretabilityAnthropic · 2022

Superposition

Elhage et al. (Anthropic) · Toy Models of Superposition

SourceToy Models of Superposition (Anthropic) ↗

Superposition is how a network represents more features than it has neurons — by encoding features as nearly-orthogonal directions that share dimensions, which makes individual neurons polysemantic and resistant to one-neuron-at-a-time interpretation.

The phenomenon

If you assumed one neuron = one concept, an n-neuron layer could only hold n features. Networks routinely beat that bound. They do it by packing m > n features into the activation space as distinct directions. With only n dimensions you can't fit m mutually orthogonal vectors, so the directions are only nearly orthogonal — they overlap a little, creating interference. The visible symptom is polysemanticity:

A single neuron fires for several unrelated features (e.g. an academic-citation pattern and a Korean-text pattern), because multiple feature directions project onto it.
Meaning lives in directions across the population, not in axis-aligned neurons — so reading a model neuron-by-neuron is reading the wrong basis.

Why it happens

The lever is sparsity: real features fire rarely, so two features sharing a direction seldom activate at once, and the interference cost stays low. Under sparsity the network trades a little noise for a lot of capacity — Elhage et al. show you can pack exponentially many sparse features into a fixed space. The toy model makes the tradeoff legible: sweeping feature sparsity reveals a clean phase change. Dense features get dedicated orthogonal dimensions (no superposition); past a sparsity threshold the model abruptly switches to packing features in superposition, arranging them in structured geometric configurations (e.g. antipodal pairs, pentagons) to minimize interference.

Why it's the central obstacle

Superposition is why naive interpretability fails. If neurons were monosemantic you could label them and read the model off directly; because they're polysemantic, no neuron-level story is faithful — the features you want are linear combinations hidden in the activations. This is the core motivation for sparse autoencoders, which learn an overcomplete, sparse basis to un-mix superposition back into monosemantic features you can actually name. See Sparse Autoencoders.