RED TEAM // FOUNDATIONS

Scaling Laws

Kaplan et al. 2020 · Hoffmann et al. (Chinchilla) 2022

Model capability is not magic. To a good approximation, test loss follows a smooth, predictable power law in three inputs: parameters, data, and compute. Push any one of them up, without letting the other two become the bottleneck, and loss falls along a straight line on a log-log plot. This is the single most important "why" behind everything you'll be red-teaming.
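To make the "straight line on a log-log plot" concrete, here is a minimal sketch that fits a power law of the form loss = a * x^(-alpha) by ordinary least squares on the logarithms. The helper name `fit_power_law`, the constants, and the data points are all made up for illustration; they are not measurements from either paper, and real loss curves typically also need an irreducible-loss term that this sketch leaves out.

```python
import math


def fit_power_law(xs, losses):
    """Fit loss = a * x**(-alpha) via least squares in log-log space.

    Taking logs gives log(loss) = log(a) - alpha * log(x), which is a
    straight line with slope -alpha and intercept log(a).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in losses]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, illustrative data only: losses generated from a known power law.
TRUE_A, TRUE_ALPHA = 10.0, 0.08
param_counts = [1e6, 1e7, 1e8, 1e9, 1e10]
losses = [TRUE_A * n ** (-TRUE_ALPHA) for n in param_counts]

a_hat, alpha_hat = fit_power_law(param_counts, losses)

# Extrapolate one order of magnitude beyond the largest "trained" model.
predicted_loss_100b = a_hat * 1e11 ** (-alpha_hat)
print(f"a={a_hat:.3f} alpha={alpha_hat:.3f} loss@1e11={predicted_loss_100b:.3f}")
```

The point of the exercise is the last step: once the line is fitted on small models, the loss of a much larger model can be estimated before anyone pays to train it.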

The core finding (Kaplan 2020)

OpenAI showed that cross-entropy test loss scales as a power law with model size (N), dataset size (D), and training compute (C), across seven orders of magnitude — and that architecture details (depth vs width) matter far less than raw scale. Two consequences that still shape the field:

The correction (Chinchilla 2022)

DeepMind found that Kaplan's recipe over-weighted parameters and under-trained on data. For a fixed compute budget, parameters and training tokens should grow in roughly equal proportion, which works out to about 20 training tokens per parameter. The evidence: Chinchilla (70B parameters) outperformed Gopher (280B) on the same training compute budget. A model 4× smaller, trained on more than four times as many tokens, won. Labs quickly re-tuned their model/data ratios, and "compute-optimal" became the default framing.
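The 20-tokens-per-parameter rule can be turned into a back-of-the-envelope sizing calculation. The sketch below assumes the common approximation that training compute is C ≈ 6 · N · D FLOPs (N parameters, D tokens); that approximation is not stated in the text above and is only a rough estimate. Substituting D = 20 · N gives C ≈ 120 · N², so N ≈ sqrt(C / 120). The function name `compute_optimal` is hypothetical.

```python
import math

# Assumption: training FLOPs are approximated by C = 6 * N * D.
FLOPS_PER_PARAM_TOKEN = 6.0
# Rule of thumb from the text: about 20 training tokens per parameter.
TOKENS_PER_PARAM = 20.0


def compute_optimal(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Return (n_params, n_tokens) for a FLOP budget under D = r * N.

    With C = 6 * N * D and D = r * N, we get C = 6 * r * N**2,
    so N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param)
    )
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Sanity check against Chinchilla's reported shape: 70B params, 1.4T tokens.
budget = FLOPS_PER_PARAM_TOKEN * 70e9 * 1.4e12
n, d = compute_optimal(budget)
print(f"budget={budget:.2e} FLOPs -> params={n:.2e}, tokens={d:.2e}")
```

Feeding in the budget implied by Chinchilla's own configuration recovers roughly 70B parameters and 1.4T tokens, which is just the rule of thumb checked against itself rather than independent evidence.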

Why a red teamer cares

Where the curve bends next: pre-training scaling is running into limits on available training data, so the frontier has shifted toward inference-time (test-time) compute, meaning reasoning models that "think" longer before answering. The scaling mindset is the same (more compute, more capability, predictably), but the knob moved from training to inference, and with it, the attack surface.