Scaling Laws

Kaplan et al. 2020 · Hoffmann et al. (Chinchilla) 2022

Papers:
Scaling Laws for Neural Language Models (Kaplan et al., 2020)
Training Compute-Optimal Large Language Models, "Chinchilla" (Hoffmann et al., 2022)

Model capability is not magic: test loss follows a smooth, predictable power law in three inputs (parameters, data, and compute). Scale any one of them up, without letting the other two become the bottleneck, and test loss falls along a straight line on a log-log plot. This is the single most important "why" behind everything you'll be red-teaming.
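To see why a power law shows up as a straight line on a log-log plot, it may help to write it out. The form below is a generic sketch, not a quotation from either paper: x stands for whichever input is being scaled (parameters, data, or compute), and x_c and \alpha_x are assumed fitted constants.

```latex
% Assumed generic power law: x is the scaled input, x_c and \alpha_x are fitted constants.
L(x) = \left(\frac{x_c}{x}\right)^{\alpha_x}
\quad\Longrightarrow\quad
\log L = \alpha_x \log x_c \;-\; \alpha_x \log x
```

Taking logs turns the power law into a linear function of \log x with slope -\alpha_x, which is the straight line the plot shows.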

The core finding (Kaplan 2020)

OpenAI showed that cross-entropy test loss scales as a power law with model size (N), dataset size (D), and training compute (C), across seven orders of magnitude — and that architecture details (depth vs width) matter far less than raw scale. Two consequences that still shape the field:

Bigger is more sample-efficient. Larger models reach a given loss with fewer tokens — the opposite of the old "small model + lots of data" intuition.
Loss is forecastable. You can predict a bigger model's loss before you train it by extrapolating the curve fitted on smaller runs. This is why labs are willing to commit hundreds of millions of dollars to a single run.
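The forecasting idea can be sketched as a straight-line fit in log-log space followed by extrapolation. This is a minimal illustration under stated assumptions, not the fitting procedure from Kaplan et al.: the data points are synthetic, the constants `true_a` and `true_alpha` are made up, the helper names are hypothetical, and the simple form `loss = a * x**(-alpha)` ignores any irreducible-loss floor.

```python
import numpy as np

def fit_power_law(x, loss):
    """Fit loss ≈ a * x**(-alpha) via least squares on (log x, log loss)."""
    slope, intercept = np.polyfit(np.log(x), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)  # (a, alpha)

def predict_loss(a, alpha, x):
    """Evaluate the fitted power law at a new scale x."""
    return a * x ** (-alpha)

# Synthetic "small run" results, generated from a made-up power law
# purely for illustration (these are not numbers from any paper).
true_a, true_alpha = 12.0, 0.08
params = np.array([1e6, 1e7, 1e8, 1e9])
losses = true_a * params ** (-true_alpha)

# Fit on the small runs, then forecast a model 100x larger than any of them.
a, alpha = fit_power_law(params, losses)
forecast = predict_loss(a, alpha, 1e11)
print(f"fitted alpha = {alpha:.3f}, forecast loss at 1e11 params = {forecast:.3f}")
```

With real training runs the points are noisy and the fit is only trustworthy over the range where the power law actually holds, so the extrapolation is a forecast rather than a guarantee.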

The correction (Chinchilla 2022)

DeepMind found that Kaplan's recipe over-weighted parameters and under-trained on data. For a fixed compute budget, parameters and training tokens should grow in roughly equal proportion, which works out to about 20 training tokens per parameter. The evidence: Chinchilla (70B parameters) beat Gopher (280B) on the same compute budget, so a model 4× smaller, trained on several times more data, won. Labs quickly re-tuned their model/data ratios, and "compute-optimal" became the default framing.
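The 20-tokens-per-parameter rule can be turned into a rough sizing calculation. The sketch below assumes the common approximation that training cost is about 6 FLOPs per parameter per token (C ≈ 6·N·D), which is not stated in the text above. The token counts (about 1.4T for Chinchilla, about 300B for Gopher) are taken from the Hoffmann et al. paper rather than from this page. Function names are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20.0        # rule of thumb quoted in the text
FLOPS_PER_PARAM_TOKEN = 6.0    # assumption: C ≈ 6 * N * D

def training_flops(n_params, n_tokens):
    """Approximate training compute under the C ≈ 6·N·D assumption."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

def compute_optimal(flops_budget, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (params, tokens) with D = tokens_per_param * N.

    From C = 6·N·D and D = r·N it follows that N = sqrt(C / (6·r)).
    """
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    return n_params, tokens_per_param * n_params

# Reported sizes: Chinchilla 70B params / ~1.4T tokens, Gopher 280B / ~300B tokens.
chinchilla_flops = training_flops(70e9, 1.4e12)
gopher_flops = training_flops(280e9, 300e9)
gopher_ratio = 300e9 / 280e9   # tokens per parameter actually used for Gopher

n_opt, d_opt = compute_optimal(chinchilla_flops)
print(f"Chinchilla budget ≈ {chinchilla_flops:.2e} FLOPs, Gopher ≈ {gopher_flops:.2e} FLOPs")
print(f"Compute-optimal split: {n_opt / 1e9:.0f}B params, {d_opt / 1e12:.1f}T tokens")
print(f"Gopher tokens per parameter ≈ {gopher_ratio:.2f}")
```

Under these assumptions the two models land in the same compute ballpark, but Gopher spent it at roughly one token per parameter while the 20:1 rule recovers Chinchilla's own configuration.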

Why a red teamer cares

Capability is a moving target. The model you harden today is a checkpoint on a curve. Threat models must assume next-gen scale, not just current behavior.
Emergence. Some capabilities, and with them some attack surfaces (tool use, multi-step reasoning, code execution), appear abruptly past a scale threshold rather than improving gradually. What's safe at 7B can be exploitable at 70B.
Inverse scaling. A minority of behaviors get worse with scale (sycophancy, certain prompt-injection susceptibilities). Bigger ≠ safer by default.
Data scaling is an attack surface. Compute-optimal training means models are hungry for enormous, loosely-curated corpora — which is exactly the door for data & model poisoning and supply-chain risk.

Where the curve bends next: pre-training scaling is hitting data limits, so the frontier has shifted to inference-time / test-time compute (reasoning models that "think" longer). The scaling mindset is the same — more compute, more capability, predictably — but the knob moved from training to inference, and with it, the attack surface.