evals & dangerous capabilities · frontier safety policy

Responsible Scaling

RSP · Preparedness · Frontier Safety Framework

Policy: Anthropic Responsible Scaling Policy ↗

Responsible scaling policies are voluntary commitments in which a lab ties model capability thresholds to required safeguards and pauses further scaling until those safeguards are in place.

The core idea

The mechanism is an "if-then" commitment. The lab runs evaluations on a model for specific dangerous capabilities — uplift to bioweapons or cyberattacks, autonomous self-replication, undermining oversight. Each policy defines capability thresholds and the safeguards that must be live before a model crossing one is trained or deployed: hardened security against weight theft, deployment filters, access controls, alignment evidence. If a model reaches a threshold and the matching safeguards aren't ready, the commitment is to hold — pause or restrict — until they are. Safety requirements escalate as capabilities climb, so the policy is a ladder rather than a single gate. In practice the binding question is rarely the threshold itself but whether the evals are sharp enough to detect when a model has crossed it.
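The if-then logic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any lab's actual process: the `Threshold` class, the `decide` function, the capability label `bio_uplift`, the safeguard names, and the numeric scores are all hypothetical, and real policies rely on qualitative judgment rather than a single number per capability.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """One rung of the ladder (all names and numbers here are hypothetical)."""
    capability: str                      # e.g. "bio_uplift"
    trip_score: float                    # eval score at or above which the rung is crossed
    required_safeguards: frozenset[str]  # safeguards that must be live once it is crossed


def decide(eval_scores, thresholds, live_safeguards):
    """Return ("proceed" or "hold", missing safeguards per crossed capability).

    Assumes eval_scores holds a score for every capability named in thresholds.
    """
    missing: dict[str, set[str]] = {}
    for t in thresholds:
        if eval_scores[t.capability] >= t.trip_score:
            gap = t.required_safeguards - live_safeguards
            if gap:
                missing.setdefault(t.capability, set()).update(gap)
    return ("hold" if missing else "proceed"), missing


# A two-rung ladder: the higher rung demands strictly more safeguards.
LADDER = [
    Threshold("bio_uplift", 0.5, frozenset({"deployment_filters"})),
    Threshold("bio_uplift", 0.8, frozenset({"deployment_filters",
                                            "hardened_weight_security"})),
]
```

With only `deployment_filters` live, `decide({"bio_uplift": 0.6}, LADDER, {"deployment_filters"})` returns a "proceed" decision, while a score of 0.85 crosses the second rung and returns "hold" with `hardened_weight_security` listed as missing. Note that the function sees only the measured eval score, which is why eval quality becomes the binding question.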

The three frameworks

All three labs converged on the same shape; the names and tiers differ.

Anthropic — Responsible Scaling Policy (RSP). Uses AI Safety Levels (ASL-1…ASL-4+), modeled on biosafety levels. Each level pairs a capability bar with required security and deployment standards; current frontier models sit around ASL-2/ASL-3. anthropic.com ↗
OpenAI — Preparedness Framework. Defines Tracked Categories (Biological/Chemical, Cybersecurity, AI Self-improvement) with High and Critical capability levels, plus research categories for emerging risks; safeguards must "sufficiently minimize" risk before deployment. openai.com ↗
Google DeepMind — Frontier Safety Framework (FSF). Defines Critical Capability Levels (CCLs) across domains like cyber, CBRN, autonomy, ML R&D, and harmful manipulation; reaching a CCL triggers a safety-case review before launch. deepmind.google ↗

Strengths & limits

The strength is that these turn open-ended "we take safety seriously" into concrete, conditional, pre-committed actions: a named eval, a named threshold, a named safeguard, made public in advance and harder to walk back quietly. They also give labs a partly shared vocabulary and create "race to the top" pressure, since competitors can be measured against one another.

The limits are equally real and worth stating plainly. These are voluntary and self-enforced — each lab writes its own thresholds, runs its own evals, judges its own compliance, and can revise the policy (some revisions have loosened commitments). A framework is only as good as the evals behind it: if a dangerous capability is present but the eval doesn't surface it, the threshold never trips. "Pause" depends on a lab choosing to halt under competitive pressure, with no external audit or enforcement. Treat them as genuine, useful self-governance — not as a regulatory backstop.
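The eval-dependence point can be made concrete with a toy model. Everything here is an assumption for illustration: the idea that an eval surfaces a fixed fraction (`elicitation`) of a model's true capability, the function names, and the numbers. It is a sketch of the failure mode, not a description of how any lab scores models.

```python
def measured_score(true_capability: float, elicitation: float) -> float:
    """Toy assumption: the eval surfaces only a fraction of the true capability."""
    return true_capability * elicitation


def threshold_trips(true_capability: float, elicitation: float,
                    trip_score: float) -> bool:
    """The policy can only react to what the eval measures, not to the truth."""
    return measured_score(true_capability, elicitation) >= trip_score


# Hypothetical numbers: the true capability (0.9) is above the threshold (0.8).
weak_eval = threshold_trips(true_capability=0.9, elicitation=0.7, trip_score=0.8)
strong_eval = threshold_trips(true_capability=0.9, elicitation=1.0, trip_score=0.8)
```

Here `weak_eval` is `False` and `strong_eval` is `True`: the same model trips the threshold only when the eval elicits its full capability, so an under-eliciting eval silently leaves the commitment untriggered.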