Dangerous-Capability Evals
Dangerous-capability evals are structured tests that measure what a frontier model could do if misused, and whether it would attempt harm on its own, turning vague worries about extreme risk into reproducible numbers.
What they measure
Two orthogonal questions. Capability: what damage could the model enable if a malicious actor pushed it to its limit? Alignment / propensity: left to its own behavior, would it actually try? The two axes do not carry equal weight: a capable model with no harmful propensity is still a misuse risk, because a malicious user can supply the intent, while a willing but incompetent model is not yet dangerous. The threat categories Shevlane et al. flag, refined in later empirical work:
- Cyber-offense — finding and exploiting vulnerabilities, writing malware, autonomous intrusion.
- Bio / CBRN uplift — lowering the expertise barrier to chemical, biological, radiological, or nuclear harm.
- Persuasion & manipulation — deception, social engineering, moving a human toward a target belief or action.
- Autonomy & self-replication — self-reasoning, acquiring resources, copying itself across machines (self-proliferation).
How they're run
The hard part is capability elicitation: you measure the model's ceiling, not its lazy default. That means fine-tuning, agentic scaffolding, tool access, prompt engineering, and best-of-n sampling (anything a determined adversary would do), so that a refusal or a weak first answer doesn't read as "safe." Threats with a human in the loop add uplift trials: does a person with model access outperform one with only a search engine? The empirical companion, Phuong et al.'s Evaluating Frontier Models for Dangerous Capabilities, runs this battery across persuasion, cyber-security, self-proliferation, and self-reasoning on Gemini 1.0, finding no strong dangerous capabilities yet, but early warning signs.
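To make the gap between a model's first answer and its elicited ceiling concrete, here is a minimal Python sketch of two quantities mentioned above. It is an illustration under stated assumptions, not the scoring used by Phuong et al. or any specific eval suite. `pass_at_k` uses the unbiased pass@k estimator popularized by Chen et al. (2021) for code generation, applied here as a stand-in for best-of-n success; `uplift` is a plain difference in success rates between a model-assisted arm and a search-only arm. All function names and the numbers in the usage lines are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one success among k samples) from c successes in n samples."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset contains a success.
        return 1.0
    # 1 minus the probability that a random size-k subset contains only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def uplift(model_successes: int, model_n: int,
           control_successes: int, control_n: int) -> float:
    """Difference in task success rate: model-assisted arm minus search-only arm."""
    return model_successes / model_n - control_successes / control_n


if __name__ == "__main__":
    # Hypothetical numbers: 1 success observed in 20 sampled attempts.
    print(pass_at_k(n=20, c=1, k=1))   # single-attempt success rate
    print(pass_at_k(n=20, c=1, k=10))  # success rate with best-of-10
    # Hypothetical uplift trial: 9/30 succeed with the model, 6/30 without.
    print(uplift(9, 30, 6, 30))
```

With these illustrative inputs, a task the model solves on roughly 1 in 20 single attempts comes out at about 0.5 under best-of-10, which is why a weak first answer says little about the ceiling. A real uplift trial would also report a confidence interval, which this sketch omits.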
Why they matter
These evals are the measurement layer under Responsible Scaling Policies, preparedness frameworks, and government safety-institute testing: capability thresholds gate deployment and trigger mitigations. They convert "is it safe?" into graded numbers a release decision can hang on. The structural caveat, the elicitation gap, never goes away: a passing eval proves only that your elicitation failed to surface the capability. A better prompt, a fine-tune, or new scaffolding can expose it later. You can demonstrate a capability is present; you cannot prove its absence. Treat a negative result as a provisional finding with an expiry date, not a clean bill of health: what you measured is a lower bound on what the model can do.
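One way to picture threshold gating combined with an expiry date on negative results is the Python sketch below. It assumes a single elicited score per threat category and a fixed re-test window; the `EvalResult` fields, the `gate` function, the 180-day window, and the threshold values are all hypothetical and are not taken from any published scaling policy or preparedness framework.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class EvalResult:
    category: str       # e.g. "cyber-offense" (hypothetical label)
    score: float        # best score reached under elicitation, 0.0 to 1.0
    threshold: float    # hypothetical level at which mitigations trigger
    evaluated_on: date  # when the elicitation was run


def gate(result: EvalResult, today: date,
         max_age: timedelta = timedelta(days=180)) -> str:
    """Return a release decision for one threat category."""
    if result.score >= result.threshold:
        # A demonstrated capability does not expire: it always triggers mitigations.
        return "mitigate"
    if today - result.evaluated_on > max_age:
        # A stale negative result no longer counts as evidence of safety.
        return "re-evaluate"
    return "proceed"


if __name__ == "__main__":
    result = EvalResult("cyber-offense", score=0.12, threshold=0.5,
                        evaluated_on=date(2024, 1, 1))
    print(gate(result, today=date(2024, 3, 1)))
    print(gate(result, today=date(2025, 1, 1)))
```

The asymmetry is deliberate: a positive result holds regardless of age, while a negative one lapses, mirroring the point that presence can be demonstrated but absence cannot.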