AI SAFETY // EVALS

Dangerous-Capability Evals

Shevlane, Phuong et al. · 2023–24

Dangerous-capability evals are structured tests that measure what a frontier model could do if misused — and whether it would — turning vague worries about extreme risk into reproducible numbers.

What they measure

Two orthogonal questions. Capability: what damage could the model enable if a malicious actor pushed it to its limit? Alignment / propensity: left to its own behavior, would it actually try? A model can be dangerous on either axis: capable-but-compliant is still a misuse risk, because a malicious user can direct it; willing-but-incompetent is not yet a risk. Shevlane et al. flag a set of threat categories, which later empirical work refined.

How they're run

The hard part is capability elicitation: you measure the model's ceiling, not its lazy default. That means fine-tuning, agentic scaffolding, tool access, prompt engineering, and best-of-n sampling (anything a determined adversary would do), so that a refusal or a weak first answer doesn't read as "safe." Threats with a human in the loop add uplift trials: does a person with model access outperform one with only a search engine? The empirical companion, Phuong et al.'s Evaluating Frontier Models for Dangerous Capabilities, runs this battery across persuasion, cyber-security, self-proliferation, and self-reasoning on Gemini 1.0, finding no strong dangerous capabilities yet, but early warning signs.
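To make the best-of-n point concrete, here is a minimal sketch of how repeated sampling changes the number you report. It uses the pass@k estimator familiar from code-generation benchmarks, which is an assumption on our part and not necessarily the metric any of the cited papers used. The function name `pass_at_k` and the trial counts are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts succeeds.

    n: total attempts actually sampled on a task
    c: how many of those attempts succeeded
    k: attempt budget we assume an adversary has (1 <= k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # If fewer than k attempts failed, every size-k subset contains a success.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical task: 5 successes out of 100 sampled attempts.
    single_shot = pass_at_k(100, 5, 1)
    ten_tries = pass_at_k(100, 5, 10)
    print(f"pass@1  = {single_shot:.3f}")
    print(f"pass@10 = {ten_tries:.3f}")
```

With these made-up counts, a task that looks like a 5% success rate on a single attempt comes out above 40% once the adversary is allowed ten tries, which is why a weak first answer says little about the ceiling.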

Why they matter

These evals are the measurement layer under Responsible Scaling Policies, preparedness frameworks, and government safety-institute testing: capability thresholds gate deployment and trigger mitigations. They convert "is it safe?" into graded numbers a release decision can hang on. The structural caveat, the elicitation gap, never goes away: a passing eval shows only that your elicitation did not surface the capability. A better prompt, a fine-tune, or new scaffolding can expose it later. You can demonstrate that a capability is present; you cannot prove its absence. Treat negative results as lower bounds on capability with an expiry date, not clean bills of health.
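The gating logic described above can be sketched in a few lines. Everything here is hypothetical: the `EvalResult` fields, the 90-day validity window, and the three outcome labels are illustrative assumptions, not taken from any published scaling policy or preparedness framework.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class EvalResult:
    capability: str            # e.g. "cyber-security" (label is illustrative)
    score: float               # best score reached under elicitation, 0..1
    threshold: float           # hypothetical policy threshold for this capability
    run_on: date               # when the eval was run
    valid_for_days: int = 90   # hypothetical shelf life of a negative result


def gate(result: EvalResult, today: date) -> str:
    """Return a coarse decision for one capability eval."""
    # A positive result is decisive: the capability was demonstrated.
    if result.score >= result.threshold:
        return "mitigate"
    # A negative result only says elicitation did not surface the
    # capability, so it expires and must be rerun with newer methods.
    if today > result.run_on + timedelta(days=result.valid_for_days):
        return "re-evaluate"
    return "below-threshold"


if __name__ == "__main__":
    r = EvalResult("cyber-security", score=0.2, threshold=0.5,
                   run_on=date(2024, 1, 1))
    print(gate(r, date(2024, 2, 1)))
    print(gate(r, date(2024, 6, 1)))
```

The asymmetry is deliberate: a score above the threshold never expires, while a score below it is trusted only for a limited window.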