evals & dangerous capabilities · the third party

METR

Model Evaluation & Threat Research

An independent nonprofit that scientifically measures whether and when frontier AI systems acquire the autonomous, agentic capabilities that could pose catastrophic risk.

What they do

METR runs independent, third-party evaluations of frontier models, focused on autonomous task completion and the capacity to accelerate AI R&D, the capability dimensions most relevant to catastrophic risk. They build agentic task suites with calibrated human baselines and run them as structured experiments rather than one-off demos. They partner with labs for pre-deployment access (OpenAI, Anthropic, and xAI have provided model access and compute), but they take no compensation for the work and also evaluate models independently after release. That independence is the point: the goal is an outside read on capability, not a vendor benchmark.

The time-horizon metric

Their signature result is the 50%-task-completion time horizon: the length of a task — measured by how long it takes a human professional — that a model can complete autonomously with ~50% reliability. It collapses a model's agentic capability into one human-time number, which makes models comparable across a single trend line rather than a grab-bag of pass-rates.
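As a rough illustration of how a single "human-time" number can fall out of per-task results, the sketch below fits a logistic curve of success against the log of human completion time and reads off the task length where the fitted success probability crosses 50%. The task results are invented for illustration, the function names are hypothetical, and the logistic fit is an assumed approach rather than a description of METR's actual pipeline.

```python
import math

# Hypothetical results: (human completion time in minutes, 1 = model succeeded).
# These numbers are made up for illustration; they are not METR data.
RESULTS = [
    (1, 1), (2, 1), (4, 1), (8, 1), (15, 1), (15, 0),
    (30, 1), (30, 0), (60, 1), (60, 0), (120, 0), (240, 0), (480, 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(results, steps=20000, lr=0.05):
    """Fit P(success) = sigmoid(a + b * log2(minutes)) by gradient ascent
    on the average log-likelihood."""
    points = [(math.log2(minutes), success) for minutes, success in results]
    a, b = 0.0, 0.0
    n = len(points)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, success in points:
            err = success - sigmoid(a + b * x)
            grad_a += err
            grad_b += err * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def horizon(a, b, p=0.5):
    """Task length in minutes at which the fitted success probability equals p."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((logit - a) / b)


a, b = fit_logistic(RESULTS)
print("50% horizon (minutes):", round(horizon(a, b, 0.5), 1))
print("80% horizon (minutes):", round(horizon(a, b, 0.8), 1))
```

Because success falls as tasks get longer, the fitted slope `b` is negative, so asking for a higher reliability level always yields a shorter horizon on the same curve.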

METR's finding: this 50%-reliability horizon has doubled roughly every 7 months over the past ~6 years (they note the trend may have accelerated since 2024).
The measure is reliability-dependent: the 80% horizon (tasks done 4-out-of-5 times) is meaningfully shorter than the 50% one.
It's an empirical trend on their task distribution, not a law. Whether it generalizes to messy real-world work is an open question, which is why the precise definition matters.
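To get a feel for what a 7-month doubling time implies, the arithmetic can be sketched as follows. The helper names and the 1-hour and 8-hour figures are hypothetical, and the calculation assumes the doubling rate stays constant, which the caveat above warns is not guaranteed.

```python
import math

DOUBLING_MONTHS = 7.0  # approximate doubling time quoted in the text


def growth_factor(months, doubling_months=DOUBLING_MONTHS):
    """Multiplicative growth in the horizon after `months` of steady doubling."""
    return 2.0 ** (months / doubling_months)


def months_to_reach(current_minutes, target_minutes,
                    doubling_months=DOUBLING_MONTHS):
    """Months of steady doubling needed to grow from one horizon to another."""
    return doubling_months * math.log2(target_minutes / current_minutes)


# ~6 years is 72 months, i.e. a little over 10 doublings.
print("growth over 72 months:", round(growth_factor(72)))

# Hypothetical: going from a 1-hour horizon to an 8-hour one is 3 doublings.
print("months from 1h to 8h:", months_to_reach(60, 8 * 60))
```

Under these assumptions, six years of steady doubling compounds to a horizon roughly 1,250 times longer, and three further doublings take 21 months. Both figures are only as good as the assumption that the trend holds.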

Why it matters

Independent evaluation is load-bearing for the governance scaffolding around frontier AI. Responsible Scaling Policies and government oversight both lean on capability thresholds — and a lab grading its own homework is a weak check. An outside evaluator with reproducible methods and a quantitative trend line gives that scaffolding something concrete to anchor to, and gives the rest of us an external check on lab self-assessment.