METR
An independent nonprofit that scientifically measures whether and when frontier AI systems acquire the autonomous, agentic capabilities that could pose catastrophic risk.
What they do
METR runs independent, third-party evaluations of frontier models, focused on autonomous task completion and the capacity to accelerate AI R&D — the capability dimensions most relevant to catastrophic risk. They build agentic task suites with calibrated human baselines and run them as structured experiments rather than one-off demos. They partner with labs for pre-deployment access (OpenAI, Anthropic, xAI have provided model access and compute), but take no compensation for the work and also evaluate models independently after release. That independence is the point: the goal is an outside read on capability, not a vendor benchmark.
The time-horizon metric
Their signature result is the 50%-task-completion time horizon: the length of a task — measured by how long it takes a human professional — that a model can complete autonomously with ~50% reliability. It collapses a model's agentic capability into one human-time number, which makes models comparable across a single trend line rather than a grab-bag of pass-rates.
- METR's finding: this 50%-reliability horizon has roughly doubled about every 7 months over the past ~6 years (they note the trend may have accelerated since 2024).
- The measure is reliability-dependent — the 80% horizon (tasks done 4-out-of-5 times) is meaningfully shorter than the 50% one.
- It's an empirical trend on their task distribution, not a law; whether it generalizes to messy real-world work is an open extrapolation, which is why the precise definition matters.
Why it matters
Independent evaluation is load-bearing for the governance scaffolding around frontier AI. Responsible Scaling Policies and government oversight both lean on capability thresholds — and a lab grading its own homework is a weak check. An outside evaluator with reproducible methods and a quantitative trend line gives that scaffolding something concrete to anchor to, and gives the rest of us an external check on lab self-assessment.