Open-World Evaluations for Measuring Frontier AI Capabilities

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current AI benchmarks struggle to assess model capabilities on complex, long-horizon, real-world tasks, and can both overstate and understate deployed performance. This work proposes a complementary “open-world evaluation” paradigm built around realistic, messy tasks, such as publishing an app on the iOS App Store, and assessed through small-sample qualitative analysis rather than benchmark-scale automation. The paper also introduces CRUX (Collaborative Research for Updating AI eXpectations), a project for running such evaluations regularly. In its first instance, an AI agent developed and published a simple iOS application to the Apple App Store with only a single, avoidable manual intervention, suggesting that open-world evaluations can provide early warning of capabilities that may soon become widespread.

📝 Abstract

Benchmark-based evaluation remains important for tracking frontier AI progress. But it can both overstate and understate deployed capability because it privileges tasks that can be precisely specified, automatically graded, easy to optimize for, and run with low budgets and short time horizons. We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation. In this paper we survey recent open-world evaluations, identify their strengths and limitations, and introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations regularly. As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent completed the task with only a single avoidable manual intervention, suggesting that open-world evaluations can provide early warning of capabilities that may soon become widespread. We conclude with recommendations for designing and reporting open-world evals.

Problem

Research questions and friction points this paper is trying to address.

open-world evaluations

frontier AI capabilities

benchmark limitations

real-world tasks

AI evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

open-world evaluations

frontier AI

CRUX