GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

📅 2025-10-05
🤖 AI Summary
This study addresses the lack of rigorous evaluation of AI models on real-world, economically valuable work. It introduces GDPval, a benchmark covering the majority of U.S. Bureau of Labor Statistics Work Activities for 44 occupations across the 9 sectors that contribute most to U.S. GDP. Tasks are built from the representative work of experienced industry professionals, and the authors open-source a gold subset of 220 tasks alongside a public automated grading service. Results show frontier model performance improving roughly linearly over time, with the current best models approaching industry experts in deliverable quality. Increased reasoning effort, task context, and scaffolding improve model performance, and the authors analyze the potential for models paired with human oversight to complete tasks more cheaply and quickly than unaided experts.

📝 Abstract
We introduce GDPval, a benchmark evaluating AI model capabilities on real-world economically valuable tasks. GDPval covers the majority of U.S. Bureau of Labor Statistics Work Activities for 44 occupations across the top 9 sectors contributing to U.S. GDP (Gross Domestic Product). Tasks are constructed from the representative work of industry professionals with an average of 14 years of experience. We find that frontier model performance on GDPval is improving roughly linearly over time, and that the current best frontier models are approaching industry experts in deliverable quality. We analyze the potential for frontier models, when paired with human oversight, to perform GDPval tasks more cheaply and quickly than unaided experts. We also demonstrate that increased reasoning effort, increased task context, and increased scaffolding improve model performance on GDPval. Finally, we open-source a gold subset of 220 tasks and provide a public automated grading service at evals.openai.com to facilitate future research in understanding real-world model capabilities.
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI model performance on economically valuable real-world tasks
Assessing frontier models' capabilities across major GDP-contributing occupations
Analyzing AI-human collaboration potential for cost-effective task execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark evaluates AI on real-world economic tasks
Tasks constructed from industry professionals' representative work
Open-sources subset and provides automated grading service
Authors

Tejal Patwardhan
OpenAI
Rachel Dias
OpenAI
Elizabeth Proehl
OpenAI
Grace Kim
OpenAI
Michele Wang
OpenAI
Olivia Watkins
UC Berkeley
reinforcement learning, imitation learning, computer vision, deep learning
Simón Posada Fishman
OpenAI
Marwan Aljubeh
OpenAI
Phoebe Thacker
OpenAI
Laurance Fauconnet
OpenAI
Natalie S. Kim
OpenAI
Patrick Chao
OpenAI
Samuel Miserendino
OpenAI
Gildas Chabot
OpenAI
David Li
OpenAI
Michael Sharman
OpenAI
Alexandra Barr
OpenAI
Amelia Glaese
OpenAI
Jerry Tworek
OpenAI