The AI Productivity Index (APEX)

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing AI benchmarks predominantly assess programming proficiency, leaving a critical gap in the systematic evaluation of high-economic-value knowledge work such as investment banking, management consulting, legal practice, and primary healthcare. Method: We introduce APEX-v1.0, the first comprehensive benchmark for this domain, comprising 200 realistic, expert-designed tasks with fine-grained, rubric-based scoring. Using an "expert-defined tasks + LLM-based automated adjudication" paradigm, we conduct large-scale, reproducible evaluations of 23 state-of-the-art models. Contribution/Results: GPT-5 (Thinking = High) achieves the highest average accuracy (64.2%), while Qwen 3 235B is the top-performing open-weight model. Nevertheless, all models fall substantially short of human expert performance. APEX-v1.0 establishes the first standardized, scalable, and reproducible evaluation infrastructure for non-coding professional reasoning, advancing both technical development and responsible governance of AI in high-stakes, knowledge-intensive domains.

📝 Abstract
We introduce the first version of the AI Productivity Index (APEX), a benchmark for assessing whether frontier AI models can perform knowledge work with high economic value. APEX addresses one of the largest inefficiencies in AI research: outside of coding, benchmarks often fail to test economically relevant capabilities. APEX-v1.0 contains 200 test cases and covers four domains: investment banking, management consulting, law, and primary medical care. It was built in three steps. First, we sourced experts with top-tier experience, e.g., investment bankers from Goldman Sachs. Second, experts created prompts that reflect high-value tasks in their day-to-day work. Third, experts created rubrics for evaluating model responses. We evaluate 23 frontier models on APEX-v1.0 using an LM judge. GPT 5 (Thinking = High) achieves the highest mean score (64.2%), followed by Grok 4 (61.3%) and Gemini 2.5 Flash (Thinking = On) (60.4%). Qwen 3 235B is the best-performing open-source model and seventh best overall. There is a large gap between the performance of even the best models and human experts, highlighting the need for better measurement of models' ability to produce economically valuable work.
Problem

Research questions and friction points this paper is trying to address.

APEX measures whether frontier AI models can perform knowledge work with high economic value
Outside of coding, existing benchmarks rarely test economically relevant capabilities
Quantifies the gap between frontier-model and human-expert performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

200 expert-designed tasks spanning investment banking, consulting, law, and primary care
Experts with top-tier experience wrote prompts and grading rubrics for high-value daily work
An LM judge scores 23 frontier models against the expert rubrics
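The "expert-defined rubric + LM judge" paradigm above amounts to scoring each model response against a weighted checklist of expert criteria. A minimal sketch follows; the `Criterion` type, the sample criteria, and the keyword-matching stand-in judge are all illustrative assumptions, not APEX's actual rubrics or judging prompts:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One rubric item written by a domain expert (illustrative)."""
    description: str
    weight: float = 1.0


def score_response(response: str, rubric: List[Criterion],
                   judge: Callable[[str, str], bool]) -> float:
    """Weighted fraction of rubric criteria the judge marks as met.

    In APEX the judge would be an LM prompted with the criterion and
    the response; here it is any boolean function of (response, criterion).
    """
    total = sum(c.weight for c in rubric)
    met = sum(c.weight for c in rubric if judge(response, c.description))
    return met / total if total else 0.0


# Toy rubric and a keyword-matching stand-in for an LM judge
# (checks whether the criterion's last word appears in the response):
rubric = [
    Criterion("mentions discounted cash flow"),
    Criterion("states the WACC assumption"),
]
toy_judge = lambda resp, crit: crit.split()[-1].lower() in resp.lower()

print(score_response("DCF: discounted cash flow with a 9% WACC assumption",
                     rubric, toy_judge))  # both criteria met -> 1.0
```

A real judge would replace `toy_judge` with an LM call that returns a met/not-met verdict per criterion; the scoring logic stays the same.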
Bertie Vidgen
Oxford, Mercor
Evals · MCP + RAG · Alignment + Safety · Content Moderation
Abby Fennelly
Mercor
Evan Pinnix
Mercor
Chirag Mahapatra
Mercor
Zach Richards
Mercor
Austin Bridges
Mercor
Calix Huang
Mercor
Ben Hunsberger
Mercor
Fez Zafar
Mercor
Brendan Foody
Mercor
Dominic Barton
Mercor
Cass R. Sunstein
Harvard Law School
Eric Topol
Professor and EVP, Scripps Research
A.I. · genomics · digital · individualized medicine
Osvald Nitski
Product Manager at Mercor
Machine Learning · Natural Language Processing · Artificial Intelligence