🤖 AI Summary
Existing AI benchmarks predominantly assess programming proficiency, leaving a critical gap in the systematic evaluation of high-economic-value knowledge work—such as investment banking, management consulting, legal practice, and primary healthcare.
Method: We introduce APEX-v1.0, the first comprehensive benchmark for this domain, comprising 200 realistic, expert-designed tasks with fine-grained, rubric-based scoring. Leveraging an innovative “expert-defined tasks + LLM-based automated adjudication” paradigm, we conduct large-scale, reproducible evaluations across 23 state-of-the-art models.
Contribution/Results: GPT-5 (Thinking = High) achieves the highest mean score (64.2%), while Qwen3-235B emerges as the top-performing open-weight model. Nevertheless, all models fall substantially short of human expert performance. APEX-v1.0 establishes the first standardized, scalable, and reproducible evaluation infrastructure for non-coding professional reasoning, advancing both technical development and responsible governance of AI in high-stakes, knowledge-intensive domains.
📝 Abstract
We introduce the first version of the AI Productivity Index (APEX), a benchmark for assessing whether frontier AI models can perform knowledge work with high economic value. APEX addresses one of the largest inefficiencies in AI research: outside of coding, benchmarks often fail to test economically relevant capabilities. APEX-v1.0 contains 200 test cases and covers four domains: investment banking, management consulting, law, and primary medical care. It was built in three steps. First, we sourced experts with top-tier experience, e.g., investment bankers from Goldman Sachs. Second, experts created prompts that reflect high-value tasks in their day-to-day work. Third, experts created rubrics for evaluating model responses. We evaluate 23 frontier models on APEX-v1.0 using an LM judge. GPT 5 (Thinking = High) achieves the highest mean score (64.2%), followed by Grok 4 (61.3%) and Gemini 2.5 Flash (Thinking = On) (60.4%). Qwen 3 235B is the best-performing open-source model and seventh best overall. There is a large gap between the performance of even the best models and human experts, highlighting the need for better measurement of models' ability to produce economically valuable work.
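The rubric-based scoring and mean-score aggregation described above can be sketched as follows. This is a minimal illustration under assumed data structures: the `Criterion` schema, weighting scheme, and the idea that the LM judge emits a per-criterion verdict are hypothetical simplifications, not the paper's actual evaluation code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Criterion:
    """One expert-written rubric item (hypothetical schema)."""
    description: str
    weight: float

@dataclass
class TaskResult:
    """Judge verdicts for one test case: verdicts[i] is True if the
    LM judge decided criterion i was satisfied by the model response."""
    rubric: List[Criterion]
    verdicts: List[bool]

def task_score(result: TaskResult) -> float:
    """Weighted fraction of rubric criteria the response satisfied."""
    total = sum(c.weight for c in result.rubric)
    earned = sum(c.weight for c, ok in zip(result.rubric, result.verdicts) if ok)
    return earned / total if total else 0.0

def benchmark_score(results: List[TaskResult]) -> float:
    """Mean task score across all test cases, as a percentage."""
    return 100.0 * sum(task_score(r) for r in results) / len(results)

# Illustrative usage with an invented two-criterion rubric:
rubric = [Criterion("cites the correct statute", 2.0),
          Criterion("response is clearly structured", 1.0)]
r1 = TaskResult(rubric, [True, False])   # earns 2 of 3 weight
r2 = TaskResult(rubric, [True, True])    # earns full weight
print(benchmark_score([r1, r2]))         # mean of 66.7% and 100%
```

A benchmark-level figure like the 64.2% reported for GPT 5 would then be this mean over all 200 test cases, one `TaskResult` per case.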