APEX-SWE

📅 2026-01-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current AI evaluation methodologies are largely confined to narrow tasks and fail to assess models' capacity to perform high-value work in real-world software engineering contexts. This work proposes the APEX-SWE benchmark, establishing the first evaluation paradigm centered on authentic software engineering workflows. It evaluates models through two challenge types, integration tasks and observability tasks, to probe cognitive reasoning and proactive decision-making in complex, open-ended environments. The assessment incorporates end-to-end system integration, cloud-native interactions, and telemetry signal analysis, combining structured and unstructured contextual information. Among eight state-of-the-art models evaluated, Gemini 3 Pro (Thinking = High) achieves the highest performance (Pass@1 of 25%), owing to its ability to distinguish hypotheses from verified facts and to actively resolve uncertainty before acting.

📝 Abstract
We introduce the AI Productivity Index for Software Engineering (APEX-SWE), a benchmark for assessing whether frontier AI models can execute economically valuable software engineering work. Unlike existing evaluations that focus on narrow, well-defined tasks, APEX-SWE assesses two novel task types that reflect real-world software engineering work: (1) Integration tasks (n=100), which require constructing end-to-end systems across heterogeneous cloud primitives, business applications, and infrastructure-as-code services, and (2) Observability tasks (n=100), which require debugging production failures using telemetry signals such as logs and dashboards, as well as unstructured context. We evaluated eight frontier models on APEX-SWE. Gemini 3 Pro (Thinking = High) performs best, with a Pass@1 score of 25%. Our analysis shows that strong performance is primarily driven by epistemic reasoning, defined as the ability to distinguish between assumptions and verified facts, combined with agency to resolve uncertainty prior to acting. We open-source the APEX-SWE evaluation harness and a dev set (n=50).
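The abstract reports Pass@1 scores. The paper's own harness is not shown here, but Pass@1 is conventionally computed with the unbiased pass@k estimator of Chen et al. (2021); the sketch below assumes that convention, and the per-task sample counts are purely illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples drawn for a task,
    c = samples that passed, k = attempt budget."""
    if n - c < k:
        return 1.0  # too few failures for any k-subset to miss
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark-level Pass@1 is the mean of per-task pass@1.
# (n, c) pairs below are made-up illustrative numbers, not paper data.
tasks = [(4, 1), (4, 0), (4, 2), (4, 1)]
score = sum(pass_at_k(n, c, 1) for n, c in tasks) / len(tasks)
print(f"Pass@1 = {score:.2f}")  # → 0.25
```

For k=1 the estimator reduces to c/n per task, so averaging the fraction of passing samples over all tasks gives the same number.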
Problem

Research questions and friction points this paper is trying to address.

software engineering
AI evaluation
integration tasks
observability tasks
frontier AI models
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI Productivity Index
Integration Tasks
Observability Tasks
Epistemic Reasoning
Software Engineering Benchmark
Authors

Abhishek Kottamasu (Mercor)
Akul Datta (Mercor)
Aakash Barthwal (Mercor)
Chirag Mahapatra (Mercor)
Ajay Arun (Mercor)
Adarsh Hiremath (Mercor)
Brendan Foody (Mercor)
Bertie Vidgen (Oxford, Mercor)
Evals
MCP + RAG
Alignment + Safety
Content Moderation