🤖 AI Summary
Existing AI evaluations largely focus on narrow, well-defined tasks and do not assess whether models can carry out economically valuable work in real-world software engineering settings. This work introduces the APEX-SWE benchmark, an evaluation centered on authentic software engineering workflows. It tests models on two task types, Integration tasks and Observability tasks, in complex, open-ended environments. Integration tasks require building end-to-end systems across cloud primitives, business applications, and infrastructure-as-code services; Observability tasks require debugging production failures from telemetry signals such as logs and dashboards, together with unstructured context. Among the eight frontier models evaluated, Gemini 3 Pro (Thinking = High) performs best, with a Pass@1 of 25%. The analysis attributes strong performance primarily to epistemic reasoning, meaning the ability to distinguish assumptions from verified facts, combined with the agency to resolve uncertainty before acting.
📝 Abstract
We introduce the AI Productivity Index for Software Engineering (APEX-SWE), a benchmark for assessing whether frontier AI models can execute economically valuable software engineering work. Unlike existing evaluations that focus on narrow, well-defined tasks, APEX-SWE assesses two novel task types that reflect real-world software engineering work: (1) Integration tasks (n=100), which require constructing end-to-end systems across heterogeneous cloud primitives, business applications, and infrastructure-as-code services, and (2) Observability tasks (n=100), which require debugging production failures using telemetry signals such as logs and dashboards, as well as unstructured context. We evaluated eight frontier models on APEX-SWE. Gemini 3 Pro (Thinking = High) performs best, with a Pass@1 score of 25%. Our analysis shows that strong performance is primarily driven by epistemic reasoning, defined as the ability to distinguish between assumptions and verified facts, combined with agency to resolve uncertainty prior to acting. We open-source the APEX-SWE evaluation harness and a dev set (n=50).
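The abstract reports Pass@1 without spelling out how it is computed. As a rough illustration only, the sketch below uses the commonly used unbiased pass@k estimator, which for k=1 reduces to the fraction of passing attempts per task, averaged over tasks. The function names, the task identifiers, and the outcome data are all hypothetical, and this is not taken from the APEX-SWE harness, which may aggregate results differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n attempts, c of which passed."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k attempts failed, every size-k sample contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(outcomes: dict[str, list[bool]]) -> float:
    """Average per-task pass@1 over all tasks (assumed aggregation)."""
    if not outcomes:
        raise ValueError("no tasks to score")
    scores = [pass_at_k(len(a), sum(a), 1) for a in outcomes.values()]
    return sum(scores) / len(scores)


# Hypothetical outcomes: one attempt per task, invented for illustration.
example = {
    "integration-001": [True],
    "integration-002": [False],
    "observability-001": [False],
    "observability-002": [False],
}
print(benchmark_pass_at_1(example))  # 0.25
```

Under this assumed definition, a score of 25% means that on average one task in four is solved on the first attempt.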