APEX-SWE

📅 2026-01-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current AI evaluation methodologies are largely confined to narrow tasks and fail to assess models' capacity to perform high-value work in real-world software engineering contexts. This work proposes the APEX-SWE benchmark, establishing the first evaluation paradigm centered on authentic software engineering workflows. It evaluates models through two challenge types, integration tasks and observability tasks, to probe cognitive reasoning and proactive decision-making in complex, open-ended environments. The assessment incorporates end-to-end system integration, cloud-native interactions, and telemetry signal analysis, combining structured and unstructured contextual information. Among eight state-of-the-art models evaluated, Gemini 3 Pro (Thinking = High) achieves the highest performance (Pass@1 of 25%), owing to its ability to distinguish hypotheses from verified facts and to actively resolve uncertainty before acting.

📝 Abstract
We introduce the AI Productivity Index for Software Engineering (APEX-SWE), a benchmark for assessing whether frontier AI models can execute economically valuable software engineering work. Unlike existing evaluations that focus on narrow, well-defined tasks, APEX-SWE assesses two novel task types that reflect real-world software engineering work: (1) Integration tasks (n=100), which require constructing end-to-end systems across heterogeneous cloud primitives, business applications, and infrastructure-as-code services, and (2) Observability tasks (n=100), which require debugging production failures using telemetry signals such as logs and dashboards, as well as unstructured context. We evaluated eight frontier models on APEX-SWE. Gemini 3 Pro (Thinking = High) performs best, with a Pass@1 score of 25%. Our analysis shows that strong performance is primarily driven by epistemic reasoning, defined as the ability to distinguish between assumptions and verified facts, combined with agency to resolve uncertainty prior to acting. We open-source the APEX-SWE evaluation harness and a dev set (n=50).
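The abstract reports Pass@1 scores. The paper's own harness is not shown here, but Pass@1 is conventionally computed with the unbiased pass@k estimator of Chen et al. (2021); the sketch below assumes that convention, and the per-task sample counts are purely illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples drawn for a task,
    c = samples that passed, k = attempt budget."""
    if n - c < k:
        return 1.0  # too few failures for any k-subset to miss
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark-level Pass@1 is the mean of per-task pass@1.
# (n, c) pairs below are made-up illustrative numbers, not paper data.
tasks = [(4, 1), (4, 0), (4, 2), (4, 1)]
score = sum(pass_at_k(n, c, 1) for n, c in tasks) / len(tasks)
print(f"Pass@1 = {score:.2f}")  # → 0.25
```

For k=1 the estimator reduces to c/n per task, so averaging the fraction of passing samples over all tasks gives the same number.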
Problem

Research questions and friction points this paper is trying to address.

software engineering
AI evaluation
integration tasks
observability tasks
frontier AI models
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI Productivity Index
Integration Tasks
Observability Tasks
Epistemic Reasoning
Software Engineering Benchmark
Authors

Abhishek Kottamasu (Mercor)
Akul Datta (Mercor)
Aakash Barthwal (Mercor)
Chirag Mahapatra (Mercor)
Ajay Arun (Mercor)
Adarsh Hiremath (Mercor)
Brendan Foody (Mercor)
Bertie Vidgen (Oxford, Mercor)
Evals
MCP + RAG
Alignment + Safety
Content Moderation