🤖 AI Summary
Existing benchmarks for evaluating LLMs on software engineering suffer from narrow task coverage, monolingual bias, and misalignment with real-world development workflows. To address these limitations, the authors propose SWE-Compass, a unified, production-aligned evaluation framework for LLM-based coding agents. SWE-Compass comprises 2,000 high-quality, real-world instances sourced from GitHub pull requests, covering eight task types, eight programming scenarios, and ten programming languages, curated through a systematic filtering and validation pipeline to ensure correctness and relevance. The authors benchmark ten state-of-the-art models under two agentic frameworks, SWE-Agent and Claude Code, revealing fine-grained difficulty hierarchies across tasks, languages, and scenarios. By aligning evaluation with real-world developer practices, SWE-Compass advances beyond prior benchmarks in breadth, linguistic and contextual diversity, and realism, establishing a rigorous, reproducible standard for diagnosing and advancing the agentic coding capabilities of LLMs.
📝 Abstract
Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. Existing benchmarks often focus on algorithmic problems or Python-centric bug fixing, leaving critical dimensions of software engineering underexplored. To address these gaps, we introduce SWE-Compass, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2,000 high-quality instances curated from authentic GitHub pull requests and refined through systematic filtering and validation. We benchmark ten state-of-the-art LLMs under two agentic frameworks, SWE-Agent and Claude Code, revealing a clear hierarchy of difficulty across task types, languages, and scenarios. Moreover, by aligning evaluation with real-world developer practices, SWE-Compass provides a rigorous and reproducible foundation for diagnosing and advancing agentic coding capabilities in large language models.