🤖 AI Summary
Existing benchmarks for evaluating LLMs on software engineering suffer from narrow task coverage, monolingual bias, and misalignment with real-world development workflows. To address these limitations, the authors propose SWE-Compass, a unified, production-aligned evaluation framework for LLM-based coding agents. SWE-Compass comprises 2,000 high-quality, real-world instances sourced from GitHub pull requests, covering eight task types, eight programming scenarios, and ten programming languages, curated through a systematic filtering and validation pipeline to ensure correctness and relevance. The authors benchmark ten state-of-the-art models under two agentic frameworks, SWE-Agent and Claude Code, revealing fine-grained difficulty hierarchies across tasks, languages, and scenarios. By aligning evaluation with real-world developer practices, SWE-Compass advances beyond prior benchmarks in breadth, linguistic and contextual diversity, and realism, establishing a rigorous, reproducible standard for diagnosing and advancing the agentic coding capabilities of LLMs.
📝 Abstract
Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. Existing benchmarks often focus on algorithmic problems or Python-centric bug fixing, leaving critical dimensions of software engineering underexplored. To address these gaps, we introduce SWE-Compass, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2,000 high-quality instances curated from authentic GitHub pull requests and refined through systematic filtering and validation. We benchmark ten state-of-the-art LLMs under two agentic frameworks, SWE-Agent and Claude Code, revealing a clear hierarchy of difficulty across task types, languages, and scenarios. Moreover, by aligning evaluation with real-world developer practices, SWE-Compass provides a rigorous and reproducible foundation for diagnosing and advancing agentic coding capabilities in large language models.