SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models

📅 2025-11-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks for evaluating LLM-based software engineering suffer from narrow task coverage, monolingual bias, and misalignment with real-world development workflows. To address these limitations, we propose SWE-Compass, a production-aligned, unified evaluation framework for LLM-based coding agents. SWE-Compass comprises 2,000 high-quality, real-world instances sourced from GitHub pull requests, covering eight task types, eight development scenarios, and ten programming languages, each curated through a systematic filtering and validation pipeline. We benchmark ten state-of-the-art models under two agentic frameworks, SWE-Agent and Claude Code, revealing fine-grained difficulty distributions across tasks, languages, and scenarios and demonstrating the benchmark's reliability, diagnostic utility, and robustness. By combining breadth, linguistic and contextual diversity, and realism, SWE-Compass establishes a reproducible, production-grounded standard for assessing the coding capabilities of LLM agents.
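To make the summary's dimensions concrete, here is a minimal sketch of how a single benchmark instance could be represented. The field names and types are assumptions for illustration only; they are not the schema released with SWE-Compass.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record layout for one benchmark instance; every field name here
# is an illustrative assumption, not the published SWE-Compass schema.
@dataclass
class BenchmarkInstance:
    instance_id: str          # unique identifier for the instance
    repo: str                 # GitHub repository the pull request came from
    task_type: str            # one of the 8 task types (e.g. bug fix, feature)
    scenario: str             # one of the 8 development scenarios
    language: str             # one of the 10 programming languages
    problem_statement: str    # natural-language task description given to the agent
    gold_patch: str           # reference diff extracted from the merged pull request
    test_commands: List[str] = field(default_factory=list)  # commands used to verify a candidate patch
```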

📝 Abstract
Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. Existing benchmarks often focus on algorithmic problems or Python-centric bug fixing, leaving critical dimensions of software engineering underexplored. To address these gaps, we introduce SWE-Compass, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from authentic GitHub pull requests and refined through systematic filtering and validation. We benchmark ten state-of-the-art LLMs under two agentic frameworks, SWE-Agent and Claude Code, revealing a clear hierarchy of difficulty across task types, languages, and scenarios. Moreover, by aligning evaluation with real-world developer practices, SWE-Compass provides a rigorous and reproducible foundation for diagnosing and advancing agentic coding capabilities in large language models.
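As a rough illustration of how such a benchmark is typically scored, the sketch below loops over instances, asks an agent harness for a patch, and aggregates resolve rates per task type and language, which is the kind of breakdown behind the reported difficulty hierarchy. The run_agent and run_tests callables are hypothetical placeholders, not the actual SWE-Compass, SWE-Agent, or Claude Code APIs.

```python
from collections import defaultdict

def evaluate(instances, run_agent, run_tests):
    """Hypothetical evaluation loop: resolve rate broken down by task type and language.

    `run_agent` stands in for an agentic harness such as SWE-Agent or Claude Code,
    and `run_tests` applies the instance's verification tests to a candidate patch;
    neither signature is taken from the SWE-Compass release.
    """
    resolved = defaultdict(int)
    total = defaultdict(int)
    for inst in instances:
        key = (inst.task_type, inst.language)
        total[key] += 1
        patch = run_agent(inst.repo, inst.problem_statement)   # agent produces a candidate patch
        if run_tests(inst, patch):                             # counts as resolved only if tests pass
            resolved[key] += 1
    # Per-cell resolve rate exposes the difficulty hierarchy across tasks and languages.
    return {key: resolved[key] / total[key] for key in total}
```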
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for software engineering faces narrow task coverage and language bias
Existing benchmarks underexplore critical software engineering dimensions beyond Python
Need comprehensive evaluation aligned with real-world developer workflows and practices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified benchmark for agentic coding evaluation
Structured framework spanning 8 task types, 8 programming scenarios, and 10 programming languages
Real-world GitHub pull-request data with systematic filtering and validation (a minimal filtering sketch follows below)
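A minimal sketch of the kind of filter such systematic validation could apply to PR-derived instances: keep an instance only if its reference patch applies cleanly and the associated tests then pass. The acceptance rule and the helper callables are assumptions made for illustration, not the paper's actual curation criteria.

```python
def validate_instances(instances, apply_patch, run_tests):
    """Keep only instances whose gold patch applies cleanly and passes the tests.

    `apply_patch(repo, patch)` and `run_tests(instance)` are hypothetical helpers
    standing in for whatever environment tooling the real pipeline uses; the
    acceptance rule itself is an illustrative assumption.
    """
    kept = []
    for inst in instances:
        if not apply_patch(inst.repo, inst.gold_patch):   # reference diff must apply to the checkout
            continue
        if not run_tests(inst):                           # tests must pass with the gold patch applied
            continue
        kept.append(inst)
    return kept
```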
Authors
Jingxuan Xu (Kuaishou Technology)
Ken Deng (Kwaipilot Team, Kuaishou Technology): LLM, AI4SE, AI Agent
Weihao Li (Research Fellow, Australian National University): Computer Vision, Machine Learning
Songwei Yu (Kuaishou Technology)
Huaixi Tang (Kuaishou Technology)
Haoyang Huang (JD Explore Academy (present) | StepFun | Microsoft Research): Multimodal & Multilingual Foundation Model
Zhiyi Lai (Kuaishou Technology)
Zizheng Zhan (Kuaishou Technology)
Yanan Wu (Kuaishou Technology)
Chenchen Zhang (Kuaishou Technology)
Kepeng Lei (Kuaishou Technology)
Yifan Yao (Drexel University)
Xinping Lei (Kuaishou Technology)
Wen-ya Zhu (Kuaishou Technology)
Zong-Xian Feng (Kuaishou Technology)
Han Li (Kuaishou Technology)
Junqi Xiong (Kuaishou Technology)
Dailin Li (Kuaishou Technology)
Zuchen Gao (PhD Candidate, The Hong Kong Polytechnic University)
Kun Wu (Kuaishou Technology)
Wen Xiang (Kuaishou Technology)
Ziqi Zhan (Kuaishou Technology)
Yuanxing Zhang (Kuaishou Technology): Recommender System, Large Language Model, Video Understanding
Wuxuan Gong (Kuaishou Technology)
Ziyuan Gao (Kuaishou Technology)
Guanxiang Wang (Kuaishou Technology)
Yirong Xue (Kuaishou Technology)
Mengtong Li (Kuaishou Technology)
Mengfei Xie (Kuaishou Technology)
Xiaojiang Zhang (Kuaishou Technology)
Jinghui Wang (Kuaishou Technology)
Wenhao Zhuang (Kuaishou Technology): Natural Language Processing
Zheng Lin (Kuaishou Technology)
Huiming Wang (Chongqing University of Posts and Telecommunications): Disturbance rejection control theory (such as active disturbance rejection control, sliding mode ...)
Zhaoxiang Zhang (Institute of Automation, Chinese Academy of Sciences): Computer Vision, Pattern Recognition, Biologically-inspired Learning
Yuqun Zhang (Kuaishou Technology)
Haotian Zhang (Kuaishou Technology)
Bin Chen (Kuaishou Technology)
Jiaheng Liu (Nanjing University)