🤖 AI Summary
Prior research predominantly examines isolated tasks and lacks a systematic evaluation of the holistic capabilities of general-purpose intelligent agents in software engineering.
Method: This paper conducts the first cross-task, multi-dimensional empirical comparison of seven state-of-the-art agent frameworks—including AgentOrchestra, OpenHands, and GPTswarm—across three code-centric tasks: software development, vulnerability detection, and program repair. We establish a unified analytical framework assessing effectiveness (task success rate), efficiency (trajectory length, edit count), and cost (token consumption), using standardized pipelines and established benchmarks.
Contribution/Results: Our evaluation reveals significant architectural trade-offs: AgentOrchestra incurs high coordination overhead but excels in reflective reasoning; GPTswarm achieves the lowest token cost; and software development exhibits moderate effectiveness yet the highest resource cost. The study uncovers deep correlations between agent architecture design and real-world performance, providing empirically grounded guidance for designing and selecting efficient software engineering agents.
📝 Abstract
Unlike traditional automation tools or static LLM-based systems, agents combine decision-making and tool utilization to accomplish complex tasks, showing great potential in software engineering. However, existing studies largely focus on specific tasks or isolated aspects, providing an incomplete picture of agents' practical capabilities. To address this, we conduct a comprehensive empirical study evaluating seven general-purpose agent frameworks across three representative code-centric tasks: software development, vulnerability detection, and program repair. Each task is assessed using standard, widely adopted benchmarks to ensure objective and comparable evaluation. Agent performance is systematically analyzed from three complementary perspectives: effectiveness (task success), efficiency (execution process), and overhead (token consumption). Our findings reveal distinct capability patterns and trade-offs among the evaluated frameworks. In terms of effectiveness, agents achieve moderate overall performance. Regarding efficiency, AgentOrchestra tends to exhibit the longest trajectories and the most correction attempts due to coordination overhead, whereas OpenHands demonstrates stronger reflective reasoning abilities. For overhead, software development incurs the highest monetary cost, while GPTswarm remains the most cost-efficient. Furthermore, we conduct an in-depth cross-analysis of the relationship between effectiveness and efficiency, exploring the underlying reasons behind their interplay. These findings guide both practical adoption and future research toward more efficient software engineering agents.