🤖 AI Summary
Existing AI software engineering benchmarks (e.g., SWE-bench) evaluate only task accuracy, neglecting system effectiveness under resource constraints. Method: We propose SWE-Effi, a multi-dimensional evaluation framework that jointly models token consumption, response latency, and task success rate, enabling the first systematic re-evaluation of mainstream AI agent systems under resource-limited conditions. Contribution/Results: Our analysis uncovers two prevalent inefficiencies: the "token snowball" effect (runaway token growth from unproductive loops) and "expensive failures" (high-cost yet unsuccessful attempts at unsolvable tasks). We show that the co-design of agent scaffold and base model critically impacts resource efficiency. Empirical evaluation on a SWE-bench subset reveals substantial misalignment between accuracy-based and efficiency-based rankings, as well as a clear trade-off between effectiveness under token budgets and under time budgets. This work establishes a quantifiable benchmark and diagnostic toolkit for reducing RL training costs and improving the deployability of AI engineering systems.
📝 Abstract
The advancement of large language models (LLMs) and code agents has demonstrated significant potential to assist with software engineering (SWE) tasks, such as autonomous issue resolution and feature addition. Existing AI-for-software-engineering leaderboards (e.g., SWE-bench) focus solely on solution accuracy, ignoring the crucial factor of effectiveness in a resource-constrained world. This problem extends beyond software engineering: any AI system should be more than correct; it must also be cost-effective. To address this gap, we introduce SWE-Effi, a set of new metrics that re-evaluate AI systems in terms of holistic effectiveness scores. We define effectiveness as the balance between the accuracy of the outcome (e.g., issue resolve rate) and the resources consumed (e.g., tokens and time). In this paper, we focus on the software engineering scenario by re-ranking popular AI systems for issue resolution on a subset of the SWE-bench benchmark using our new multi-dimensional metrics. We find that an AI system's effectiveness depends not just on the scaffold itself, but on how well it integrates with the base model, which is key to achieving strong performance in a resource-efficient manner. We also identify systematic challenges such as the "token snowball" effect and, more significantly, a pattern of "expensive failures", in which agents consume excessive resources while stuck on unsolvable tasks. This issue not only limits practical deployment but also drives up the cost of failed rollouts during RL training. Lastly, we observe a clear trade-off between effectiveness under a token budget and effectiveness under a time budget, which plays a crucial role in managing project budgets and enabling scalable reinforcement learning, where fast responses are essential.
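To make the "accuracy under a resource budget" idea concrete, here is a minimal sketch of one plausible way to score effectiveness: for each budget on a grid, compute the fraction of tasks resolved within that budget, then average across the grid. The function name, the budget grid, and the run format are illustrative assumptions, not the paper's actual SWE-Effi formulas; the same scheme applies to a time budget by swapping tokens for wall-clock seconds.

```python
# Illustrative effectiveness-under-budget score (assumed formulation, not the
# official SWE-Effi metric): average resolve rate over a grid of token budgets.

def effectiveness_under_budget(runs, budgets):
    """runs: list of (resolved: bool, tokens_used: int) per task attempt.
    budgets: iterable of token budgets to evaluate at.
    Returns the mean fraction of tasks resolved within each budget."""
    scores = []
    for b in budgets:
        solved = sum(1 for resolved, tokens in runs if resolved and tokens <= b)
        scores.append(solved / len(runs))
    # Averaging over budgets approximates an area under the budget-accuracy curve.
    return sum(scores) / len(scores)

# Toy example: one cheap success, one costly success, one "expensive failure".
runs = [(True, 5_000), (True, 60_000), (False, 120_000)]
score = effectiveness_under_budget(runs, budgets=[10_000, 50_000, 100_000])
print(round(score, 4))  # → 0.4444
```

Under this scoring, the "expensive failure" at 120k tokens contributes nothing at any budget, while the costly success only counts once the budget is large enough, so cheap correct solutions dominate the ranking.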