🤖 AI Summary
This study addresses the interpretability of large language model (LLM)-based code agents’ behavior in complex software engineering tasks, moving beyond aggregate success rates to examine fine-grained execution trajectories for both successes and failures.
Method: We propose a trajectory-based behavioral understanding framework, integrating qualitative and quantitative analyses, and conduct an empirical comparative study of three prominent agents—OpenHands, SWE-agent, and Prometheus—on the SWE-Bench benchmark.
Contribution/Results: We find that (1) defensive programming and proactive context gathering serve as critical success strategies; (2) fault localization accuracy (72–81% at the file level, even in failures) is not decisive on its own; instead, approximate correctness of code modifications proves more discriminative; and (3) failure trajectories are significantly longer and more volatile, with failure patterns specific to each agent. These findings provide novel empirical insights and a conceptual foundation for developing robust, interpretable autonomous programming systems.
📝 Abstract
The increasing deployment of Large Language Model (LLM) agents for complex software engineering tasks has created a need to understand their problem-solving behaviours beyond simple success metrics. While these agents demonstrate impressive capabilities in automated issue resolution, their decision-making processes remain largely opaque. This paper presents an empirical study of agent trajectories, namely the execution traces capturing the steps agents take when attempting to resolve software issues. We analyse trajectories from three state-of-the-art code agents (OpenHands, SWE-agent, and Prometheus) on the SWE-Bench benchmark, examining both successful and failed attempts. Our investigation reveals several key insights into agent behaviour. First, we identify how distinct problem-solving strategies, such as defensive programming and context gathering, enable success in different scenarios. Second, we find that failed trajectories are consistently longer and exhibit higher variance than successful ones, with failure patterns differing significantly between agents. Third, our fault localisation analysis shows that while most trajectories correctly identify problematic files (72–81% even in failures), success depends more on achieving approximate rather than exact code modifications. These and other findings unveiled by our study provide a foundation for understanding agent behaviour through trajectory analysis, contributing to the development of more robust and interpretable autonomous software engineering systems.
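The paper does not specify how its file-level fault localisation metric is computed, but the idea can be sketched as follows: extract the files touched by the agent's patch and by the gold patch, and count a trajectory as correctly localised if the two sets overlap. The function names and the diff-parsing heuristic below are illustrative assumptions, not the authors' implementation.

```python
import re

def patched_files(diff_text):
    """Extract the set of file paths modified by a unified diff.

    Assumes git-style '+++ b/path/to/file' headers name the
    post-change files (an assumption, not the paper's method).
    """
    return {
        m.group(1)
        for m in re.finditer(r"^\+\+\+ b/(\S+)", diff_text, re.MULTILINE)
    }

def file_level_hit(agent_diff, gold_diff):
    """True if the agent touched at least one file from the gold patch."""
    return bool(patched_files(agent_diff) & patched_files(gold_diff))

def file_localisation_rate(pairs):
    """Fraction of (agent_diff, gold_diff) pairs with a file-level hit."""
    hits = sum(file_level_hit(a, g) for a, g in pairs)
    return hits / len(pairs) if pairs else 0.0

# Toy example with hand-written patches (hypothetical paths)
gold = "--- a/pkg/core.py\n+++ b/pkg/core.py\n@@ -1 +1 @@\n-x\n+y\n"
right = "--- a/pkg/core.py\n+++ b/pkg/core.py\n@@ -2 +2 @@\n-a\n+b\n"
wrong = "--- a/pkg/util.py\n+++ b/pkg/util.py\n@@ -1 +1 @@\n-x\n+y\n"

print(file_localisation_rate([(right, gold), (wrong, gold)]))  # 0.5
```

Run over a benchmark split, this rate corresponds to the 72–81% file-level figure the study reports; a stricter variant could require set equality rather than overlap.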