🤖 AI Summary
This study addresses the interpretability of large language model (LLM)-based code agents’ behavior in complex software engineering tasks, moving beyond aggregate success rates to examine fine-grained execution trajectories for both successes and failures.
Method: We propose a trajectory-based behavioral understanding framework, integrating qualitative and quantitative analyses, and conduct an empirical comparative study of three prominent agents—OpenHands, SWE-agent, and Prometheus—on the SWE-Bench benchmark.
Contribution/Results: We find that (1) defensive programming and proactive context gathering serve as critical success strategies; (2) fault localization accuracy (72–81% at the file level, even in failures) is not decisive on its own; instead, approximate correctness of code modifications proves more discriminative; and (3) failure trajectories are significantly longer and more volatile, with failure patterns specific to each agent. These findings provide novel empirical insights and a conceptual foundation for developing robust, interpretable autonomous programming systems.
📝 Abstract
The increasing deployment of Large Language Model (LLM) agents for complex software engineering tasks has created a need to understand their problem-solving behaviours beyond simple success metrics. While these agents demonstrate impressive capabilities in automated issue resolution, their decision-making processes remain largely opaque. This paper presents an empirical study of agent trajectories, namely the execution traces capturing the steps agents take when attempting to resolve software issues. We analyse trajectories from three state-of-the-art code agents (OpenHands, SWE-agent, and Prometheus) on the SWE-Bench benchmark, examining both successful and failed attempts. Our investigation reveals several key insights into agent behaviour. First, we identify how distinct problem-solving strategies, such as defensive programming and context gathering, enable success in different scenarios. Second, we find that failed trajectories are consistently longer and exhibit higher variance than successful ones, with failure patterns differing significantly between agents. Third, our fault localisation analysis shows that while most trajectories correctly identify problematic files (72–81% even in failures), success depends more on achieving approximate rather than exact code modifications. These and other findings unveiled by our study provide a foundation for understanding agent behaviour through trajectory analysis, contributing to the development of more robust and interpretable autonomous software engineering systems.
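The paper does not specify how its file-level fault localisation metric is computed, but the idea can be sketched as follows: extract the files touched by the agent's patch and by the gold patch, and count a trajectory as correctly localised if the two sets overlap. The function names and the diff-parsing heuristic below are illustrative assumptions, not the authors' implementation.

```python
import re

def patched_files(diff_text):
    """Extract the set of file paths modified by a unified diff.

    Assumes git-style '+++ b/path/to/file' headers name the
    post-change files (an assumption, not the paper's method).
    """
    return {
        m.group(1)
        for m in re.finditer(r"^\+\+\+ b/(\S+)", diff_text, re.MULTILINE)
    }

def file_level_hit(agent_diff, gold_diff):
    """True if the agent touched at least one file from the gold patch."""
    return bool(patched_files(agent_diff) & patched_files(gold_diff))

def file_localisation_rate(pairs):
    """Fraction of (agent_diff, gold_diff) pairs with a file-level hit."""
    hits = sum(file_level_hit(a, g) for a, g in pairs)
    return hits / len(pairs) if pairs else 0.0

# Toy example with hand-written patches (hypothetical paths)
gold = "--- a/pkg/core.py\n+++ b/pkg/core.py\n@@ -1 +1 @@\n-x\n+y\n"
right = "--- a/pkg/core.py\n+++ b/pkg/core.py\n@@ -2 +2 @@\n-a\n+b\n"
wrong = "--- a/pkg/util.py\n+++ b/pkg/util.py\n@@ -1 +1 @@\n-x\n+y\n"

print(file_localisation_rate([(right, gold), (wrong, gold)]))  # 0.5
```

Run over a benchmark split, this rate corresponds to the 72–81% file-level figure the study reports; a stricter variant could require set equality rather than overlap.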