TrajAudit: Automated Failure Diagnosis for Agentic Coding Systems

📅 2026-05-26
🤖 AI Summary
Existing automated diagnosis methods degrade significantly on the long, noisy execution trajectories produced by repository-level coding tasks. This work proposes TrajAudit, the first failure diagnosis framework tailored to such trajectories. It employs an investigator agent supported by an information-filtering module and a module that generates a preliminary diagnosis as prior knowledge. By combining pattern matching, test report analysis, and on-demand tool invocation, TrajAudit efficiently pinpoints the root causes of failures. The contributions include a noise-filtering mechanism based on pattern matching and keyword detection, a preliminary diagnosis derived from test failure reports, the release of RootSE (the most complex trajectory diagnosis benchmark to date), and empirical results showing localization accuracy over 24.4 percentage points higher than existing methods on RootSE, with token consumption reduced by at least 18%.
📝 Abstract
Agentic systems have been widely studied as a way to automate software engineering tasks such as bug fixing. As these systems increasingly tackle complex tasks, understanding where and why they fail becomes essential for iterative refinement and operational reliability. Existing automated failure diagnosis approaches leverage task execution trajectories, yet their effectiveness degrades substantially as trajectory length and complexity increase. For repository-level coding tasks specifically, trajectories are laden with noise, such as redundant program structure and verbose code context. Moreover, these trajectories are very long, and long-context reasoning remains a known weakness of LLMs. To address these two challenges, we propose TrajAudit, the first failure diagnosis framework for repository-level coding trajectories. TrajAudit employs an investigator agent supported by two modules: one filters failure-irrelevant information through pattern matching and keyword detection, and the other generates a preliminary diagnosis from test failure reports as prior knowledge, helping the agent handle noisy long contexts. The investigator agent can further invoke tools to retrieve filtered content on demand, ensuring that critical information is preserved while noise is minimized. We also introduce RootSE, a benchmark of 93 real-world agentic failure instances sourced from software maintenance tasks, representing the most complex trajectory diagnosis benchmark to date. Experiments on RootSE show that TrajAudit outperforms all existing baselines by over 24.4 percentage points in localization accuracy while reducing token consumption by at least 18%, demonstrating its practical effectiveness. We hope this work draws community attention to failure management in agentic software engineering and provides a foundational resource for future research.
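The filter-then-retrieve idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the keyword pattern, the `FilteredTrajectory` class, and the `filter_trajectory` and `retrieve` names are all hypothetical, and the abstract does not specify which patterns or tools TrajAudit actually uses. The sketch only shows how steps judged failure-irrelevant might be replaced by short placeholders while remaining recoverable on demand.

```python
import re
from dataclasses import dataclass, field

# Hypothetical failure-related keywords; the paper's actual patterns are not stated.
FAILURE_PATTERN = re.compile(r"error|exception|traceback|failed|assert", re.IGNORECASE)


@dataclass
class FilteredTrajectory:
    """Condensed view of a trajectory plus a stash of the filtered-out steps."""
    kept: list[str] = field(default_factory=list)
    stash: dict[int, str] = field(default_factory=dict)

    def retrieve(self, step_id: int) -> str:
        """Hypothetical tool call: return the full text of a filtered step."""
        return self.stash[step_id]


def filter_trajectory(steps: list[str]) -> FilteredTrajectory:
    """Keep steps that match the failure pattern; replace the rest with placeholders."""
    result = FilteredTrajectory()
    for i, step in enumerate(steps):
        if FAILURE_PATTERN.search(step):
            result.kept.append(f"[{i}] {step}")
        else:
            # Stash the original text so nothing is lost, and keep only a stub.
            result.stash[i] = step
            result.kept.append(f"[{i}] <filtered, {len(step)} chars>")
    return result
```

In this sketch an investigator would read `kept` as its condensed context and call `retrieve(step_id)` only when a placeholder looks relevant, which mirrors the abstract's claim that critical information is preserved while noise is minimized.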
Problem

Research questions and friction points this paper is trying to address.

failure diagnosis
agentic coding systems
execution trajectories
long-context reasoning
noise filtering
Innovation

Methods, ideas, or system contributions that make the work stand out.

failure diagnosis
agentic coding systems
trajectory filtering
long-context reasoning
RootSE benchmark