TRAIL: Trace Reasoning and Agentic Issue Localization

📅 2025-05-13
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Agentic workflow evaluation suffers from poor scalability and difficult error attribution: existing methods rely on manual analysis of lengthy trajectories, which is inadequate for complex, heterogeneous agent outputs. This paper introduces TRAIL, a large-scale, human-annotated dataset of 148 real-world, multi-scenario long trajectories, together with a structured error taxonomy covering both single- and multi-agent settings in software engineering and open-domain information retrieval. The resulting expert-validated trace-debugging benchmark is grounded in established agentic benchmarks such as SWE-Bench. Experiments show that even state-of-the-art long-context models perform poorly on trajectory-level debugging, with the best model (Gemini-2.5-pro) scoring only 11%. The dataset, annotation guidelines, and code are publicly released to advance trustworthy evaluation of agent systems.

πŸ“ Abstract
The increasing adoption of agentic workflows across diverse domains brings a critical need to scalably and systematically evaluate the complex traces these systems generate. Current evaluation methods depend on manual, domain-specific human analysis of lengthy workflow traces, an approach that does not scale with the growing complexity and volume of agentic outputs. Error analysis in these settings is further complicated by the interplay of external tool outputs and language model reasoning, making it more challenging than traditional software debugging. In this work, we (1) articulate the need for robust and dynamic evaluation methods for agentic workflow traces, (2) introduce a formal taxonomy of error types encountered in agentic systems, and (3) present a set of 148 large human-annotated traces (TRAIL) constructed using this taxonomy and grounded in established agentic benchmarks. To ensure ecological validity, we curate traces from both single- and multi-agent systems, focusing on real-world applications such as software engineering and open-world information retrieval. Our evaluations reveal that modern long-context LLMs perform poorly at trace debugging, with the best model, Gemini-2.5-pro, scoring a mere 11% on TRAIL. Our dataset and code are made publicly available to support and accelerate future research in scalable evaluation for agentic workflows.
Problem

Research questions and friction points this paper is trying to address.

Scalable evaluation of complex agentic workflow traces
Dynamic error analysis in agentic systems
Poor performance of LLMs in trace debugging
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces taxonomy for agentic system errors
Provides 148 human-annotated traces (TRAIL)
Demonstrates that state-of-the-art long-context LLMs perform poorly on trace debugging
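To make the trace-debugging task concrete, here is a minimal sketch of how predicted error annotations might be scored against TRAIL-style human labels. This is an illustration only, not the paper's official evaluation code: the field names (`span_id`, `category`), the example error categories, and the joint location-plus-category matching rule are all assumptions for the sake of the example.

```python
def score_trace(predicted, gold):
    """Fraction of gold-annotated errors for which the model identified
    both the correct location (span) and the correct error category."""
    pred_set = {(p["span_id"], p["category"]) for p in predicted}
    hits = sum(1 for g in gold if (g["span_id"], g["category"]) in pred_set)
    return hits / len(gold) if gold else 1.0

# Hypothetical annotations for one trace: two human-labeled errors,
# of which the model only located and categorized the first.
gold = [
    {"span_id": "llm_call_3", "category": "hallucinated_tool_output"},
    {"span_id": "tool_call_7", "category": "api_rate_limit"},
]
predicted = [
    {"span_id": "llm_call_3", "category": "hallucinated_tool_output"},
]
print(score_trace(predicted, gold))  # 0.5
```

Averaging such per-trace scores over long, heterogeneous trajectories is what makes the benchmark hard: a model must both localize the faulty step and assign the right taxonomy category.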