AI Summary
Current evaluations of large language model (LLM) agents focus predominantly on task success or failure and often overlook redundant steps in the execution process, which cause inefficiency and waste resources. This work formally defines the problem of "redundant step detection" for the first time, establishing it as a novel research direction, and introduces RedundancyBench, the first dedicated benchmark comprising diverse tasks and agent trajectories with fine-grained, step-level annotations. Systematic evaluation using LLM-based trajectory analysis, human annotations, and three representative methods reveals that even the best-performing approach achieves a score of only 24.88%, with some methods performing worse than random guessing. These findings underscore how difficult redundancy detection is and lay the groundwork for systematic assessment of agent execution efficiency.
Abstract
LLM-based agents have demonstrated strong capabilities in solving complex tasks through multi-step reasoning and tool use. However, existing evaluation protocols primarily focus on task success, overlooking a critical aspect of agent behavior: execution efficiency. In practice, agent trajectories often contain redundant steps that consume substantial resources while contributing little to task completion. In this work, we propose and formulate a new research area: \textbf{redundant step detection} for agent trajectories. To support this initiative, we introduce \textbf{RedundancyBench}, a new benchmark that contains diverse tasks with carefully annotated trajectories, where each step is labeled according to its contribution to task completion. Using RedundancyBench, we develop and evaluate three representative methods that determine whether a step within a trajectory is redundant or necessary. Our results show that even the best-performing method achieves a score of only 24.88\% in detecting redundant steps, while some methods perform worse than random guessing. These results highlight the task's complexity and the need for further research in this area. \footnote{The code and dataset for this paper are available at \href{https://anonymous.4open.science/r/RedundancyBench}{https://anonymous.4open.science/r/RedundancyBench}.}