🤖 AI Summary
This work investigates the robust planning and recovery capabilities of language model agents when external functions fail unexpectedly. To this end, we introduce the first benchmark specifically designed to evaluate agents' ability to recover from external failures: a dynamic task environment comprising more than 4,000 functions, in which every task remains solvable despite the injected interference. Agents must leverage real-time environmental feedback (success and error signals) to search for alternative execution paths within constrained search spaces; we systematically assess both open-source and commercial large language models under this paradigm. Experimental results reveal that current agents struggle to exploit runtime feedback for adaptation: scaling invocation budgets yields only marginal improvements in fallback generation. Our core contribution is this standardized evaluation framework for external-failure recovery, which empirically exposes a fundamental limitation in existing agents' capacity for dynamic plan revision.
📝 Abstract
As language model agents are applied to real-world problems of increasing complexity, they will be expected to formulate plans across large search spaces. If those plans fail for reasons beyond their control, how well do language agents search for alternative ways to achieve their goals? We devise a specialized agentic planning benchmark to study this question. Each planning problem is solved via combinations of function calls. The agent searches for relevant functions from a set of over four thousand possibilities, and observes environmental feedback in the form of function outputs or error messages. Our benchmark confronts the agent with external failures in its workflow, such as functions that suddenly become unavailable. At the same time, even with the introduction of these failures, we guarantee that the task remains solvable. Ideally, an agent's performance on the planning task should not be affected by the presence of external failures. Overall, we find that language agents struggle to formulate and execute backup plans in response to environmental feedback. While state-of-the-art models are often able to identify the correct function to use in the right context, they struggle to adapt to feedback from the environment and often fail to pursue alternate courses of action, even when the search space is artificially restricted. We provide a systematic analysis of the failures of both open-source and commercial models, examining the effects of search space size as well as the benefits of scaling model size in our setting. Our analysis identifies key challenges for current generative models as well as promising directions for future work.
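The interaction pattern the abstract describes, an agent calling functions, receiving outputs or error messages as feedback, and needing a fallback when a function is externally disabled, can be illustrated with a minimal sketch. All names and structure below are illustrative assumptions, not the benchmark's actual implementation:

```python
# Hypothetical sketch of the benchmark's interaction loop. The class and
# function names are assumptions for illustration, not the paper's API.

class ToolEnvironment:
    """Registry of callable functions; some may be disabled to simulate
    external failures, while a working alternative remains available."""

    def __init__(self, tools, disabled=()):
        self.tools = tools              # name -> callable
        self.disabled = set(disabled)   # externally failed functions

    def call(self, name, *args):
        # Environmental feedback: either the function's output or an error.
        if name not in self.tools:
            return {"ok": False, "error": f"unknown function: {name}"}
        if name in self.disabled:
            return {"ok": False, "error": f"{name} is currently unavailable"}
        return {"ok": True, "output": self.tools[name](*args)}


def run_with_fallbacks(env, candidates):
    """Try candidate function calls in order until one succeeds.
    A real agent would generate fallbacks from model outputs after
    observing each error message, rather than follow a fixed list."""
    trace = []
    for name, args in candidates:
        result = env.call(name, *args)
        trace.append((name, result))
        if result["ok"]:
            return result["output"], trace
    return None, trace


# Toy setup: the primary function is externally disabled (injected
# failure), but the task stays solvable via an alternative function.
env = ToolEnvironment(
    tools={
        "convert_usd_eur": lambda x: x * 0.9,
        "convert_via_rates_api": lambda x: x * 0.9,
    },
    disabled={"convert_usd_eur"},
)
output, trace = run_with_fallbacks(
    env,
    [("convert_usd_eur", (100,)), ("convert_via_rates_api", (100,))],
)
```

In this toy run, the first call returns an error message and the second succeeds, so an ideal agent's task performance is unchanged by the injected failure; the benchmark measures how far real models fall short of that ideal.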