🤖 AI Summary
Existing agent evaluation benchmarks focus primarily on localized, step-level reasoning and fail to assess capabilities essential in real-world scenarios, such as long-term planning, proactive information gathering, and the joint optimization of global and local constraints. To address this gap, this work introduces DeepPlanning, an agent planning benchmark tailored for long-horizon tasks. It features multi-day travel planning and multi-product shopping scenarios that require agents to satisfy global constraints (e.g., time and budget) while handling fine-grained local constraints and actively acquiring necessary information. Evaluations show that even frontier state-of-the-art agents perform poorly on this benchmark, and that reliable explicit reasoning and parallel tool use are key to better effectiveness-efficiency trade-offs; an accompanying error analysis points to promising directions for future research on long-horizon planning.
📝 Abstract
While agent evaluation has shifted toward long-horizon tasks, most benchmarks still emphasize local, step-level reasoning rather than the global constrained optimization (e.g., time and financial budgets) that demands genuine planning ability. Meanwhile, existing LLM planning benchmarks underrepresent the active information gathering and fine-grained local constraints typical of real-world settings. To address this, we introduce DeepPlanning, a challenging benchmark for practical long-horizon agent planning. It features multi-day travel planning and multi-product shopping tasks that require proactive information acquisition, local constrained reasoning, and global constrained optimization. Evaluations on DeepPlanning show that even frontier agentic LLMs struggle with these problems, highlighting the importance of reliable explicit reasoning patterns and parallel tool use for achieving better effectiveness-efficiency trade-offs. Error analysis further points to promising directions for improving agentic LLMs over long planning horizons. We open-source the code and data to support future research.