DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints

📅 2026-01-26
🤖 AI Summary
Existing agent evaluation benchmarks primarily focus on localized reasoning, failing to assess capabilities essential in real-world scenarios, such as long-term planning, proactive information gathering, and the joint optimization of global and local constraints. To address this gap, this work introduces DeepPlanning, an agent planning benchmark tailored for long-horizon tasks. It features multi-day travel and multi-item shopping scenarios that require agents to satisfy global constraints (e.g., time and budget) while handling fine-grained local constraints and actively acquiring necessary information. Experimental results reveal that current state-of-the-art agents perform poorly on this benchmark, highlighting the critical role of explicit reasoning and parallel tool use in advancing planning performance and offering a new direction for future research.

📝 Abstract
While agent evaluation has shifted toward long-horizon tasks, most benchmarks still emphasize local, step-level reasoning rather than the global constrained optimization (e.g., time and financial budgets) that demands genuine planning ability. Meanwhile, existing LLM planning benchmarks underrepresent the active information gathering and fine-grained local constraints typical of real-world settings. To address this, we introduce DeepPlanning, a challenging benchmark for practical long-horizon agent planning. It features multi-day travel planning and multi-product shopping tasks that require proactive information acquisition, local constrained reasoning, and global constrained optimization. Evaluations on DeepPlanning show that even frontier agentic LLMs struggle with these problems, highlighting the importance of reliable explicit reasoning patterns and parallel tool use for achieving better effectiveness-efficiency trade-offs. Error analysis further points to promising directions for improving agentic LLMs over long planning horizons. We open-source the code and data to support future research.
Problem

Research questions and friction points this paper is trying to address.

long-horizon planning
constrained optimization
agentic reasoning
information gathering
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

long-horizon planning
constrained optimization
agentic reasoning
active information gathering
parallel tool use
Yinger Zhang
Qwen Team, Alibaba Group
Shutong Jiang
Qwen Team, Alibaba Group
Renhao Li
Qwen Team, Alibaba Group
Jianhong Tu
Qwen Team, Alibaba Group
Yang Su
King's College London
Lianghao Deng
Qwen Team, Alibaba Group
Xudong Guo
Qwen Team, Alibaba Group
Chenxu Lv
Qwen Team, Alibaba Group
Junyang Lin
Qwen Team, Alibaba Group & Peking University
Natural Language Processing
Cross-Modal Representation Learning
Pretraining