Revisiting the Travel Planning Capabilities of Large Language Models

📅 2026-05-04

🤖 AI Summary
This work addresses two problems: large language models (LLMs) perform poorly on long-horizon reasoning tasks such as travel planning, and existing end-to-end evaluations give little insight into why they fail. The authors decompose travel planning into five atomic sub-capabilities: constraint extraction, tool use, plan generation, error identification, and error correction. They evaluate each sub-capability in isolation by supplying oracle intermediate contexts, so that errors from earlier stages do not cascade into later ones. The results show that LLMs extract explicit constraints well but struggle to infer implicit, open-world user requirements, exhibit structural biases in plan generation, and self-correct ineffectively, showing both excessive sensitivity and persistence in errors. These findings point to specific directions for improving LLM reasoning and planning.
📝 Abstract
Travel planning serves as a critical task for long-horizon reasoning, exposing significant deficits in LLMs. However, existing benchmarks and evaluations primarily assess final plans in an end-to-end manner, which lacks interpretability and makes it difficult to analyze the root causes of failures. To bridge this gap, we decompose travel planning into five constituent atomic sub-capabilities: Constraint Extraction, Tool Use, Plan Generation, Error Identification, and Error Correction. We implement a decoupled evaluation protocol leveraging oracle intermediate contexts to rigorously isolate these components, thereby measuring the atomic performance boundary without the noise of cascading errors. Our results highlight a clear contrast in performance: while LLMs are proficient in extracting explicit constraints, they struggle to infer implicit, open-world requirements. Furthermore, they exhibit structural biases in plan generation and suffer from ineffective self-correction, characterized by excessive sensitivity and erroneous persistence. These findings offer precise directions for improving LLM reasoning and planning abilities.
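The idea of a decoupled evaluation with oracle intermediate contexts can be illustrated with a small sketch. This is not the authors' implementation. The stage names follow the abstract, but everything else is an assumption made for illustration: the `Example` record, the `oracle_context` helper, exact-match scoring, and the simplification that each stage's oracle input is just the gold output of the preceding stage. The end-to-end function is included only to contrast how one early failure propagates.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# The five sub-capabilities named in the abstract, in pipeline order.
STAGES = [
    "constraint_extraction",
    "tool_use",
    "plan_generation",
    "error_identification",
    "error_correction",
]

StageFn = Callable[[str], str]  # hypothetical interface: context in, output out


@dataclass
class Example:
    query: str
    gold: Dict[str, str]  # hypothetical: one reference output per stage


def oracle_context(example: Example, stage_index: int) -> str:
    """Input for a stage: the query for the first stage, otherwise the gold
    output of the preceding stage (never the model's own output)."""
    if stage_index == 0:
        return example.query
    return example.gold[STAGES[stage_index - 1]]


def evaluate_decoupled(examples: List[Example], model: Dict[str, StageFn]) -> Dict[str, float]:
    """Score each stage in isolation on oracle inputs (exact match, an assumption)."""
    scores = {}
    for i, stage in enumerate(STAGES):
        correct = sum(
            model[stage](oracle_context(ex, i)) == ex.gold[stage] for ex in examples
        )
        scores[stage] = correct / len(examples)
    return scores


def evaluate_end_to_end(examples: List[Example], model: Dict[str, StageFn]) -> float:
    """Chain the stages on the model's own outputs and score only the final one."""
    correct = 0
    for ex in examples:
        ctx = ex.query
        for stage in STAGES:
            ctx = model[stage](ctx)
        correct += ctx == ex.gold[STAGES[-1]]
    return correct / len(examples)


def make_lookup(table: Dict[str, str]) -> StageFn:
    """Toy stage: answers correctly only when it sees the expected input."""
    return lambda ctx: table.get(ctx, "<unknown>")


# Toy data: every stage is perfect on oracle input, except tool use, which always fails.
example = Example(
    query="3 days in Kyoto, budget $900",
    gold={
        "constraint_extraction": "days=3; city=Kyoto; budget=900",
        "tool_use": "hotel and attraction search results",
        "plan_generation": "draft plan",
        "error_identification": "day 2 exceeds budget",
        "error_correction": "revised plan",
    },
)
toy_model = {
    stage: make_lookup({oracle_context(example, i): example.gold[stage]})
    for i, stage in enumerate(STAGES)
}
toy_model["tool_use"] = lambda ctx: "wrong tool call"

scores = evaluate_decoupled([example], toy_model)
e2e = evaluate_end_to_end([example], toy_model)
```

In this toy setup the end-to-end score is 0.0 and says nothing about where the failure lies, while the per-stage scores single out `tool_use` as the only failing component. A real protocol would likely pass richer contexts (for example the query together with earlier gold outputs) and use task-specific metrics rather than exact match.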
Problem

Research questions and friction points this paper is trying to address.

travel planning
large language models
evaluation benchmark
long-horizon reasoning
interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

atomic sub-capabilities
decoupled evaluation
oracle intermediate contexts
cascading error isolation
LLM reasoning analysis
Bo-Wen Zhang
State Key Laboratory of Novel Software Technology, Nanjing University, China; School of Intelligence Science and Technology, Nanjing University, China
Jin Ye
State Key Laboratory of Novel Software Technology, Nanjing University, China; School of Intelligence Science and Technology, Nanjing University, China
Peng-Yu Hua
State Key Laboratory of Novel Software Technology, Nanjing University, China; School of Intelligence Science and Technology, Nanjing University, China
Jia-Wei Cao
State Key Laboratory of Novel Software Technology, Nanjing University, China; School of Intelligence Science and Technology, Nanjing University, China
Jie-Jing Shao
Nanjing University
Machine Learning, Neuro-Symbolic Learning, Reinforcement Learning
Yu-Feng Li
Professor, Nanjing University
Machine Learning
Lan-Zhe Guo
LAMDA Group, Nanjing University
Machine Learning