Can Large Language Models be Good Path Planners? A Benchmark and Investigation on Spatial-temporal Reasoning

📅 2023-10-05
🏛️ arXiv.org
📈 Citations: 19
Influential: 1
🤖 AI Summary
Large language models (LLMs) exhibit significant limitations in spatio-temporal joint reasoning—particularly for path planning—yet no natural-language-driven benchmark exists to systematically evaluate such capabilities. Method: We introduce PPNL, the first natural-language-based path planning benchmark, designed to assess LLMs' performance in obstacle avoidance, constraint satisfaction, and spatio-temporal navigation in complex environments. We propose an interleaved "reasoning–action" prompting paradigm and conduct full-parameter fine-tuning of BART/T5 using structured environment modeling and task chain decomposition. Contribution/Results: Our analysis reveals that GPT-4 possesses strong short-horizon spatial reasoning but struggles severely with long-horizon planning. Fine-tuned models achieve high in-distribution accuracy but suffer sharp generalization degradation in larger or more complex environments. PPNL provides a reproducible evaluation framework and methodological insights for advancing LLMs' spatio-temporal reasoning capabilities.
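The interleaved "reasoning–action" paradigm the summary mentions can be sketched as a loop that alternates a model "thought" with a single environment step. This is an illustrative sketch, not the paper's prompt format: `query_llm` is a hypothetical stand-in (here a hard-coded stub) for any LLM call, and the move vocabulary is assumed.

```python
# Hypothetical sketch of an interleaved reason-act loop; `query_llm` is a
# stub standing in for a real LLM call, NOT an API from the paper.

def query_llm(prompt):
    # Stub: always heads right, and declares "stop" once at the goal.
    if "position (0, 2)" in prompt and "goal (0, 2)" in prompt:
        return "Thought: I have reached the goal. Action: stop"
    return "Thought: the goal lies to my right. Action: right"

# Assumed single-step action vocabulary (row, column deltas on a grid).
MOVES = {"right": (0, 1), "left": (0, -1), "up": (-1, 0), "down": (1, 0)}

def navigate(start, goal, max_steps=10):
    """Alternate model 'thoughts' with one-step actions until 'stop'."""
    pos, trace = start, [start]
    for _ in range(max_steps):
        reply = query_llm(f"position {pos}, goal {goal}")
        action = reply.rsplit("Action:", 1)[1].strip()
        if action == "stop":
            break
        dr, dc = MOVES[action]
        pos = (pos[0] + dr, pos[1] + dc)
        trace.append(pos)  # record each visited cell for evaluation
    return trace
```

The point of interleaving is that each action is grounded in the current state, rather than asking the model to emit a full plan in one shot.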
📝 Abstract
Large language models (LLMs) have achieved remarkable success across a wide spectrum of tasks; however, they still face limitations in scenarios that demand long-term planning and spatial reasoning. To facilitate this line of research, in this work, we propose a new benchmark, termed **P**ath **P**lanning from **N**atural **L**anguage (**PPNL**). Our benchmark evaluates LLMs' spatial-temporal reasoning by formulating "path planning" tasks that require an LLM to navigate to target locations while avoiding obstacles and adhering to constraints. Leveraging this benchmark, we systematically investigate LLMs including GPT-4 via different few-shot prompting methodologies as well as BART and T5 of various sizes via fine-tuning. Our experimental results show the promise of few-shot GPT-4 in spatial reasoning, when it is prompted to reason and act interleavedly, although it still fails to perform long-term temporal reasoning. In contrast, while fine-tuned LLMs achieved impressive results on in-distribution reasoning tasks, they struggled to generalize to larger environments or environments with more obstacles.
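For context on the kind of task PPNL formulates (this is not code from the paper), a classical grid planner makes a useful baseline against which an LLM's natural-language plans can be checked. A minimal sketch, assuming a 2D grid where 0 marks a free cell and 1 an obstacle:

```python
from collections import deque

def plan_path(grid, start, goal):
    """Breadth-first search for a shortest obstacle-free path on a 2D grid.

    grid: list of lists, 0 = free cell, 1 = obstacle.
    start, goal: (row, col) tuples. Returns the path as a list of cells
    from start to goal, or None if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    frontier = deque([start])
    parent = {start: None}  # also serves as the visited set
    while frontier:
        cell = frontier.popleft()
        if cell == goal:
            # Reconstruct the path by walking parent links back to start.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in parent):
                parent[(nr, nc)] = cell
                frontier.append((nr, nc))
    return None  # goal walled off by obstacles
```

Unlike an LLM, this planner is exact but requires the environment as a structured grid; PPNL's premise is to pose the same navigation problems in natural language.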
Problem

Research questions and friction points this paper is trying to address.

LLMs' spatial-temporal reasoning
path planning tasks
generalization in larger environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs for path planning
Spatial-temporal reasoning benchmark
Few-shot GPT-4 prompting
Mohamed Aghzal
Department of Computer Science, George Mason University, Fairfax, VA, 22030, USA
E. Plaku
Department of Computer Science, George Mason University, Fairfax, VA, 22030, USA
Ziyu Yao
Department of Computer Science, George Mason University, Fairfax, VA, 22030, USA