Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study

📅 2024-08-26
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
📄 PDF
🤖 AI Summary
This study presents a systematic evaluation of large language models (LLMs)—including GPT-4o, GLM-4, Claude 3 Sonnet, and Moonshot-v1-8k—on spatial cognition tasks such as spatial concept understanding and route planning. The authors introduce a multi-task benchmark dataset covering twelve distinct spatial reasoning categories with verified answers, evaluated under a two-phase, difficulty-stratified protocol: zero-shot assessment followed by prompt-tuning tests using chain-of-thought (CoT) and one-shot strategies. Results show that GPT-4o achieves the highest zero-shot average accuracy (71.3%); CoT raises its simple route-planning accuracy from 12.4% to 87.5%; and one-shot prompting lifts Moonshot-v1-8k's mapping-task accuracy from 10.1% to 76.3%. The work provides a reproducible benchmark and methodological framework for assessing spatial intelligence in LLMs.

📝 Abstract
The emergence of large language models such as ChatGPT, Gemini, and others highlights the importance of evaluating their diverse capabilities, ranging from natural language understanding to code generation. However, their performance on spatial tasks has not been thoroughly assessed. This study addresses this gap by introducing a new multi-task spatial evaluation dataset designed to systematically explore and compare the performance of several advanced models on spatial tasks. The dataset includes twelve distinct task types, such as spatial understanding and simple route planning, each with verified and accurate answers. We evaluated multiple models, including OpenAI's gpt-3.5-turbo, gpt-4-turbo, gpt-4o, ZhipuAI's glm-4, Anthropic's claude-3-sonnet-20240229, and MoonShot's moonshot-v1-8k, using a two-phase testing approach. First, we conducted zero-shot testing. Then, we categorized the dataset by difficulty and performed prompt-tuning tests. Results show that gpt-4o achieved the highest overall accuracy in the first phase, with an average of 71.3%. Although moonshot-v1-8k slightly underperformed overall, it outperformed gpt-4o in place name recognition tasks. The study also highlights the impact of prompt strategies on model performance in specific tasks. For instance, the Chain-of-Thought (CoT) strategy increased gpt-4o's accuracy in simple route planning from 12.4% to 87.5%, while a one-shot strategy improved moonshot-v1-8k's accuracy in mapping tasks from 10.1% to 76.3%.
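The two-phase protocol described in the abstract (zero-shot testing, then a prompt-tuned pass such as Chain-of-Thought on the same items) can be sketched as a small evaluation harness. This is a minimal illustration, not the paper's actual code: `query_model` is a hypothetical stand-in for a real LLM API call, and the toy items and scoring are invented for demonstration.

```python
# Sketch of a two-phase evaluation: a zero-shot pass, then a CoT-prompted
# pass over the same benchmark items, comparing per-strategy accuracy.
# `query_model` is a hypothetical stand-in for a real LLM API call.

def query_model(prompt: str) -> str:
    # Toy stand-in: answers correctly only when the prompt asks for
    # step-by-step reasoning, mimicking the CoT gains reported above.
    return "B" if "step by step" in prompt else "A"

def accuracy(items, make_prompt) -> float:
    # Score exact-match accuracy of the model over (question, gold) pairs.
    correct = sum(query_model(make_prompt(q)) == gold for q, gold in items)
    return correct / len(items)

# Toy route-planning items with a known gold answer.
items = [("Starting at (2, 3), which cell lies one step north?", "B")] * 8

zero_shot = accuracy(items, lambda q: q)
cot = accuracy(items, lambda q: f"{q}\nLet's think step by step.")
```

In a real harness, `query_model` would call the model under test, and the second phase would vary the prompt strategy (CoT, one-shot) per difficulty stratum rather than using a fixed suffix.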
Problem

Research questions and friction points this paper is trying to address.

Spatial Reasoning
Language Models
Navigation Tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial Reasoning
Multi-task Testing
Prompt Strategies
Liuchang Xu
School of Mathematics and Computer Science, Zhejiang Agriculture and Forestry University, Hangzhou 311300, China; Financial Big Data Research Institute, Sunyard Technology Co., Ltd., Hangzhou 310053, China
Shuo Zhao
Qingming Lin
School of Mathematics and Computer Science, Zhejiang Agriculture and Forestry University, Hangzhou 311300, China
Luyao Chen
School of Mathematics and Computer Science, Zhejiang Agriculture and Forestry University, Hangzhou 311300, China
Qianqian Luo
School of Mathematics and Computer Science, Zhejiang Agriculture and Forestry University, Hangzhou 311300, China
Sensen Wu
School of Earth Sciences, Zhejiang University, Hangzhou 310058, China
Xinyue Ye
Department of Landscape Architecture and Urban Planning & Center for Geospatial Sciences, Applications and Technology, Texas A&M University, College Station, TX, 77843
Hailin Feng
School of Mathematics and Computer Science, Zhejiang Agriculture and Forestry University, Hangzhou 311300, China
Zhenhong Du
School of Earth Sciences, Zhejiang University, Hangzhou 310058, China