🤖 AI Summary
This study presents a systematic evaluation of large language models (LLMs) on spatial cognition tasks, focusing on core challenges such as spatial concept understanding and route planning. To this end, it introduces a multi-task benchmark dataset covering twelve distinct spatial reasoning categories, with verified answers for every task. Evaluation follows a two-phase protocol: zero-shot testing, followed by difficulty-stratified prompt-tuning tests that apply Chain-of-Thought (CoT) and one-shot prompting strategies. Across six models, including OpenAI's gpt-4o and MoonShot's moonshot-v1-8k, gpt-4o achieves the highest zero-shot average accuracy at 71.3%; CoT raises gpt-4o's simple route-planning accuracy from 12.4% to 87.5%; and one-shot prompting raises moonshot-v1-8k's accuracy on mapping tasks from 10.1% to 76.3%. This work establishes a reproducible benchmark and methodological framework for assessing spatial intelligence in LLMs.
📝 Abstract
The emergence of large language models such as ChatGPT and Gemini highlights the importance of evaluating their diverse capabilities, ranging from natural language understanding to code generation. However, their performance on spatial tasks has not been thoroughly assessed. This study addresses that gap by introducing a new multi-task spatial evaluation dataset designed to systematically explore and compare the performance of several advanced models. The dataset includes twelve distinct task types, such as spatial understanding and simple route planning, each with verified and accurate answers. We evaluated multiple models, including OpenAI's gpt-3.5-turbo, gpt-4-turbo, and gpt-4o, ZhipuAI's glm-4, Anthropic's claude-3-sonnet-20240229, and MoonShot's moonshot-v1-8k, using a two-phase testing approach. First, we conducted zero-shot testing. Then, we categorized the dataset by difficulty and performed prompt-tuning tests. Results show that gpt-4o achieved the highest overall accuracy in the first phase, with an average of 71.3%. Although moonshot-v1-8k slightly underperformed overall, it outperformed gpt-4o on place name recognition tasks. The study also highlights the impact of prompt strategies on model performance in specific tasks. For instance, the Chain-of-Thought (CoT) strategy increased gpt-4o's accuracy in simple route planning from 12.4% to 87.5%, while a one-shot strategy improved moonshot-v1-8k's accuracy in mapping tasks from 10.1% to 76.3%.
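To make the three prompting strategies compared above concrete, the sketch below shows how each prompt variant might be constructed before being sent to a model. This is a minimal illustration only: the question text, the worked example, and the function names are hypothetical and are not taken from the released dataset or the paper's actual templates.

```python
# Hypothetical sketch of the three prompting strategies the study compares:
# zero-shot, Chain-of-Thought (CoT), and one-shot. The sample question and
# example pair are invented for illustration.

QUESTION = "Starting at point A, walk two blocks north, then one block east. Where are you relative to A?"

def zero_shot(question: str) -> str:
    # Phase 1: the question is sent as-is, with no extra guidance.
    return question

def chain_of_thought(question: str) -> str:
    # CoT appends an instruction that elicits step-by-step reasoning,
    # which the study reports helps on simple route planning.
    return question + "\nLet's think step by step."

def one_shot(question: str, example_q: str, example_a: str) -> str:
    # One-shot prepends a single worked example before the real question.
    return f"Q: {example_q}\nA: {example_a}\n\nQ: {question}\nA:"

prompt = one_shot(
    QUESTION,
    "Starting at point B, walk one block south. Where are you relative to B?",
    "One block south of B.",
)
print(prompt.count("Q:"))  # → 2 (the worked example plus the real question)
```

The same question can thus be evaluated under all three conditions, holding the task fixed while varying only the prompt scaffolding, which is what allows per-strategy accuracy comparisons such as 12.4% vs 87.5% on route planning.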