🤖 AI Summary
Prior work lacks a quantitative, systematic evaluation of large language models' (LLMs) geospatial route reasoning capabilities, particularly for route reversal. Method: We introduce the first large-scale benchmark for route inverse reasoning, comprising 36,000 real-world routes across 12 global megacities; develop PathBuilder, a tool enabling bidirectional translation between natural language and navigational paths; and propose a tiered quantitative evaluation framework measuring path reconstruction accuracy, robustness, and confidence calibration. Results: Evaluation of 11 state-of-the-art LLMs reveals widespread failure to generate correct reverse paths, characterized by high-confidence erroneous outputs and low path similarity to ground truth. This work bridges a critical gap in geospatial reasoning assessment and establishes a reproducible, scalable evaluation paradigm for LLMs' spatial semantic understanding.
📝 Abstract
Humans can interpret geospatial information through natural language, whereas the geospatial cognition capabilities of Large Language Models (LLMs) remain underexplored. Prior research in this domain has been constrained by non-quantifiable metrics, limited evaluation datasets, and unclear research hierarchies. We therefore propose a large-scale benchmark and conduct a comprehensive evaluation of the geospatial route cognition of LLMs. We create a large-scale evaluation dataset comprising 36,000 routes from 12 metropolises worldwide. We then introduce PathBuilder, a novel tool for converting natural language instructions into navigation routes, and vice versa, bridging the gap between geospatial information and natural language. Finally, we propose a new evaluation framework and metrics to rigorously assess 11 state-of-the-art (SOTA) LLMs on the task of route reversal. The benchmark reveals that LLMs exhibit limited ability to reverse routes: most reversed routes neither return to the starting point nor resemble the optimal route. Additionally, LLMs face challenges such as low robustness in route generation and high confidence in their incorrect answers. Code & data available here: [TurnBack](https://github.com/bghjmn32/EMNLP2025_Turnback)