Can Large Vision Language Models Read Maps Like a Human?

📅 2025-03-18

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work investigates whether Large Vision-Language Models (LVLMs) can interpret pixel-based outdoor maps and generate natural-language navigation instructions. To this end, we introduce MapBench, the first outdoor navigation benchmark explicitly designed for human-readable maps, comprising 100 real-world maps and over 1,600 path-finding tasks. We propose the Map Space Scene Graph (MSSG) as an indexing structure that links map content to natural language for fine-grained evaluation, and design a Chain-of-Thought (CoT) reasoning framework that decomposes map navigation into sequential cognitive steps, exposing fundamental limitations of LVLMs in spatial reasoning and structured decision-making. Using zero-shot prompting, MSSG-guided inference, and multi-granularity evaluation, we assess leading LVLMs and find an average task accuracy below 35%, which confirms MapBench's high difficulty. The benchmark dataset, evaluation code, and implementation are publicly released.


📝 Abstract

In this paper, we introduce MapBench, the first dataset specifically designed for outdoor navigation on human-readable, pixel-based maps, curated from complex path-finding scenarios. MapBench comprises over 1,600 pixel-space map path-finding problems drawn from 100 diverse maps. In MapBench, LVLMs generate language-based navigation instructions given a map image and a query with start and end landmarks. For each map, MapBench provides a Map Space Scene Graph (MSSG), an indexing data structure used to convert to and from natural language and to evaluate LVLM-generated results. We demonstrate that MapBench significantly challenges state-of-the-art LVLMs under both zero-shot prompting and a Chain-of-Thought (CoT) augmented reasoning framework that decomposes map navigation into sequential cognitive processes. Our evaluation of both open-source and closed-source LVLMs underscores the substantial difficulty posed by MapBench, revealing critical limitations in their spatial reasoning and structured decision-making capabilities. We release all code and the dataset at https://github.com/taco-group/MapBench.
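
To make the idea of a scene graph used as an index more concrete, here is a minimal sketch in Python. It is not the MSSG format from the MapBench repository: the class name `ToySceneGraph`, the landmark names, the intersection labels `I1` and `I2`, and the helper `route_is_valid` are all hypothetical, chosen only to illustrate how a graph of landmarks and intersections could supply a reference route and check a route extracted from a model's instructions.

```python
from collections import deque


class ToySceneGraph:
    """Hypothetical toy graph: nodes are landmarks or intersections,
    undirected edges are walkable map segments. Not the real MSSG schema."""

    def __init__(self):
        self.adj = {}

    def add_edge(self, a, b):
        # Store each segment in both directions (undirected graph).
        self.adj.setdefault(a, set()).add(b)
        self.adj.setdefault(b, set()).add(a)

    def shortest_path(self, start, goal):
        """Breadth-first search; returns the fewest-hop node list or None."""
        if start not in self.adj or goal not in self.adj:
            return None
        prev = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                path = []
                while node is not None:
                    path.append(node)
                    node = prev[node]
                return path[::-1]
            for nxt in sorted(self.adj[node]):  # sorted for determinism
                if nxt not in prev:
                    prev[nxt] = node
                    queue.append(nxt)
        return None


def route_is_valid(graph, route, start, goal):
    """True if the route starts and ends at the queried landmarks
    and every consecutive pair of nodes is joined by an edge."""
    if not route or route[0] != start or route[-1] != goal:
        return False
    return all(b in graph.adj.get(a, ()) for a, b in zip(route, route[1:]))


# Made-up example map: two intersections connecting three landmarks.
g = ToySceneGraph()
g.add_edge("Gate", "I1")
g.add_edge("I1", "Fountain")
g.add_edge("I1", "I2")
g.add_edge("I2", "Museum")

reference = g.shortest_path("Gate", "Museum")
```

In this toy setup, `reference` is the fewest-hop route from `Gate` to `Museum`, and `route_is_valid` rejects a candidate route that skips over a segment that does not exist on the map. Turning free-form LVLM instructions into such a node sequence is the hard part and is left out of the sketch.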

Problem

Research questions and friction points this paper is trying to address.

Evaluates LVLMs on human-readable map navigation tasks

Introduces MapBench dataset for complex outdoor navigation scenarios

Assesses spatial reasoning and decision-making in LVLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces MapBench dataset for map-based navigation

Uses MSSG for natural language conversion and evaluation

Tests LVLMs with zero-shot and CoT reasoning frameworks
