MapTab: Can MLLMs Master Constrained Route Planning?

📅 2026-02-20
🤖 AI Summary
Existing benchmarks struggle to effectively evaluate the reasoning capabilities of multimodal large language models (MLLMs) under complex constraints, particularly in path-planning tasks that require integrating visual maps with structured tabular data. To address this gap, this work proposes MapTab, the first multimodal evaluation benchmark that combines map images with attribute-rich tables containing temporal, pricing, and other real-world factors. MapTab covers subway networks across 160 global cities and 168 tourist attractions, incorporating four practical constraints: time, cost, comfort, and reliability. Based on a large-scale, human-annotated multimodal question set, the evaluation across 15 prominent MLLMs reveals that current models perform poorly under limited visual perception, with multimodal fusion often underperforming unimodal baselines, highlighting significant deficiencies in cross-modal alignment and constraint-aware reasoning.

📝 Abstract
Systematic evaluation of Multimodal Large Language Models (MLLMs) is crucial for advancing Artificial General Intelligence (AGI). However, existing benchmarks remain insufficient for rigorously assessing their constrained reasoning capabilities. To bridge this gap, we introduce MapTab, a multimodal benchmark specifically designed to evaluate constrained reasoning in MLLMs via route planning tasks. MapTab requires MLLMs to perceive and ground visual cues from map images alongside route attributes (e.g., Time, Price) from structured tabular data. The benchmark encompasses two scenarios: Metromap, covering metro networks in 160 cities across 52 countries, and Travelmap, depicting 168 representative tourist attractions from 19 countries. In total, MapTab comprises 328 images, 196,800 route planning queries, and 3,936 QA queries, all incorporating 4 key constraints: Time, Price, Comfort, and Reliability. Extensive evaluations across 15 representative MLLMs reveal that current models face substantial challenges in constrained multimodal reasoning. Notably, under conditions of limited visual perception, multimodal collaboration often underperforms compared to unimodal approaches. We believe MapTab provides a challenging and realistic testbed to advance the systematic evaluation of MLLMs.
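The route-planning task the abstract describes (optimize one route attribute subject to a constraint on another, e.g. minimize Time within a Price budget) can be sketched as a budget-constrained shortest-path search. The sketch below is illustrative only and is not the paper's method or data: the graph, attribute values, and function name are all hypothetical, and it uses a standard Dijkstra-style search over (station, money-spent) states.

```python
import heapq

def cheapest_feasible_route(graph, src, dst, budget):
    """Minimize total travel time subject to a total-price budget.

    graph: {station: [(neighbor, time, price), ...]} -- a toy metro network.
    Runs a Dijkstra-style search over (station, price_spent) states, so a
    station may be revisited at a different cumulative price.
    Returns (time, price, path) for the fastest route within budget,
    or None if no feasible route exists.
    """
    best = {}  # (station, price_spent) -> best time seen for that state
    pq = [(0, 0, src, [src])]  # (time, price, station, path)
    while pq:
        t, p, u, path = heapq.heappop(pq)
        if u == dst:
            return t, p, path  # first pop of dst is time-optimal
        if best.get((u, p), float("inf")) <= t:
            continue  # already reached this state at least as fast
        best[(u, p)] = t
        for v, dt, dp in graph.get(u, []):
            if p + dp <= budget:  # prune routes that exceed the price budget
                heapq.heappush(pq, (t + dt, p + dp, v, path + [v]))
    return None
```

With a tighter budget the search is forced onto the slow-but-cheap direct line; with a looser one it can take the fast transfer, which mirrors how the Price constraint changes the correct answer to the same map query.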
Problem

Research questions and friction points this paper is trying to address.

constrained reasoning
multimodal large language models
route planning
benchmark evaluation
multimodal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

constrained reasoning
multimodal large language models
route planning
benchmark
visual-tabular grounding
Ziqiao Shang
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; School of Intelligence Science and Technology, Nanjing University, Suzhou, China
Lingyue Ge
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; School of Intelligence Science and Technology, Nanjing University, Suzhou, China
Yang Chen
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; School of Intelligence Science and Technology, Nanjing University, Suzhou, China
Shi-Yu Tian
Nanjing University
Machine Learning
Zhenyu Huang
Sichuan University
Multimodal Learning; Representation Learning
Wenbo Fu
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; School of Intelligence Science and Technology, Nanjing University, Suzhou, China
Yu-Feng Li
Professor, Nanjing University
Machine Learning
Lan-Zhe Guo
LAMDA Group, Nanjing University
Machine Learning