MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

📅 2026-02-26
🤖 AI Summary
This work addresses the lack of systematic and reproducible evaluation frameworks for route-planning agents driven by large language models (LLMs) in real-world mobility scenarios. To this end, we introduce MobilityBench, the first benchmark built upon large-scale, real user queries spanning multiple cities worldwide and capturing diverse travel intents. We further develop a deterministic API-replay sandbox and a multi-dimensional evaluation protocol that together enable end-to-end assessment across five key dimensions: result validity, instruction comprehension, planning capability, tool utilization, and computational efficiency. Experimental results demonstrate that while current LLMs perform adequately on basic navigation tasks, they exhibit significant limitations in generating personalized routes under complex user-preference constraints, thereby highlighting critical challenges and providing both direction and infrastructure for future research.

📝 Abstract
Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap and covers a broad spectrum of route-planning intents across multiple cities worldwide. To enable reproducible, end-to-end evaluation, we design a deterministic API-replay sandbox that eliminates environmental variance from live services. We further propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM-based route-planning agents across diverse real-world mobility scenarios and provide an in-depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic Information Retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at https://github.com/AMAP-ML/MobilityBench.
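The abstract's deterministic API-replay sandbox is a record-and-replay idea: tool calls are captured once from the live mapping service, then every evaluation run is served from the recorded responses, so non-determinism in live services cannot affect scores. The paper's actual implementation is not shown on this page; the sketch below is a hypothetical minimal illustration of the pattern, with all class, method, and endpoint names invented for the example.

```python
import hashlib
import json

class ReplaySandbox:
    """Minimal record/replay cache for agent tool (API) calls.

    Illustrative only -- the real MobilityBench sandbox may differ.
    """

    def __init__(self, live_fn=None):
        self.live_fn = live_fn  # real API client, used only while recording
        self.cache = {}         # canonical request key -> stored response

    @staticmethod
    def _key(endpoint, params):
        # Canonicalize the request so identical calls map to the same key.
        canon = json.dumps({"endpoint": endpoint, "params": params},
                           sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canon.encode()).hexdigest()

    def record(self, endpoint, params):
        # Forward to the live service once and store the response.
        resp = self.live_fn(endpoint, params)
        self.cache[self._key(endpoint, params)] = resp
        return resp

    def replay(self, endpoint, params):
        # Serve only recorded responses; unrecorded calls fail loudly
        # instead of silently reintroducing live-service variance.
        key = self._key(endpoint, params)
        if key not in self.cache:
            raise KeyError(f"Unrecorded call: {endpoint} {params}")
        return self.cache[key]  # deterministic: same response every run


# Usage: record once against a stand-in "live" service, then replay.
sandbox = ReplaySandbox(live_fn=lambda ep, p: {"routes": ["A->B"], "ep": ep})
sandbox.record("route/driving", {"origin": "A", "dest": "B"})
out = sandbox.replay("route/driving", {"origin": "A", "dest": "B"})
```

Keying the cache on a canonical JSON serialization (sorted keys, fixed separators) makes replay insensitive to argument ordering, which is what keeps repeated agent runs byte-for-byte reproducible.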
Problem

Research questions and friction points this paper is trying to address.

route-planning agents
large language models
real-world mobility
benchmark evaluation
personalized mobility
Innovation

Methods, ideas, or system contributions that make the work stand out.

MobilityBench
route-planning agents
LLM-based evaluation
deterministic API-replay sandbox
multi-dimensional evaluation
Zhiheng Song
Computer Network Information Center, Chinese Academy of Sciences; AMAP, Alibaba Group
Jingshuai Zhang
AMAP, Alibaba Group
Chuan Qin
CNIC, Chinese Academy of Sciences
Knowledge Computing; Representation Learning
Chao Wang
University of Science and Technology of China
Recommender Systems; Data Mining; Machine Learning
Chao Chen
Alibaba Group; PhD, Zhejiang University
LLMs; Machine Learning; Computer Vision
Longfei Xu
AMAP, Alibaba Group
Kaikui Liu
AMAP, Alibaba Group
Xiangxiang Chu
AMAP, Alibaba Group
Hengshu Zhu
Computer Network Information Center, Chinese Academy of Sciences