🤖 AI Summary
This work addresses the limited capability of large language models (LLMs) to semantically interpret human mobility data, particularly to explain the underlying causes and deeper meanings of movement patterns. To this end, we introduce MobQA, the first question-answering benchmark designed specifically for this task. MobQA evaluates joint spatiotemporal and semantic reasoning through three complementary question types: factual retrieval, multiple-choice reasoning, and free-form explanation. The benchmark is constructed from real-world GPS trajectories via meticulous human annotation, integrating spatial, temporal, and semantic dimensions. Systematic evaluation reveals that while mainstream LLMs perform robustly on factual extraction, they exhibit significant limitations in semantic reasoning and long-trajectory interpretation, with performance deteriorating markedly as trajectory length increases. This work establishes a novel benchmark and analytical lens for behavioral understanding in mobile intelligence and embodied AI.
📝 Abstract
This paper presents MobQA, a benchmark dataset designed to evaluate the semantic understanding capabilities of large language models (LLMs) for human mobility data through natural language question answering.
While existing models excel at predicting human movement patterns, it remains unclear how well they can interpret the underlying reasons or semantic meaning of those patterns. MobQA provides a comprehensive evaluation framework for LLMs to answer questions about diverse human GPS trajectories spanning daily to weekly granularities. It comprises 5,800 high-quality question-answer pairs across three complementary question types: factual retrieval (precise data extraction), multiple-choice reasoning (semantic inference), and free-form explanation (interpretive description), all of which require spatial, temporal, and semantic reasoning. Our evaluation of major LLMs reveals strong performance on factual retrieval but significant limitations in semantic reasoning and explanation question answering, with trajectory length substantially impacting model effectiveness. These findings demonstrate both the achievements and the limitations of state-of-the-art LLMs for semantic mobility understanding. The MobQA dataset is available at https://github.com/CyberAgentAILab/mobqa.