🤖 AI Summary
This work addresses the limited capability of large language models (LLMs) to semantically interpret human mobility data, particularly to explain the underlying causes and deeper meanings of movement patterns. To this end, we introduce MobQA, the first question-answering benchmark designed specifically for this task. MobQA evaluates joint spatiotemporal and semantic reasoning through three complementary question types: factual retrieval, multiple-choice reasoning, and free-form explanation. The benchmark is constructed from real-world GPS trajectories via meticulous human annotation, integrating spatial, temporal, and semantic dimensions. Systematic evaluation reveals that while mainstream LLMs perform robustly on factual extraction, they exhibit significant limitations in semantic reasoning and long-trajectory interpretation, with performance deteriorating markedly as trajectory length increases. This work establishes a novel benchmark and analytical lens for behavioral understanding in mobile intelligence and embodied AI.
📝 Abstract
This paper presents MobQA, a benchmark dataset designed to evaluate the semantic understanding capabilities of large language models (LLMs) for human mobility data through natural language question answering.
While existing models excel at predicting human movement patterns, it remains unclear how well they can interpret the underlying reasons or semantic meaning of those patterns. MobQA provides a comprehensive evaluation framework for LLMs to answer questions about diverse human GPS trajectories spanning daily to weekly granularities. It comprises 5,800 high-quality question-answer pairs across three complementary question types: factual retrieval (precise data extraction), multiple-choice reasoning (semantic inference), and free-form explanation (interpretive description), all of which require spatial, temporal, and semantic reasoning. Our evaluation of major LLMs reveals strong performance on factual retrieval but significant limitations in semantic reasoning and explanation question answering, with trajectory length substantially impacting model effectiveness. These findings demonstrate both the achievements and the limitations of state-of-the-art LLMs for semantic mobility understanding. The MobQA dataset is available at https://github.com/CyberAgentAILab/mobqa.