VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing video understanding benchmarks focus predominantly on indoor or short-range scenarios, so they do not assess whether multimodal large language models (MLLMs) can comprehend the complex spatiotemporal trajectories of long-distance travel. Method: To address this gap, we introduce VIR-Bench, the first benchmark dedicated to long-distance travel. It comprises 200 authentic travel videos and a novel itinerary reconstruction task designed to systematically evaluate MLLMs’ structured reasoning across diverse spatial and temporal scales. We propose a fine-grained spatiotemporal alignment evaluation framework and, informed by the empirical findings, develop a prototype travel-planning agent. Results: Experiments show that state-of-the-art MLLMs perform poorly on this task, confirming its difficulty, while the enhanced agent markedly improves itinerary recommendation quality. VIR-Bench thus lays a foundation for advancing embodied AI and real-world navigation applications.

📝 Abstract
Recent advances in multimodal large language models (MLLMs) have significantly enhanced video understanding capabilities, opening new possibilities for practical applications. Yet current video benchmarks focus largely on indoor scenes or short-range outdoor activities, leaving the challenges associated with long-distance travel largely unexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs, underpinning real-world tasks such as embodied-AI planning and navigation. To bridge this gap, we present VIR-Bench, a novel benchmark consisting of 200 travel videos that frames itinerary reconstruction as a challenging task designed to evaluate and push forward MLLMs' geospatial-temporal intelligence. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, underscoring the difficulty of handling videos that span extended spatial and temporal scales. Moreover, we conduct an in-depth case study in which we develop a prototype travel-planning agent that leverages the insights gained from VIR-Bench. The agent's markedly improved itinerary recommendations verify that our evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.
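To make the itinerary reconstruction task more concrete, the following is a minimal sketch of how a predicted itinerary might be compared with a reference one. It is an illustration under stated assumptions, not the evaluation protocol used by VIR-Bench: the abstract does not specify the metrics, so the representation (an ordered list of place names), the two F1 scores, and the helper names `f1`, `transitions`, and `score_itinerary` are all hypothetical. The place names are made-up example data.

```python
from typing import Dict, List, Set, Tuple


def f1(pred: Set, gold: Set) -> float:
    """F1 overlap between two sets; 0.0 when nothing overlaps."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def transitions(itinerary: List[str]) -> Set[Tuple[str, str]]:
    """Consecutive (from, to) pairs in visiting order."""
    return set(zip(itinerary, itinerary[1:]))


def score_itinerary(pred: List[str], gold: List[str]) -> Dict[str, float]:
    """Hypothetical scoring: which places were found, and in what order."""
    return {
        # Spatial side: did the model identify the visited places?
        "place_f1": f1(set(pred), set(gold)),
        # Temporal side: did it recover the order of movement between them?
        "transition_f1": f1(transitions(pred), transitions(gold)),
    }


# Illustrative data only; not taken from the benchmark.
gold = ["Tokyo", "Kyoto", "Osaka", "Hiroshima"]
pred = ["Tokyo", "Osaka", "Kyoto", "Hiroshima"]
scores = score_itinerary(pred, gold)
print(scores)
```

In this example the prediction names every place correctly but swaps the order of two stops, so the place score is perfect while no transition matches. Separating the two scores in this way shows why recognizing locations alone would not be enough for a task that also requires temporal understanding.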
Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs' geospatial-temporal understanding via travel itinerary reconstruction
Addressing the lack of benchmarks for long-distance travel video understanding
Improving itinerary recommendations through enhanced geospatial-temporal intelligence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing VIR-Bench for geospatial-temporal video evaluation
Framing itinerary reconstruction as a core challenging task
Developing a prototype agent for improved travel planning
Authors
Hao Wang (Waseda University)
Eiki Murata (CyberAgent, Inc.)
Lingfang Zhang (Waseda University)
Ayako Sato (CyberAgent, Inc.)
So Fukuda (Waseda University)
Ziqi Yin (Jilin University): unsupervised domain adaptation, prompt learning
Wentao Hu (PhD student, The Hong Kong Polytechnic University): Large Language Model, Computer Vision
Keisuke Nakao (Waseda University)
Yusuke Nakamura (Waseda University)
Sebastian Zwirner (Waseda University)
Yi-Chia Chen (Waseda University)
Hiroyuki Otomo (CyberAgent, Inc.)
Hiroki Ouchi (Nara Institute of Science and Technology): Natural Language Processing, Machine Learning
Daisuke Kawahara (Waseda University): Computational Linguistics, Natural Language Processing