🤖 AI Summary
Existing benchmarks lack systematic evaluation of multi-hop spatial reasoning (MSR) over dynamic videos. To address this gap, we propose Video-MSR, the first benchmark specifically designed for video-based MSR, spanning four distinct task categories, and introduce MSR-9K, a high-quality instruction-tuning dataset. We build a scalable data construction pipeline that combines model-generated content with human verification to ensure reliability. Using MSR-9K, we instruction-tune prominent multimodal large language models (MLLMs), including Qwen-VL. A comprehensive evaluation of 20 MLLMs reveals significant deficiencies in handling complex spatial reasoning chains. Notably, Qwen-VL fine-tuned on MSR-9K achieves an absolute gain of 7.82% on the Video-MSR benchmark, underscoring the effectiveness of our approach.
📝 Abstract
Spatial reasoning has emerged as a critical capability for Multimodal Large Language Models (MLLMs), attracting increasing attention and advancing rapidly. However, existing benchmarks primarily focus on single-step perception-to-judgment tasks, leaving scenarios that require complex visual-spatial logical chains significantly underexplored. To bridge this gap, we introduce Video-MSR, the first benchmark specifically designed to evaluate Multi-hop Spatial Reasoning (MSR) in dynamic video scenarios. Video-MSR systematically probes MSR capabilities through four distinct tasks: Constrained Localization, Chain-based Reference Retrieval, Route Planning, and Counterfactual Physical Deduction. Our benchmark comprises 3,052 high-quality video instances with 4,993 question-answer pairs, constructed via a scalable, visually grounded pipeline that combines advanced model generation with rigorous human verification. Through a comprehensive evaluation of 20 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate proficiency in surface-level perception, they suffer pronounced performance drops on MSR tasks, frequently exhibiting spatial disorientation and hallucination during multi-step deduction. To mitigate these shortcomings and equip models with stronger MSR capabilities, we further curate MSR-9K, a specialized instruction-tuning dataset, and fine-tune Qwen-VL, achieving a +7.82% absolute improvement on Video-MSR. Our results underscore the efficacy of multi-hop spatial instruction data and establish Video-MSR as a vital foundation for future research. The code and data will be available at https://github.com/ruiz-nju/Video-MSR.