🤖 AI Summary
Existing benchmarks lack systematic evaluation of multi-hop spatial reasoning (MSR) over dynamic videos. To address this gap, we propose Video-MSR, the first benchmark specifically designed for video-based MSR, spanning four distinct task categories, and introduce MSR-9K, a high-quality instruction-tuning dataset. We build a scalable data construction pipeline that combines model-generated content with human verification to ensure reliability. Using MSR-9K, we instruction-tune prominent multimodal large language models (MLLMs), including Qwen-VL. A comprehensive evaluation of 20 MLLMs reveals significant deficiencies in handling complex spatial reasoning chains. Notably, Qwen-VL fine-tuned on MSR-9K achieves an absolute gain of 7.82% on the Video-MSR benchmark, underscoring the effectiveness of our approach.
📝 Abstract
Spatial reasoning has emerged as a critical capability for Multimodal Large Language Models (MLLMs), attracting increasing attention and advancing rapidly. However, existing benchmarks primarily focus on single-step perception-to-judgment tasks, leaving scenarios that require complex visual-spatial logical chains significantly underexplored. To bridge this gap, we introduce Video-MSR, the first benchmark specifically designed to evaluate Multi-hop Spatial Reasoning (MSR) in dynamic video scenarios. Video-MSR systematically probes MSR capabilities through four distinct tasks: Constrained Localization, Chain-based Reference Retrieval, Route Planning, and Counterfactual Physical Deduction. Our benchmark comprises 3,052 high-quality video instances with 4,993 question-answer pairs, constructed via a scalable, visually grounded pipeline that combines advanced model generation with rigorous human verification. Through a comprehensive evaluation of 20 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate proficiency in surface-level perception, they suffer pronounced performance drops on MSR tasks, frequently exhibiting spatial disorientation and hallucination during multi-step deduction. To mitigate these shortcomings and equip models with stronger MSR capabilities, we further curate MSR-9K, a specialized instruction-tuning dataset, and fine-tune Qwen-VL, achieving a +7.82% absolute improvement on Video-MSR. Our results underscore the efficacy of multi-hop spatial instruction data and establish Video-MSR as a vital foundation for future research. The code and data will be available at https://github.com/ruiz-nju/Video-MSR.