MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

📅 2025-12-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current multimodal large language models (MLLMs) lack systematic evaluation of spatial understanding over continuous video. Method: We introduce MMSI-Video-Bench, a comprehensive, fully human-annotated benchmark explicitly designed for video-based spatial intelligence. It spans four hierarchical capabilities (perception, planning, prediction, and cross-video reasoning) through 1,106 expert-validated questions grounded in 1,278 video clips. MMSI-Video-Bench also supports three domain-oriented sub-benchmarks: Indoor Scene Perception Bench, Robot Bench, and Grounding Bench. Contribution/Results: Through fine-grained error analysis and decoupled evaluation of 25 state-of-the-art MLLMs, we reveal severe spatial deficiencies: the best-performing model lags human performance by nearly 60%, while many models perform near chance level. Neither 3D cue injection nor chain-of-thought prompting improves spatial reasoning; moreover, typical frame-sampling strategies transfer poorly to the benchmark, and spatially fine-tuned models fail to generalize across its tasks.

📝 Abstract
Spatial understanding over continuous visual input is crucial for MLLMs to evolve into general-purpose assistants in physical environments. Yet there is still no comprehensive benchmark that holistically assesses the progress toward this goal. In this work, we introduce MMSI-Video-Bench, a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework, Perception, Planning, Prediction, and Cross-Video Reasoning, through 1,106 questions grounded in 1,278 clips from 25 datasets and in-house videos. Each item is carefully designed and reviewed by 3DV experts with explanatory rationales to ensure precise, unambiguous grounding. Leveraging its diverse data sources and holistic task coverage, MMSI-Video-Bench also supports three domain-oriented sub-benchmarks (Indoor Scene Perception Bench, Robot Bench and Grounding Bench) for targeted capability assessment. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human--AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. We further find that spatially fine-tuned models still fail to generalize effectively on our benchmark. Fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. We also show that typical frame-sampling strategies transfer poorly to our reasoning-intensive benchmark, and that neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains. We expect our benchmark to establish a solid testbed for advancing video-based spatial intelligence.
Problem

Research questions and friction points this paper is trying to address.

Lack of a comprehensive benchmark for video-based spatial intelligence in MLLMs
Need to assess spatial understanding across perception, planning, prediction, and reasoning
Existing models show a significant performance gap relative to human spatial reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-annotated benchmark for video spatial intelligence
Four-level framework: Perception, Planning, Prediction, Cross-Video Reasoning
Evaluates 25 MLLMs revealing significant human-AI performance gap
👥 Authors
Jingli Lin (Shanghai AI Laboratory)
Runsen Xu (The Chinese University of Hong Kong): 3D Computer Vision, Robotics, Deep Learning
Shaohao Zhu (Shanghai AI Laboratory)
Sihan Yang (Xi’an Jiaotong University): Medical image analysis, Multimodal large language models
Peizhou Cao (Shanghai AI Laboratory)
Yunlong Ran (Shanghai AI Laboratory)
Miao Hu (Xi’an Jiaotong University)
Chenming Zhu (The University of Hong Kong): Multimodal Large Language Models, 3D Vision
Yiman Xie (Shanghai AI Laboratory)
Yilin Long (Shanghai AI Laboratory)
Wenbo Hu (Shanghai AI Laboratory)
Dahua Lin (The Chinese University of Hong Kong): Computer Vision, Machine Learning, Probabilistic Inference, Bayesian Nonparametrics
Tai Wang (Shanghai AI Laboratory): Computer Vision, 3D Vision, Embodied AI, Deep Learning
Jiangmiao Pang (Shanghai AI Laboratory)