MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) lack robust multi-image spatial reasoning, and prevailing benchmarks evaluate only single-image relational understanding, failing to reflect what real-world physical scenes demand. Method: The paper introduces MMSI-Bench, the first VQA benchmark explicitly designed for multi-image spatial intelligence, comprising 1,000 challenging, unambiguous multiple-choice questions, each paired with carefully designed distractors and a step-by-step reasoning annotation; six 3D-vision researchers constructed it over more than 300 hours from a pool of over 120,000 images. Contributions/Results: (1) the first systematic definition and evaluation of multi-image spatial intelligence; (2) an automated error-analysis pipeline that diagnoses four canonical spatial reasoning failure modes; (3) human-annotated step-by-step reasoning that enables error attribution. Evaluating 34 state-of-the-art MLLMs, the strongest open-source model achieves only about 30% accuracy and OpenAI's o3 reaches 40%, whereas humans attain 97%, exposing fundamental deficits in spatial grounding, scene reconstruction, and logical inference.

📝 Abstract
Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real-world deployments demand. We introduce MMSI-Bench, a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours meticulously crafting 1,000 challenging, unambiguous multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and a step-by-step reasoning process. We conduct extensive experiments and thoroughly evaluate 34 open-source and proprietary MLLMs, observing a wide gap: the strongest open-source model attains roughly 30% accuracy and OpenAI's o3 reasoning model reaches 40%, while humans score 97%. These results underscore the challenging nature of MMSI-Bench and the substantial headroom for future research. Leveraging the annotated reasoning processes, we also provide an automated error analysis pipeline that diagnoses four dominant failure modes, including (1) grounding errors, (2) overlap-matching and scene-reconstruction errors, (3) situation-transformation reasoning errors, and (4) spatial-logic errors, offering valuable insights for advancing multi-image spatial intelligence. Project page: https://runsenxu.com/projects/MMSI_Bench.
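For concreteness, here is a minimal sketch of how a multiple-choice VQA benchmark such as MMSI-Bench is typically scored: load the questions, query the model, and report accuracy. The JSON field names and the `model.answer` interface below are hypothetical illustrations, not the paper's actual data format or evaluation harness.

```python
# Minimal multiple-choice VQA scoring loop (illustrative only).
# Field names ("images", "question", "options", "answer") and the
# model.answer() interface are assumptions, not MMSI-Bench's schema.
import json

def evaluate(model, questions_path: str) -> float:
    """Return accuracy of `model` on a multiple-choice question file."""
    with open(questions_path) as f:
        questions = json.load(f)  # assumed: a list of question dicts

    correct = 0
    for q in questions:
        # Each item is assumed to carry several images, a question,
        # lettered answer options, and a single gold answer letter.
        prediction = model.answer(
            images=q["images"],
            question=q["question"],
            options=q["options"],
        )
        correct += prediction.strip().upper() == q["answer"]
    return correct / len(questions)
```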
Problem

Research questions and friction points this paper is trying to address.

Assessing multi-image spatial reasoning in MLLMs
Benchmarking MLLMs on 1,000 challenging multi-image spatial questions
Diagnosing failure modes in multi-image spatial intelligence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for multi-image spatial reasoning
Automated error analysis pipeline (see the sketch after this list)
Extensive evaluation of 34 MLLMs
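To make the error-analysis idea concrete, here is a hedged sketch of one way such a pipeline could attribute failures: an LLM judge compares a model's reasoning trace against the annotated gold reasoning and picks one of the four failure modes listed in the abstract. The `judge.complete` interface and the prompt wording are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative error-attribution step: classify a model's reasoning
# trace into one of the four failure modes reported by MMSI-Bench.
# The judge API (judge.complete) and the prompt are hypothetical.
FAILURE_MODES = [
    "grounding error",
    "overlap-matching and scene-reconstruction error",
    "situation-transformation reasoning error",
    "spatial-logic error",
]

def attribute_error(judge, model_reasoning: str, gold_reasoning: str) -> str:
    """Ask an LLM judge to name the dominant failure mode."""
    prompt = (
        "Compare the model's reasoning to the annotated gold reasoning "
        "and name the single dominant failure mode.\n"
        f"Modes: {'; '.join(FAILURE_MODES)}\n"
        f"Gold reasoning: {gold_reasoning}\n"
        f"Model reasoning: {model_reasoning}\n"
        "Answer with the mode name only."
    )
    label = judge.complete(prompt).strip().lower()
    # Snap the judge's free-form answer back onto the known taxonomy.
    return next((m for m in FAILURE_MODES if m in label), "unclassified")
```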