MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning

📅 2025-06-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit significant limitations in complex spatial reasoning and planning, in part because existing benchmarks rarely demand rigorous, step-wise cross-modal reasoning. To address this gap, the authors introduce MARBLE, a benchmark explicitly designed to evaluate multi-step reasoning and planning under spatial, visual, and physical constraints. It comprises two challenging tasks: M-Portal (spatial path planning with portal-based teleportation) and M-Cube (3D physical state evolution inference). Evaluated on 12 state-of-the-art MLLMs, all models perform near-chance on M-Portal and achieve 0% accuracy on M-Cube; only on simplified subtasks do some models exceed the random baseline. These results expose dual bottlenecks in joint perception-reasoning: models occasionally fail to extract the relevant information from visual inputs, and they struggle with sequential, structured planning. MARBLE thus provides both a critical evaluation framework and empirical evidence to guide the development of next-generation architectures capable of deep multimodal coordination and iterative, structured reasoning.
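To make the reported metrics concrete, here is a minimal sketch of how model accuracy could be compared against a random baseline on a binary plan-verification task like M-Portal. The question format and the `query_model` function are hypothetical stand-ins for illustration, not part of any released MARBLE codebase.

```python
import random

def random_baseline(questions, trials=1000, seed=0):
    """Expected accuracy of uniform yes/no guessing over the question set."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        for q in questions:
            if rng.choice(["yes", "no"]) == q["answer"]:
                hits += 1
    return hits / (trials * len(questions))

def model_accuracy(questions, query_model):
    """Fraction of questions the model answers correctly.

    `query_model(image, prompt)` is assumed to return a "yes"/"no" string.
    """
    correct = sum(
        1
        for q in questions
        if query_model(q["image"], q["prompt"]).strip().lower() == q["answer"]
    )
    return correct / len(questions)

# Example usage (hypothetical data format):
# questions = [{"image": img, "prompt": "Is this plan valid?", "answer": "yes"}, ...]
# A model whose model_accuracy() is statistically indistinguishable from
# random_baseline() is performing "near-random", as reported for M-Portal.
```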

📝 Abstract
The ability to process information from multiple modalities and to reason through it step-by-step remains a critical challenge in advancing artificial intelligence. However, existing reasoning benchmarks focus on text-only reasoning, or employ multimodal questions that can be answered by directly retrieving information from a non-text modality. Thus, complex reasoning remains poorly understood in multimodal domains. Here, we present MARBLE, a challenging multimodal reasoning benchmark that is designed to scrutinize multimodal language models (MLLMs) in their ability to carefully reason step-by-step through complex multimodal problems and environments. MARBLE is composed of two highly challenging tasks, M-Portal and M-Cube, that require the crafting and understanding of multistep plans under spatial, visual, and physical constraints. We find that current MLLMs perform poorly on MARBLE: all 12 advanced models obtain near-random performance on M-Portal and 0% accuracy on M-Cube. Only in simplified subtasks do some models outperform the random baseline, indicating that complex reasoning is still a challenge for existing MLLMs. Moreover, we show that perception remains a bottleneck, where MLLMs occasionally fail to extract information from the visual inputs. By shedding light on the limitations of MLLMs, we hope that MARBLE will spur the development of the next generation of models with the ability to reason and plan across many multimodal reasoning steps.
Problem

Research questions and friction points this paper is trying to address.

Assessing multimodal spatial reasoning and planning in AI models
Evaluating step-by-step complex reasoning in multimodal domains
Identifying perception limitations in multimodal language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal benchmark for complex reasoning tasks
Planning under spatial and visual constraints
Evaluation of step-by-step reasoning in MLLMs