Cube Bench: A Benchmark for Spatial Visual Reasoning in MLLMs

📅 2025-12-23
🤖 AI Summary
Existing multimodal large language models (MLLMs) lack a standardized evaluation of spatial and sequential reasoning. Method: We introduce the Rubik's Cube Benchmark, the first standardized, cube-solving-oriented benchmark, which uses controllable-complexity cube states as a unified visual-symbolic testbed. It systematically evaluates five core competencies: spatial perception, single-step decision-making, forward simulation, multi-step planning, and self-correction, with a Hamming-distance-based metric quantifying proximity to the goal state. The methodology employs joint image-text inputs, standardized prompting templates, deterministic action parsing, and a reflective self-correction mechanism. Contribution/Results: Experiments across seven mainstream MLLMs reveal a significant capability gap between closed- and open-source models: performance declines sharply with increasing scramble depth, high reconstruction accuracy does not imply action validity, and all models degrade severely under high complexity.
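
The distance metric is straightforward to make concrete. Below is a minimal sketch in Python, not the authors' code, of a Hamming-distance measure over cube states; the flat 54-sticker encoding (nine stickers per face, one colour letter each) is an assumption, and the paper's exact representation may differ.

```python
# Minimal sketch of a Hamming-distance-to-goal metric. The flat encoding
# (54 sticker colours, 9 per face) is an assumption, not the paper's
# exact state representation.

SOLVED = "W" * 9 + "Y" * 9 + "G" * 9 + "B" * 9 + "R" * 9 + "O" * 9

def hamming_to_goal(state, goal=SOLVED):
    """Count stickers that differ from the goal state."""
    if len(state) != len(goal):
        raise ValueError("state must have one colour per sticker")
    return sum(a != b for a, b in zip(state, goal))

def proximity(state, goal=SOLVED):
    """Normalise to [0, 1]; 1.0 means solved."""
    return 1.0 - hamming_to_goal(state, goal) / len(goal)

print(proximity(SOLVED))                # 1.0
print(proximity("Y" * 9 + SOLVED[9:]))  # one face fully wrong
```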

📝 Abstract
We introduce Cube Bench, a Rubik's-cube benchmark for evaluating spatial and sequential reasoning in multimodal large language models (MLLMs). The benchmark decomposes performance into five skills: (i) reconstructing cube faces from images and text, (ii) choosing the optimal next move, (iii) predicting the outcome of a candidate move without applying it, (iv) executing multi-step plans while recovering from mistakes, and (v) detecting and revising one's own errors. Using a shared set of scrambled cube states, identical prompts and parsers, and a single distance-to-solved metric, we compare recent MLLMs side by side as a function of scramble depth. Across seven MLLMs, accuracy drops sharply with depth; once a trajectory stalls or diverges, models rarely recover, and high face-reconstruction accuracy does not guarantee competent action selection or multi-step execution. A pronounced closed- versus open-source gap emerges: the strongest closed model leads on both single-step perception tasks and multi-step control tasks, while open-weight models cluster near chance on the hardest settings; yet even the best MLLM degrades at higher cube complexity. A simple self-correction mechanism based on reflective thinking yields modest gains but can also introduce overthinking. Cube Bench offers a compact, reproducible probe of sequential spatial reasoning in MLLMs.
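
The identical prompts and parsers mentioned in the abstract can be illustrated with a short sketch of deterministic action parsing. Singmaster notation (U, D, L, R, F, B with optional ' and 2 modifiers) is the standard convention for 3x3x3 cubes, but the benchmark's actual parsing rules are an assumption here.

```python
import re

# Hypothetical sketch of deterministic action parsing: pull well-formed
# Singmaster moves out of a model's free-text reply. The benchmark's
# actual rules may differ.
MOVE_RE = re.compile(r"(?<![A-Za-z])([UDLRFB])(['2]?)(?![A-Za-z0-9])")

def parse_moves(reply, max_moves=None):
    """Return a canonical move list, e.g. ['U', "R'", 'F2']."""
    moves = [face + mod for face, mod in MOVE_RE.findall(reply)]
    return moves if max_moves is None else moves[:max_moves]

assert parse_moves("Play U, then R' and finally F2.") == ["U", "R'", "F2"]
```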
Problem

Research questions and friction points this paper is trying to address.

Evaluates spatial and sequential reasoning in multimodal large language models
Measures skills such as cube reconstruction, move prediction, and error recovery (see the sketch after this list)
Assesses model performance degradation with increasing problem complexity
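
To make the error-recovery skill concrete, here is a hedged sketch of a reflect-and-retry loop; `query_model` and `is_valid_move` are hypothetical stand-ins, and the paper's reflective self-correction mechanism may differ.

```python
# Hedged sketch of a reflect-and-retry loop; `query_model` and
# `is_valid_move` are hypothetical stand-ins, not the paper's API.
def propose_with_reflection(query_model, is_valid_move, prompt, max_retries=2):
    """Ask for a move; when the reply is invalid, feed the error back."""
    reply = query_model(prompt)
    for _ in range(max_retries):
        if is_valid_move(reply):
            break
        reflection = (f"Your previous answer {reply!r} was not a legal move. "
                      "Reconsider the cube state and answer with one move.")
        reply = query_model(prompt + "\n" + reflection)
    return reply
```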
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rubik's-cube benchmark for spatial reasoning evaluation
Decomposes performance into five distinct reasoning skills
Uses shared scrambled states and a single distance-to-solved metric for side-by-side comparison (see the sketch below)
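
As one way such shared states could be produced at a controlled depth, the sketch below draws seeded, fixed-depth scrambles; applying the moves to render cube images would need a simulator that is not shown, and the paper's generator may differ.

```python
import random

# Hedged sketch of fixed-depth scramble generation so every model sees the
# same controlled-complexity states. Applying the moves to produce cube
# images needs a simulator (not shown); the paper's generator may differ.
FACES = "UDLRFB"
MODIFIERS = ["", "'", "2"]

def random_scramble(depth, rng):
    """Return `depth` moves, never turning the same face twice in a row
    (consecutive same-face turns would partially cancel)."""
    seq, last_face = [], None
    while len(seq) < depth:
        face = rng.choice(FACES)
        if face == last_face:
            continue
        seq.append(face + rng.choice(MODIFIERS))
        last_face = face
    return seq

# A fixed seed keeps benchmark states identical across models.
print(random_scramble(5, random.Random(0)))
```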
Authors
Dhruv Anand (Department of Data Science and AI, Monash University, Victoria, Australia)
Ehsan Shareghi (Monash University, Natural Language Processing)