🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit significant deficiencies in multi-step spatial reasoning, particularly in modeling cross-step geometric relationships and performing sequential logical deduction, which hinders their deployment in real-world applications such as robotic manipulation and autonomous navigation. To address this, we introduce LEGO-Puzzles, the first LEGO-based visual question answering benchmark explicitly designed for multi-step spatial reasoning (1,100 samples), alongside a scalable evaluation framework that unifies geometric relation modeling with sequential logical inference. The dual-task design combines visual question answering (VQA) with instruction-driven image generation, supported by human-annotated ground truth and human performance baselines. Experiments reveal that state-of-the-art MLLMs achieve only ~50% average accuracy, substantially below human performance (>90%); only Gemini-2.0-Flash and GPT-4o demonstrate even a preliminary capability for instruction-following image generation. This work exposes a fundamental bottleneck in joint spatio-temporal reasoning and provides a standardized, reproducible evaluation toolkit for future research.
📝 Abstract
Multi-step spatial reasoning entails understanding and reasoning about spatial relationships across multiple sequential steps, which is crucial for tackling complex real-world applications, such as robotic manipulation, autonomous navigation, and automated assembly. To assess how well current Multimodal Large Language Models (MLLMs) have acquired this fundamental capability, we introduce **LEGO-Puzzles**, a scalable benchmark designed to evaluate both **spatial understanding** and **sequential reasoning** in MLLMs through LEGO-based tasks. LEGO-Puzzles consists of 1,100 carefully curated visual question-answering (VQA) samples spanning 11 distinct tasks, ranging from basic spatial understanding to complex multi-step reasoning. Based on LEGO-Puzzles, we conduct a comprehensive evaluation of state-of-the-art MLLMs and uncover significant limitations in their spatial reasoning capabilities: even the most powerful MLLMs can answer only about half of the test cases, whereas human participants achieve over 90% accuracy. In addition to VQA tasks, we evaluate MLLMs' abilities to generate LEGO images following assembly illustrations. Our experiments show that only Gemini-2.0-Flash and GPT-4o exhibit a limited ability to follow these instructions, while other MLLMs either replicate the input image or generate completely irrelevant outputs. Overall, LEGO-Puzzles exposes critical deficiencies in existing MLLMs' spatial understanding and sequential reasoning capabilities, and underscores the need for further advancements in multimodal spatial reasoning.