🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit significant deficiencies in multi-step spatial reasoning, particularly in modeling cross-step geometric relationships and performing sequential logical deduction, which hinders their deployment in real-world applications such as robotic manipulation and autonomous navigation. To address this, we introduce LEGO-Puzzles, the first LEGO-based visual question answering benchmark explicitly designed for multi-step spatial reasoning (1,100 samples), alongside a scalable evaluation framework that unifies geometric relation modeling with sequential logical inference. The dual-task design combines visual question answering (VQA) with instruction-driven image generation, supported by human-annotated ground truth and human performance baselines. Experiments reveal that state-of-the-art MLLMs achieve only ~50% average accuracy, substantially below human performance (>90%); only Gemini-2.0-Flash and GPT-4o demonstrate even a preliminary capability for instruction-following image generation. This work exposes a fundamental bottleneck in joint spatio-temporal reasoning and provides a standardized, reproducible evaluation toolkit for future research.
📝 Abstract
Multi-step spatial reasoning entails understanding and reasoning about spatial relationships across multiple sequential steps, which is crucial for tackling complex real-world applications, such as robotic manipulation, autonomous navigation, and automated assembly. To assess how well current Multimodal Large Language Models (MLLMs) have acquired this fundamental capability, we introduce **LEGO-Puzzles**, a scalable benchmark designed to evaluate both **spatial understanding** and **sequential reasoning** in MLLMs through LEGO-based tasks. LEGO-Puzzles consists of 1,100 carefully curated visual question-answering (VQA) samples spanning 11 distinct tasks, ranging from basic spatial understanding to complex multi-step reasoning. Based on LEGO-Puzzles, we conduct a comprehensive evaluation of state-of-the-art MLLMs and uncover significant limitations in their spatial reasoning capabilities: even the most powerful MLLMs can answer only about half of the test cases, whereas human participants achieve over 90% accuracy. In addition to VQA tasks, we evaluate MLLMs' abilities to generate LEGO images following assembly illustrations. Our experiments show that only Gemini-2.0-Flash and GPT-4o exhibit a limited ability to follow these instructions, while other MLLMs either replicate the input image or generate completely irrelevant outputs. Overall, LEGO-Puzzles exposes critical deficiencies in existing MLLMs' spatial understanding and sequential reasoning capabilities, and underscores the need for further advancements in multimodal spatial reasoning.