Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) perform poorly on complex 3D spatial reasoning tasks because they rely on 2D visual priors. This work proposes a training-free framework built around a visual chain-of-thought mechanism grounded in explicit 3D reconstruction: guided by the MLLM itself, the pipeline reconstructs a high-fidelity 3D mesh from a single image. By iteratively optimizing camera viewpoints with the help of an external knowledge base, the method emulates human-like multi-view reasoning. Its two core innovations, multi-granularity keyword-guided 3D reconstruction and an active viewpoint exploration mechanism, yield substantial improvements in MLLM 3D spatial understanding without any fine-tuning, overcoming the limited geometric modeling fidelity and viewpoint flexibility of existing tool-augmented approaches. Experiments demonstrate that the proposed framework significantly outperforms both specialized spatial reasoning models and state-of-the-art multimodal LLMs, including GPT-5.2 and Gemini-2.5-Flash, on benchmarks such as 3DSRBench and Rel3D.
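
To make the active viewpoint exploration mechanism concrete, the sketch below is a minimal, runnable Python illustration, not the authors' code: reconstruct_mesh, render_view, and mllm_score are hypothetical stand-ins for the reconstruction, rendering, and MLLM-scoring stages, and the coarse-to-fine sweep over the viewing sphere is just one plausible way to "iteratively optimize camera viewpoints".

```python
import random

def reconstruct_mesh(image):
    """Hypothetical stand-in for single-image 3D mesh reconstruction."""
    return {"source": image}

def render_view(mesh, azimuth_deg, elevation_deg):
    """Hypothetical stand-in for rendering the mesh from a candidate view."""
    return f"render(az={azimuth_deg:.1f}, el={elevation_deg:.1f})"

def mllm_score(view, question):
    """Hypothetical stand-in: how useful the MLLM judges this view to be."""
    return random.random()

def explore_viewpoints(image, question, n_rounds=3):
    """Coarse-to-fine search over the viewing sphere: score a small grid of
    candidate views, then shrink the grid around the best one each round."""
    mesh = reconstruct_mesh(image)
    (az_c, el_c), spread = (0.0, 30.0), 90.0
    best_view, best_score = (az_c, el_c), float("-inf")
    for _ in range(n_rounds):
        for d_az in (-spread, 0.0, spread):
            for d_el in (-spread / 2, 0.0, spread / 2):
                az, el = az_c + d_az, el_c + d_el
                score = mllm_score(render_view(mesh, az, el), question)
                if score > best_score:
                    best_score, best_view = score, (az, el)
        (az_c, el_c), spread = best_view, spread / 2  # zoom in on the best view
    return best_view, best_score

if __name__ == "__main__":
    view, score = explore_viewpoints("kitchen.jpg", "Is the mug left of the kettle?")
    print(f"best view (az, el) = {view}, score = {score:.3f}")
```

Halving the search spread each round keeps the number of MLLM calls linear in the number of rounds while still converging on an informative viewpoint.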
📝 Abstract
Although Multimodal Large Language Models have achieved remarkable progress, they still struggle with complex 3D spatial reasoning due to their reliance on 2D visual priors. Existing approaches typically mitigate this limitation either through computationally expensive post-training on limited 3D datasets or through rigid tool-calling mechanisms that lack explicit geometric understanding and viewpoint flexibility. To address these challenges, we propose a training-free framework that introduces a Visual Chain-of-Thought mechanism grounded in explicit 3D reconstruction. The proposed pipeline first reconstructs a high-fidelity 3D mesh from a single image using MLLM-guided keyword extraction and mask generation at multiple granularities. Subsequently, the framework leverages an external knowledge base to iteratively compute optimal camera extrinsic parameters and synthesize novel views, thereby emulating human perspective-taking. Extensive experiments demonstrate that the proposed approach significantly enhances spatial comprehension. Specifically, the framework outperforms specialized spatial models and general-purpose MLLMs, including GPT-5.2 and Gemini-2.5-Flash, on major benchmarks such as 3DSRBench and Rel3D.
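
As one concrete reading of "computing camera extrinsic parameters" for novel-view synthesis (an illustration under assumed conventions, not the paper's implementation), the snippet below parameterizes a candidate viewpoint by azimuth, elevation, and radius around the scene center and builds an OpenCV-style world-to-camera look-at matrix with NumPy.

```python
import numpy as np

def look_at_extrinsics(azimuth_deg: float, elevation_deg: float,
                       radius: float, center=np.zeros(3)) -> np.ndarray:
    """4x4 world-to-camera matrix for a camera on a sphere around `center`,
    looking at `center` (OpenCV convention: x right, y down, z forward)."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    # Camera position in spherical coordinates around the scene center.
    eye = center + radius * np.array([np.cos(el) * np.cos(az),
                                      np.cos(el) * np.sin(az),
                                      np.sin(el)])
    forward = center - eye
    forward /= np.linalg.norm(forward)
    world_up = np.array([0.0, 0.0, 1.0])  # degenerate at elevation = +/-90 deg
    right = np.cross(forward, world_up)
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    R = np.stack([right, -up, forward])  # rows: camera x, y (down), z axes
    E = np.eye(4)
    E[:3, :3] = R
    E[:3, 3] = -R @ eye  # translation so that X_cam = R @ X_world + t
    return E

# Example: a candidate view at 45 deg azimuth, 30 deg elevation, 2.5 units out.
print(np.round(look_at_extrinsics(45.0, 30.0, 2.5), 3))
```

An exploration loop like the one sketched under the AI summary would sweep azimuth and elevation, feed each resulting extrinsic matrix to a renderer, and hand the synthesized views back to the MLLM.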
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
3D spatial reasoning
spatial understanding
perspective-taking
3D scene understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
3D reconstruction
Visual Chain-of-Thought
multi-perspective reasoning
spatial understanding
👥 Authors
Jiahua Chen (Tsinghua University)
Qihong Tang (Nanjing University)
Weinong Wang (Xi'an Jiaotong University)
Qi Fan (Nanjing University)