🤖 AI Summary
Vision-language models (VLMs) face two key bottlenecks in 3D spatial reasoning: insufficient 3D understanding stemming from 2D-centric pretraining, and reasoning failures induced by redundant 3D information. To address these, we propose MSSR, a dual-agent framework with a Situated Orientation Grounding (SOG) module, introducing the "minimal sufficiency" principle to VLM-based spatial reasoning for the first time. Our method employs programmatic querying, a 3D-aware toolbox, and a closed-loop iterative mechanism of information pruning and completion to automatically construct task-driven, minimal sufficient information sets. This significantly improves both reasoning accuracy and interpretability. On two challenging 3D spatial reasoning benchmarks, our approach achieves state-of-the-art performance while generating high-quality, structured reasoning traces. These interpretable, faithful reasoning paths serve as reliable supervisory signals and high-fidelity training data for downstream models.
📝 Abstract
Spatial reasoning, the ability to ground language in 3D understanding, remains a persistent challenge for Vision-Language Models (VLMs). We identify two fundamental bottlenecks: inadequate 3D understanding stemming from 2D-centric pre-training, and reasoning failures induced by redundant 3D information. To address these, we propose to first construct a Minimal Sufficient Set (MSS) of information before answering a given question: a compact selection of 3D perception results from expert models. We introduce MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that implements this principle. A Perception Agent programmatically queries 3D scenes to extract sufficient information, using a versatile perception toolbox that includes a novel SOG (Situated Orientation Grounding) module for robustly extracting language-grounded directions. A Reasoning Agent then iteratively refines this information to pursue minimality, pruning redundant details and requesting missing ones in a closed loop until the MSS is curated. Extensive experiments demonstrate that our method, by explicitly pursuing both sufficiency and minimality, significantly improves accuracy and achieves state-of-the-art performance across two challenging benchmarks. Furthermore, our framework produces interpretable reasoning paths, offering a promising source of high-quality training data for future models. Source code is available at https://github.com/gyj155/mssr.
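The closed loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `perceive`/`review` callables, the `Verdict` structure, and the toy scene are all hypothetical stand-ins for the actual Perception Agent, Reasoning Agent, and 3D toolbox in the repository.

```python
from dataclasses import dataclass, field

@dataclass
class Verdict:
    """Hypothetical critique returned by a Reasoning Agent."""
    sufficient: bool                          # can the question be answered?
    redundant: list = field(default_factory=list)  # facts to prune
    missing: list = field(default_factory=list)    # queries to issue next

def curate_mss(question, scene, perceive, review, max_iters=5):
    """Closed loop: prune redundant facts and request missing ones
    until the information set is both sufficient and minimal (the MSS)."""
    info = perceive(question, scene)          # sufficiency-oriented first pass
    for _ in range(max_iters):
        v = review(question, info)            # minimality-oriented critique
        info = [f for f in info if f not in v.redundant]
        if v.missing:
            info += perceive(v.missing, scene)
        if v.sufficient and not v.redundant and not v.missing:
            break                             # MSS curated
    return info

# --- toy usage with stub agents ---
scene = {"dist(chair,table)": 1.2, "color(chair)": "red", "dir(door)": "left"}

def perceive(query, scene):
    # On the first pass the query is the raw question: return everything.
    keys = query if isinstance(query, list) else list(scene)
    return [(k, scene[k]) for k in keys if k in scene]

def review(question, info):
    needed = {"dist(chair,table)", "dir(door)"}  # what this question requires
    have = {k for k, _ in info}
    return Verdict(
        sufficient=needed <= have,
        redundant=[(k, v) for k, v in info if k not in needed],
        missing=sorted(needed - have),
    )

mss = curate_mss("Is the door to the left of the table?", scene, perceive, review)
```

Here the loop prunes the irrelevant `color(chair)` fact on the first pass and terminates on the second, returning only the two facts the question actually needs.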