🤖 AI Summary
This work proposes Map2Thought, a framework built on Metric-CogMap and Cognitive Chain-of-Thought (Cog-CoT) that adds explicit, interpretable spatial reasoning to 3D vision-language models. Metric-CogMap combines a discrete grid for relational reasoning with a continuous, metric-scale representation for precise geometry. Cog-CoT reasons over this map with deterministic operations: vector operations, bounding-box distances, and occlusion-aware appearance-order cues. On VSI-Bench, the model reaches 59.9% accuracy with only half the supervision, close to the 60.9% baseline trained on the full dataset, and it outperforms state-of-the-art methods by 5.3%, 4.8%, and 4.0% under the 10%, 25%, and 50% training-data regimes, respectively. These results suggest that the approach reduces reliance on supervised annotations while keeping performance competitive.
📝 Abstract
We propose Map2Thought, a framework that enables explicit and interpretable spatial reasoning for 3D VLMs. The framework is grounded in two key components: Metric Cognitive Map (Metric-CogMap) and Cognitive Chain-of-Thought (Cog-CoT). Metric-CogMap provides a unified spatial representation by integrating a discrete grid for relational reasoning with a continuous, metric-scale representation for precise geometric understanding. Building upon the Metric-CogMap, Cog-CoT performs explicit geometric reasoning through deterministic operations, including vector operations, bounding-box distances, and occlusion-aware appearance-order cues, producing interpretable inference traces grounded in 3D structure. Experimental results on VSI-Bench show that Map2Thought enables explainable 3D understanding, achieving 59.9% accuracy using only half the supervision, closely matching the 60.9% baseline trained with the full dataset. It consistently outperforms state-of-the-art methods by 5.3%, 4.8%, and 4.0% under 10%, 25%, and 50% training subsets, respectively.
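To make the kind of deterministic operation named in the abstract concrete, here is a minimal Python sketch of a bounding-box distance, a center-to-center vector, and a metric-to-grid lookup. It is an illustration under stated assumptions, not the authors' implementation: the axis-aligned box format, the function names (`aabb_distance`, `center`, `direction`, `to_grid_cell`), the 0.5 m cell size, and the example objects are all hypothetical.

```python
import math

# Assumption: each object is an axis-aligned box ((min_x, min_y, min_z), (max_x, max_y, max_z))
# in metres. The abstract does not specify the box parameterisation.

def aabb_distance(box_a, box_b):
    """Minimum Euclidean distance between two axis-aligned boxes (0.0 if they touch or overlap)."""
    (a_min, a_max), (b_min, b_max) = box_a, box_b
    # Per-axis gap: positive only when the boxes are separated along that axis.
    gaps = [max(b_min[i] - a_max[i], a_min[i] - b_max[i], 0.0) for i in range(3)]
    return math.sqrt(sum(g * g for g in gaps))

def center(box):
    """Geometric center of a box."""
    lo, hi = box
    return tuple((l + h) / 2 for l, h in zip(lo, hi))

def direction(box_from, box_to):
    """Vector from the center of one box to the center of another."""
    return tuple(t - f for f, t in zip(center(box_from), center(box_to)))

def to_grid_cell(point, cell_size=0.5):
    """Map a metric (x, y) position to a discrete grid cell; cell_size is a hypothetical choice."""
    return (math.floor(point[0] / cell_size), math.floor(point[1] / cell_size))

# Hypothetical scene with two objects.
chair = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
table = ((4.0, 5.0, 0.0), (5.0, 6.0, 1.0))

print(aabb_distance(chair, table))   # 5.0 (gaps of 3 m in x and 4 m in y)
print(direction(chair, table))       # (4.0, 5.0, 0.0)
print(to_grid_cell(center(table)))   # (9, 11)
```

Because each step is a closed-form computation on box coordinates, the intermediate values (per-axis gaps, center offsets, grid indices) can be written out as a readable trace, which is the sense in which such reasoning is inspectable.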