🤖 AI Summary
Existing vision-language models (VLMs) struggle to use geometric information for spatial reasoning in both static and dynamic scenes. To address this, the work proposes GeoSR, a framework that injects geometry tokens from a pretrained 3D foundation model into a VLM through two components: a masking strategy that hides portions of the 2D vision tokens during training to weaken 2D visual shortcuts, and a gated routing mechanism that adaptively amplifies the contribution of geometry tokens in regions where geometric evidence is critical. Experiments on static and dynamic spatial reasoning benchmarks show that GeoSR consistently outperforms prior methods and sets a new state of the art.
📝 Abstract
Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent work attempts to address this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning often leaves these geometric cues underutilized for spatial reasoning, because VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of the 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information. The project page is available at https://suhzhang.github.io/GeoSR/.
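To make the two components described in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes uniform random masking as a stand-in for the "strategic" masking, and a per-token scalar sigmoid gate over concatenated vision and geometry features as a stand-in for the gated routing. The function names `mask_vision_tokens` and `gated_fusion`, the parameters `mask_ratio`, `gate_weights`, and `gate_bias`, and the residual form `vision + gate * geometry` are all hypothetical choices made for illustration.

```python
import math
import random


def mask_vision_tokens(vision_tokens, mask_ratio, rng):
    """Zero out a random subset of 2D vision tokens (training-time only).

    Assumption: uniform random selection stands in for the paper's
    "strategic" masking, whose actual selection rule is not given here.
    """
    n = len(vision_tokens)
    n_mask = int(round(mask_ratio * n))
    masked_idx = set(rng.sample(range(n), n_mask))
    masked = [
        [0.0] * len(tok) if i in masked_idx else list(tok)
        for i, tok in enumerate(vision_tokens)
    ]
    return masked, masked_idx


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_fusion(vision_tokens, geometry_tokens, gate_weights, gate_bias):
    """Fuse each vision token with its geometry token through a scalar gate.

    Assumption: the gate is a sigmoid of a linear map over the concatenated
    [vision, geometry] features, so gate_weights has length 2 * dim.
    """
    fused, gates = [], []
    for v, g in zip(vision_tokens, geometry_tokens):
        features = list(v) + list(g)
        gate = sigmoid(sum(w * x for w, x in zip(gate_weights, features)) + gate_bias)
        # A larger gate lets the geometry token contribute more at this position.
        fused.append([vi + gate * gi for vi, gi in zip(v, g)])
        gates.append(gate)
    return fused, gates


# Toy usage: 4 tokens of dimension 2, half of the vision tokens masked.
rng = random.Random(0)
vision = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
geometry = [[0.5, 0.5] for _ in vision]
masked, masked_idx = mask_vision_tokens(vision, mask_ratio=0.5, rng=rng)
fused, gates = gated_fusion(masked, geometry, gate_weights=[0.0] * 4, gate_bias=0.0)
```

In this sketch, a masked position carries no 2D appearance signal, so anything the fused token retains there comes from the geometry token scaled by the gate. This mirrors the stated intent of forcing the model to consult geometry, although the actual GeoSR design may differ in both the masking rule and the gate parameterization.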