Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

📅 2026-02-05
🤖 AI Summary
Existing multimodal large language models fuse geometric information passively, which leads to semantic-geometry misalignment and redundant signals. The authors propose GeoThinker, a framework that replaces passive fusion with active perception: Spatial-Grounded Fusion is applied at selected vision-language model layers so that the model retrieves task-relevant geometric evidence on demand. Semantic visual features query the output of a 3D geometric encoder through frame-strict cross-attention, and Importance Gating biases per-frame attention toward task-relevant structures. GeoThinker reaches a peak score of 72.6 on VSI-Bench, which the authors report as a new state of the art, and it generalizes to downstream scenarios such as embodied referring and autonomous driving.

📝 Abstract
Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code can be found at https://github.com/Li-Hao-yuan/GeoThinker.
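
The abstract does not give equations for the fusion step, so the following is only a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes that "frame-strict" means semantic tokens of a frame attend only to geometry tokens of the same frame, and that Importance Gating can be modeled as a per-token scalar bias added to the attention logits. The function name `frame_strict_gated_fusion`, the gate vector `w_gate`, and the residual update are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def frame_strict_gated_fusion(sem, geo, w_gate):
    """Illustrative sketch only (assumed formulation, not from the paper).

    sem:    (F, N, D) semantic visual tokens per frame, used as queries
    geo:    (F, M, D) geometry tokens per frame, used as keys and values
    w_gate: (D,)      hypothetical projection giving one importance score
                      per geometry token
    """
    d = sem.shape[-1]
    # Frame-strict attention: the shared leading index f means queries of
    # frame f are scored only against geometry tokens of frame f.
    logits = np.einsum("fnd,fmd->fnm", sem, geo) / np.sqrt(d)
    # Assumed form of importance gating: a scalar bias per geometry token,
    # broadcast over all queries of the same frame.
    importance = geo @ w_gate                      # (F, M)
    logits = logits + importance[:, None, :]
    attn = softmax(logits, axis=-1)                # (F, N, M)
    # Retrieve geometry and add it to the semantic stream (assumed residual).
    retrieved = np.einsum("fnm,fmd->fnd", attn, geo)
    return sem + retrieved, attn

# Toy usage: 2 frames, 3 semantic tokens, 5 geometry tokens, width 4.
rng = np.random.default_rng(0)
sem = rng.normal(size=(2, 3, 4))
geo = rng.normal(size=(2, 5, 4))
w_gate = rng.normal(size=4)
fused, attn = frame_strict_gated_fusion(sem, geo, w_gate)
print(fused.shape, attn.shape)
```

Because attention is computed independently per frame, changing the geometry tokens of one frame leaves the fused output of every other frame unchanged. That is the property this sketch takes "frame-strict" to refer to.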

Problem

Research questions and friction points this paper is trying to address.

spatial reasoning
geometric priors
passive fusion
semantic-geometry misalignment
Multimodal Large Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

active geometry integration
spatial reasoning
multimodal large language models
cross-attention
importance gating