GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language models typically share the same geometric cues across all visual tokens, ignoring that tokens with different spatial roles need different geometric evidence; this limits fine-grained spatio-temporal reasoning. This work proposes GeoWeaver, a framework that treats geometric information as a representational prerequisite for visual tokens rather than a post-hoc auxiliary signal. GeoWeaver builds a multi-level geometry bank from a frozen geometry encoder, uses token-adaptive allocation so that each visual token retrieves the geometric evidence most relevant to it, and integrates that evidence through a residual grounding operation before language modeling. Experiments on spatial reasoning benchmarks show that GeoWeaver consistently improves geometry-aware reasoning while retaining general multimodal capabilities, supporting the use of geometric priors to strengthen spatial reasoning in vision-language models.

📝 Abstract

Spatio-temporal reasoning in vision-language models requires visual representations that preserve physical geometry rather than merely semantic appearance. Recent multimodal models incorporate geometric information through structural branches, 3D-aware supervision, reasoning-stage fusion, or long-horizon memory. While these approaches demonstrate the importance of geometry for spatial intelligence, they typically treat geometric cues as a shared signal across all visual tokens. We note that this overlooks a finer-grained challenge: different visual tokens require different geometric evidence depending on their spatial roles. To address this limitation, we introduce GeoWeaver, a pre-reasoning geometric grounding framework that treats geometry as a representational prerequisite for spatio-temporal reasoning. GeoWeaver constructs a multi-level geometry bank from a frozen geometry encoder and performs token-adaptive geometric evidence allocation, enabling each visual token to retrieve the most relevant geometric abstractions. The selected evidence is incorporated into visual tokens via a residual grounding operation prior to language modeling, yielding geometry-grounded representations for downstream reasoning. Extensive evaluations on spatial reasoning benchmarks demonstrate that GeoWeaver consistently enhances geometry-aware reasoning while retaining general multimodal capabilities. This indicates that geometric information yields the greatest benefit not as a late-fusion auxiliary signal but as a fundamental prerequisite that shapes the representational foundation on which large language models perform reasoning. All source code and models will be released at https://github.com/yahooo-m/GeoWeaver .
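
The abstract names three steps (a multi-level geometry bank, token-adaptive evidence allocation, and residual grounding) but does not give their exact form. The following is a minimal illustrative sketch, not the authors' implementation. It assumes that each bank level holds one geometric feature per visual token with the same dimension as the token, that allocation is a softmax over scaled dot-product scores across levels, and that grounding is a simple scaled residual addition. The names `ground_tokens`, `geometry_bank`, and `alpha` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ground_tokens(tokens, geometry_bank, alpha=1.0):
    """Hypothetical pre-reasoning grounding step (assumed form).

    tokens:        N visual tokens, each a list of d floats.
    geometry_bank: L levels; each level is N geometric features of dim d
                   (assumed to come from a frozen geometry encoder).
    alpha:         assumed scale of the residual update.
    Returns the grounded tokens and the per-token weights over levels.
    """
    scale = math.sqrt(len(tokens[0]))
    grounded, weights_per_token = [], []
    for i, tok in enumerate(tokens):
        # Candidate evidence for token i: one feature from each bank level.
        candidates = [level[i] for level in geometry_bank]
        # Token-adaptive allocation: each token gets its own level weights.
        weights = softmax([dot(tok, c) / scale for c in candidates])
        # Weighted mix of the candidate features.
        evidence = [
            sum(w * c[k] for w, c in zip(weights, candidates))
            for k in range(len(tok))
        ]
        # Residual grounding: add the evidence to the original token.
        grounded.append([t + alpha * e for t, e in zip(tok, evidence)])
        weights_per_token.append(weights)
    return grounded, weights_per_token


# Toy example: two tokens, two bank levels, dimension 2.
tokens = [[1.0, 0.0], [0.0, 1.0]]
geometry_bank = [
    [[1.0, 0.0], [1.0, 0.0]],  # level 0 features for tokens 0 and 1
    [[0.0, 1.0], [0.0, 1.0]],  # level 1 features for tokens 0 and 1
]
grounded, weights = ground_tokens(tokens, geometry_bank)
```

In this toy setup the two tokens weight the bank levels differently, which is the token-adaptive behavior the abstract contrasts with a single geometric signal shared by all tokens. Setting `alpha=0.0` returns the original tokens unchanged, so in this sketch the grounding is purely an additive correction applied before the tokens reach the language model.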

Problem

Research questions and friction points this paper is trying to address.

spatio-temporal reasoning

visual tokens

geometric evidence

geometry grounding

vision-language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

geometric grounding

token-adaptive allocation

spatio-temporal reasoning