Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs

📅 2026-04-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge that multimodal large language models struggle to perceive physical spatial structure in real-world visual data, often losing local detail and suffering semantic misalignment because geometric features are extracted from a single layer and fused early. To overcome these limitations, the authors propose GUIDE, a framework built around a layer-wise unrolled geometric prior injection mechanism. It captures multi-granularity geometric features, from local edges to global topology, through hierarchical sampling, and progressively aligns them with the model's early-layer representations to guide incremental learning of the 2D-to-3D transformation. A context-aware gating mechanism additionally activates only the spatial cues relevant to the current semantics, improving how efficiently geometric priors are used while suppressing noise. Extensive experiments demonstrate that GUIDE significantly outperforms existing baselines on complex spatial reasoning tasks, establishing a new paradigm for integrating 3D geometric priors into multimodal large models.
📝 Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable progress on 2D visual tasks but still exhibit limited physical spatial awareness when processing real-world visual streams. Recently, feed-forward geometric foundation models, which implicitly extract geometric priors, have provided a new pathway to address this issue. However, existing geometry-aware MLLMs are predominantly constrained by the paradigm of single deep-layer extraction and input-level fusion. This flattened fusion loses local geometric detail and causes semantic mismatches in the early layers. To break this bottleneck, we propose GUIDE (Geometric Unrolling Inside MLLM Early-layers), a progressive geometric-prior injection framework. GUIDE performs multi-level sampling within the geometric encoder, comprehensively capturing multi-granularity features ranging from local edges to global topology. We then align and fuse these multi-level geometric priors step by step with the early layers of the MLLM. Building on this injection of multi-granularity geometric information, the design guides the model to progressively learn the 2D-to-3D transition. Furthermore, we introduce a context-aware gating mechanism that lets the model fetch the spatial cues required by the current semantics, maximizing the utilization of spatial priors while suppressing redundant geometric noise. Extensive experiments demonstrate that GUIDE significantly outperforms existing baselines on multiple complex spatial reasoning and perception tasks, establishing a novel paradigm for integrating 3D geometric priors into large models.
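The injection scheme described above can be sketched in a few lines: project each level of geometric features into the model's hidden space, compute a gate conditioned jointly on the current hidden states and the geometric cues, and add the gated result as a residual at successive early layers. Everything here — the shapes, the sigmoid-gated residual form, and the class/function names — is an illustrative assumption, not the authors' released implementation.

```python
# Hypothetical NumPy sketch of layer-wise geometric prior injection with
# context-aware gating, loosely following the GUIDE description above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GatedGeoInjector:
    """Injects one granularity level of geometric priors into one early layer."""

    def __init__(self, d_model, d_geo, seed=0):
        rng = np.random.default_rng(seed)
        # Projection from the geometric feature space into the hidden space.
        self.W_proj = rng.normal(0.0, 0.02, (d_geo, d_model))
        # Gate conditioned on both current semantics (h) and geometry (g).
        self.W_gate = rng.normal(0.0, 0.02, (d_model + d_geo, d_model))

    def __call__(self, h, g):
        # h: (tokens, d_model) hidden states; g: (tokens, d_geo) geo priors.
        gate = sigmoid(np.concatenate([h, g], axis=-1) @ self.W_gate)
        return h + gate * (g @ self.W_proj)  # gated residual injection

def inject_layerwise(h, geo_levels, injectors):
    # Progressively fuse multi-granularity priors (local edges -> global
    # topology) into successive early layers, one level per layer.
    for g, inj in zip(geo_levels, injectors):
        h = inj(h, g)
    return h

tokens, d_model, d_geo = 4, 8, 6
h0 = np.zeros((tokens, d_model))
# Three mock granularity levels of geometric features.
levels = [np.full((tokens, d_geo), s) for s in (0.5, 1.0, 2.0)]
injectors = [GatedGeoInjector(d_model, d_geo, seed=i) for i in range(3)]
out = inject_layerwise(h0, levels, injectors)
print(out.shape)  # (4, 8)
```

Because the gate sits between 0 and 1 per hidden dimension, irrelevant geometric channels are attenuated rather than hard-masked, which matches the paper's stated goal of suppressing redundant geometric noise without discarding priors outright.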
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Geometric Priors
Spatial Awareness
Early-layer Fusion
3D Perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometric Priors
Multimodal LLMs
Layer-wise Fusion
Multi-granularity Features
Context-aware Gating