Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language models often struggle with 3D spatial reasoning: fine-tuning on 3D visual question answering (VQA) data can overfit to dataset-specific biases, and adding specialized 3D visual encoders is inflexible. This work proposes GASP (Geometric-Aware Spatial Priors), a framework that injects geometric priors directly into the LLM's Transformer layers without relying on 3D VQA data. A small correspondence head, applied as deep supervision across all layers, is trained with a contrastive loss on ground-truth point correspondences and with depth consistency supervision, both derived from ground-truth geometry in large-scale video scenes. Experiments report peak layer-wise correspondence accuracy above 70% and temporal robustness above 85%, where baselines stay below 5%, along with gains of +18.2% on All-Angles Bench and +29.0% on VSI-Bench.

📝 Abstract

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective leveraging ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while depth consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that standard VLMs' internal correspondence matching accuracy is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, boosting peak layer-wise correspondence to over 70% and maintaining over 85% temporal robustness while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks, including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.
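
To make the contrastive objective and the correspondence diagnostic more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the loss form, so a symmetric InfoNCE loss is assumed, and the function names, the temperature value, and the nearest-neighbor accuracy metric are all hypothetical choices for illustration. Each row of `feat_a` is taken to be the correspondence-head output for a point in one view, and the same row of `feat_b` is its ground-truth match in another view. The depth consistency term is omitted because the source does not specify it.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce_loss(feat_a, feat_b, temperature=0.07):
    """Symmetric InfoNCE loss (assumed form, not taken from the paper).

    Row i of feat_a and row i of feat_b are a ground-truth corresponding
    point pair; every other row acts as a negative.
    """
    a = l2_normalize(feat_a)
    b = l2_normalize(feat_b)
    logits = a @ b.T / temperature  # (N, N) similarity matrix

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct match for row i is column i.
        return -np.mean(np.diag(log_prob))

    # Average the a->b and b->a directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def correspondence_accuracy(feat_a, feat_b):
    """Fraction of points in view A whose nearest neighbor in view B
    (by cosine similarity) is the ground-truth match."""
    sim = l2_normalize(feat_a) @ l2_normalize(feat_b).T
    predicted = np.argmax(sim, axis=1)
    return float(np.mean(predicted == np.arange(len(feat_a))))
```

Under these assumptions, applying `info_nce_loss` to the features from every transformer layer and summing the results would be one way to realize supervision "across all layers", and `correspondence_accuracy` evaluated per layer would give a layer-wise diagnostic of the kind the abstract describes.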

Problem

Research questions and friction points this paper is trying to address.

3D spatial reasoning

Vision-Language Models

geometric priors

visual question answering

spatial understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometric Priors

Vision-Language Models

3D Spatial Reasoning