Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models exhibit limitations in spatial understanding and viewpoint awareness, hindering accurate language-guided localization and 3D scene reasoning. This work integrates human-like spatial cognition mechanisms into vision-language modeling. Operating solely on monocular video input, the method jointly optimizes global 3D layout reconstruction and explicit situation modeling, augmenting both with lightweight camera pose priors from a pretrained 3D foundation model to enforce geometric consistency and metric-scale alignment. It achieves state-of-the-art performance on language-guided localization and significantly outperforms existing 2D and video-based methods on both situated and general 3D question-answering benchmarks.

📝 Abstract
Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm
Problem

Research questions and friction points this paper is trying to address.

spatial understanding
viewpoint-aware reasoning
3D reasoning
language-based localization
vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D reasoning
vision-language models
spatial understanding
monocular video
language-based localization