Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes

📅 2025-09-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language models (VLMs) show significant limitations in understanding 3D spatial relations from ego-centric, multi-view observations. To address this, we introduce Ego3D-Bench, a benchmark for 3D spatial reasoning on ego-centric, multi-view outdoor data, and propose Ego3D-VLM, a post-training framework that enhances the 3D reasoning of VLMs. Ego3D-VLM estimates global 3D coordinates from the multi-view inputs and builds a cognitive map from them to support spatial inference across views; its modular design allows integration with any existing VLM. Ego3D-Bench comprises over 8,600 QA pairs created with significant involvement from human annotators, and 16 state-of-the-art VLMs are benchmarked on it. Ego3D-VLM yields a 12% average improvement on multiple-choice QA and a 56% average improvement on absolute distance estimation, narrowing the gap to human-level performance.

📝 Abstract

Understanding 3D spatial relationships remains a major limitation of current Vision-Language Models (VLMs). Prior work has addressed this issue by creating spatial question-answering (QA) datasets based on single images or indoor videos. However, real-world embodied AI agents such as robots and self-driving cars typically rely on ego-centric, multi-view observations. To this end, we introduce Ego3D-Bench, a new benchmark designed to evaluate the spatial reasoning abilities of VLMs using ego-centric, multi-view outdoor data. Ego3D-Bench comprises over 8,600 QA pairs, created with significant involvement from human annotators to ensure quality and diversity. We benchmark 16 SOTA VLMs, including GPT-4o, Gemini1.5-Pro, InternVL3, and Qwen2.5-VL. Our results reveal a notable gap between human-level scores and VLM performance, showing that current VLMs still fall short of human-level spatial understanding. To bridge this gap, we propose Ego3D-VLM, a post-training framework that enhances the 3D spatial reasoning of VLMs. Ego3D-VLM generates a cognitive map based on estimated global 3D coordinates, resulting in a 12% average improvement on multi-choice QA and a 56% average improvement on absolute distance estimation. Ego3D-VLM is modular and can be integrated with any existing VLM. Together, Ego3D-Bench and Ego3D-VLM offer valuable tools for advancing toward human-level spatial understanding in real-world, multi-view environments.
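The abstract does not spell out how the cognitive map is built, so the following is only a minimal sketch of the general idea: per-view object estimates are placed in one shared ego-centred coordinate frame and then written out as text. Everything here is an assumption made for illustration, including the `Detection` fields, the 2D ground-plane simplification, the angle conventions, and the text format of the map. It is not the authors' implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    """One object seen in one camera view (all fields are hypothetical)."""
    label: str           # object name, e.g. "car"
    view_yaw_deg: float  # camera yaw relative to ego forward, counter-clockwise positive
    bearing_deg: float   # object angle from the camera's optical axis, counter-clockwise positive
    depth_m: float       # estimated ground-plane distance from the ego position, in meters


def to_ego_xy(det: Detection) -> tuple[float, float]:
    """Convert a per-view detection to (x forward, y left) in the shared ego frame."""
    angle = math.radians(det.view_yaw_deg + det.bearing_deg)
    return det.depth_m * math.cos(angle), det.depth_m * math.sin(angle)


def build_cognitive_map(dets: list[Detection]) -> str:
    """Render all detections as one text map with the ego at the origin."""
    lines = ["ego: (0.0, 0.0)"]
    for det in dets:
        x, y = to_ego_xy(det)
        lines.append(f"{det.label}: ({x:.1f}, {y:.1f})")
    return "\n".join(lines)


def distance_between(a: Detection, b: Detection) -> float:
    """Euclidean distance between two objects, possibly from different views."""
    ax, ay = to_ego_xy(a)
    bx, by = to_ego_xy(b)
    return math.hypot(ax - bx, ay - by)


if __name__ == "__main__":
    front_car = Detection("car", view_yaw_deg=0.0, bearing_deg=0.0, depth_m=10.0)
    left_person = Detection("pedestrian", view_yaw_deg=90.0, bearing_deg=0.0, depth_m=5.0)
    print(build_cognitive_map([front_car, left_person]))
    print(f"{distance_between(front_car, left_person):.2f} m")
```

Once both objects sit in the same frame, a cross-view question such as "how far is the pedestrian in the left camera from the car in the front camera" reduces to a single distance computation, which is the kind of inference a shared map is meant to make easy.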

Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs' spatial reasoning in ego-centric multi-view scenes

Bridging the performance gap between VLMs and human spatial understanding

Enhancing 3D spatial reasoning through a modular post-training framework

Innovation

Methods, ideas, or system contributions that make the work stand out.

Ego3D-Bench benchmark for multi-view spatial evaluation

Ego3D-VLM framework with cognitive map generation

Modular post-training integration with existing VLMs
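One way to picture the modular integration is as a thin wrapper that adds a textual cognitive map to the prompt of an otherwise unchanged model. The sketch below assumes the VLM can be treated as a plain text-in, text-out callable; the function name `with_cognitive_map`, the prompt layout, and the stand-in model are all hypothetical, and real VLM APIs also take images, which are omitted here for brevity.

```python
from typing import Callable

# Hypothetical model interface: takes a prompt string, returns an answer string.
VLM = Callable[[str], str]


def with_cognitive_map(vlm: VLM, map_text: str) -> VLM:
    """Wrap any prompt-to-answer callable so each question is preceded by the map."""
    def ask(question: str) -> str:
        prompt = (
            "Cognitive map (ego frame, meters):\n"
            f"{map_text}\n\n"
            f"Question: {question}"
        )
        return vlm(prompt)
    return ask


def echo_vlm(prompt: str) -> str:
    """Stand-in for a real model: returns the prompt so the wrapper can be inspected."""
    return prompt


if __name__ == "__main__":
    ask = with_cognitive_map(echo_vlm, "ego: (0.0, 0.0)\ncar: (10.0, 0.0)")
    print(ask("How far is the car from the ego vehicle?"))
```

Because the wrapper only touches the prompt, swapping in a different model means passing a different callable, with no change to the model's weights.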
