Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the significant limitations of current vision-language models (VLMs) in 3D spatial reasoning, particularly their underperformance in relative camera pose estimation compared to conventional 2D heuristic methods. To systematically evaluate VLMs' 3D understanding, the authors introduce VRRPI-Bench, the first benchmark based on real-world first-person videos, along with a diagnostic dataset, VRRPI-Diag, designed to disentangle individual motion degrees of freedom. Experimental results reveal that state-of-the-art VLMs, including GPT-5, fall substantially short of geometric baselines (0.64 vs. 0.97) and of human performance (0.92) on this task. Furthermore, these models reason inconsistently across multiple images, with the best model reaching only 59.7% when integrating spatial cues across frames, highlighting fundamental deficiencies in handling depth variation and rotation about the optical axis.
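To ground the comparison, the sketch below shows the kind of classic geometric baseline the VLMs are measured against: two-view relative pose via feature matching and essential-matrix decomposition. It is a minimal illustration assuming calibrated intrinsics `K` and OpenCV's standard SIFT/RANSAC tooling; the function name `estimate_relative_pose`, the 0.75 ratio-test threshold, and the RANSAC parameters are illustrative choices, not the paper's exact baseline.

```python
import cv2
import numpy as np

def estimate_relative_pose(img1, img2, K):
    """Return (R, t): pose of camera 2 relative to camera 1 (t up to scale).

    img1, img2: grayscale images; K: 3x3 camera intrinsics matrix.
    """
    # Detect and describe local features in both images.
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Lowe's ratio test to keep only distinctive correspondences.
    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

    # Essential matrix via RANSAC, then a cheirality check picks (R, t).
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t  # t is a unit direction; absolute scale is unobservable
```

A pipeline like this needs no training and exploits epipolar geometry directly, which is one plausible reading of why such baselines reach 0.97 where VLMs relying on 2D appearance cues do not.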

📝 Abstract
Vision-Language Models (VLMs) perform well in 2D perception and semantic reasoning, yet their understanding of 3D spatial structure remains limited. We investigate this gap using relative camera pose estimation (RCPE), a fundamental vision task that requires inferring relative camera translation and rotation from a pair of images. We introduce VRRPI-Bench, a benchmark derived from unlabeled egocentric videos with verbalized annotations of relative camera motion, reflecting realistic scenarios with simultaneous translation and rotation around a shared object. We further propose VRRPI-Diag, a diagnostic benchmark that isolates individual motion degrees of freedom. Despite the simplicity of RCPE, most VLMs fail to generalize beyond shallow 2D heuristics, particularly for depth changes and roll transformations about the optical axis. Even state-of-the-art models such as GPT-5 ($0.64$) fall short of classic geometric baselines ($0.97$) and human performance ($0.92$). Moreover, VLMs exhibit difficulty in multi-image reasoning, with inconsistent performance (best $59.7\%$) when integrating spatial cues across frames. Our findings reveal limitations in grounding VLMs in 3D and multi-view spatial reasoning.
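To illustrate what "verbalized annotations of relative camera motion" could look like, here is a hypothetical sketch that quantizes a relative pose between two frames into coarse verbal labels. The label set, thresholds, OpenCV-style axis convention (x right, y down, z forward), and the helper names `relative_pose` and `verbalize` are all illustrative assumptions, not VRRPI-Bench's actual annotation scheme.

```python
import numpy as np

def relative_pose(pose_a, pose_b):
    """Relative motion of camera B expressed in camera A's frame.

    pose_a, pose_b: 4x4 camera-to-world matrices.
    """
    rel = np.linalg.inv(pose_a) @ pose_b
    return rel[:3, :3], rel[:3, 3]

def verbalize(R, t, t_eps=0.05, roll_eps_deg=5.0):
    """Quantize a relative pose into coarse verbal motion labels."""
    labels = []
    # Assumed camera axes: x right, y down, z along the optical axis.
    if abs(t[0]) > t_eps:
        labels.append("moved right" if t[0] > 0 else "moved left")
    if abs(t[2]) > t_eps:
        labels.append("moved forward" if t[2] > 0 else "moved backward")
    # Roll: rotation about the optical (z) axis, from a ZYX Euler split.
    roll = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    if abs(roll) > roll_eps_deg:
        labels.append(f"rolled about the optical axis ({roll:+.1f} deg)")
    return labels or ["approximately stationary"]
```

Depth changes (forward/backward translation) and roll are exactly the degrees of freedom this kind of quantization exposes, and the ones the paper identifies as the hardest for current VLMs.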
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Relative Camera Pose Estimation
3D Spatial Reasoning
Multi-view Reasoning
Spatial Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Relative Camera Pose Estimation
Vision-Language Models
3D Spatial Reasoning
Multi-view Understanding
VRRPI-Bench