Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

📅 2026-05-28
🤖 AI Summary
This study investigates whether vision-language models (VLMs) have structured 3D spatial understanding or rely on statistical shortcuts in natural images. We introduce a representation-level analysis framework built on minimal contrastive pairs, which reveals a systematic bias across model families: models conflate vertical image position with physical distance, a “vertical-distance entanglement” that mirrors the perspective bias of natural photographs. The bias intensifies as training data scales, even while overall benchmark accuracy improves, and models with similar benchmark scores can have different internal representations. A synthetic benchmark, SpatialTunnel, removes correlations common in natural images and confirms that the entanglement is intrinsic to the models. Models with well-separated spatial axes are more robust across diverse spatial reasoning benchmarks.
📝 Abstract
Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce a representation-level analysis framework that constructs minimal contrastive pairs to measure how spatial axes are organized and disentangled within VLM embeddings. Our analysis across multiple model families reveals a consistent vertical-distance entanglement: models conflate vertical image position with distance, mirroring the perspective bias of natural photographs. This bias produces a significant accuracy gap between perspective-consistent and counter-heuristic examples, and intensifies under data scaling even as overall benchmark accuracy improves. We further show that models with similar benchmark scores can exhibit different internal representations, and that these differences predict accuracy and robustness across diverse spatial reasoning benchmarks. To isolate this bias from evaluation-set skew, we introduce SpatialTunnel, a synthetic benchmark designed to expose spatial shortcut biases by removing common correlations present in natural images. Experiments confirm that the entanglement is model-intrinsic, and that models with well-separated spatial axes exhibit greater robustness, suggesting that well-structured spatial representations lead to more reliable spatial reasoning across diverse benchmarks. Code and benchmark are available on the project page: https://cheolhong0916.github.io/whyfarlooksup.github.io/.
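The abstract does not spell out how entanglement or the accuracy gap is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each minimal contrastive pair yields two embedding vectors that differ in a single spatial attribute, estimates an axis as the normalized mean difference across pairs, and scores entanglement as the cosine similarity between two such axes. All function names (`axis_direction`, `axis_entanglement`, `accuracy_gap`) and the toy data are hypothetical.

```python
import numpy as np


def axis_direction(emb_a, emb_b):
    """Estimate a spatial axis as the unit-normalized mean difference
    between the two sides of a set of contrastive pairs.

    emb_a, emb_b: arrays of shape (n_pairs, dim); row i of each array
    is assumed to differ only in one spatial attribute.
    """
    diff = np.asarray(emb_b, dtype=float) - np.asarray(emb_a, dtype=float)
    direction = diff.mean(axis=0)
    norm = np.linalg.norm(direction)
    if norm == 0:
        raise ValueError("contrastive pairs have no mean difference")
    return direction / norm


def axis_entanglement(dir_u, dir_v):
    """Cosine similarity between two unit axes.

    Values near 0 suggest separated axes; values near 1 or -1 suggest
    the two attributes share a direction in embedding space.
    """
    return float(np.dot(dir_u, dir_v))


def accuracy_gap(correct, is_consistent):
    """Accuracy on perspective-consistent examples minus accuracy on
    counter-heuristic examples."""
    correct = np.asarray(correct, dtype=bool)
    is_consistent = np.asarray(is_consistent, dtype=bool)
    return float(correct[is_consistent].mean() - correct[~is_consistent].mean())


# Toy data (hypothetical): 8-dimensional embeddings for 50 pairs.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 8))
e0, e1 = np.eye(8)[0], np.eye(8)[1]

# "Vertical" pairs shift along e0; "distance" pairs shift along a
# direction that partly overlaps e0, imitating an entangled model.
vertical_axis = axis_direction(base, base + e0)
distance_axis = axis_direction(base, base + (e0 + e1) / np.sqrt(2))
entanglement = axis_entanglement(vertical_axis, distance_axis)

# Toy correctness flags: two consistent examples, two counter-heuristic.
gap = accuracy_gap([1, 1, 1, 0], [True, True, False, False])
```

In this toy setup the distance axis is built to overlap the vertical axis, so the score lands well above zero. With real model embeddings, the choice of layer, pooling, and pair construction would all affect the result and would need to follow the released code.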
Problem

Research questions and friction points this paper is trying to address.

spatial reasoning
vision-language models
representation bias
perspective bias
shortcut learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial representation
vision-language models
representation disentanglement
synthetic benchmark
perspective bias