🤖 AI Summary
Current vision-language models often flatten images into one-dimensional patch sequences, discarding two-dimensional spatial structure and leaving them weak at understanding spatial relationships. This limitation hinders their use in embodied-intelligence tasks such as robotics. This work identifies the problem as a missing dimension in VLM design and investigates two remedies: image encoders trained with objectives other than CLIP-style contrastive pretraining, and explicit two-dimensional positional encodings that preserve spatial structure. Experiments show that these choices can improve performance on several spatial reasoning benchmarks.
📝 Abstract
Vision-language models (VLMs) have advanced rapidly, but their ability to capture spatial relationships remains a blind spot. Current VLMs are typically built on image encoders in the style of contrastive language-image pretraining (CLIP). Their training recipe often flattens images into 1D patch sequences, discarding the 2D structure necessary for spatial reasoning. We argue that this lack of spatial awareness is a missing dimension in VLM design and a bottleneck for applications requiring spatial grounding, such as robotics and embodied AI. To address this, we investigate (i) image encoders trained with alternative objectives and (ii) 2D positional encodings. Our experiments show that these architectural choices can lead to improved spatial reasoning on several benchmarks.
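To make the flattening issue concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`patch_grid_coords`, `sincos_1d`, `sincos_2d`) and the choice of a fixed sinusoidal encoding are assumptions made for illustration, since the abstract does not say which 2D positional encoding is used. The sketch shows how row-major flattening hides the grid layout, and how encoding row and column separately keeps it.

```python
import math


def patch_grid_coords(num_rows, num_cols):
    """Return the (row, col) of each patch in row-major (flattened) order."""
    return [(i // num_cols, i % num_cols) for i in range(num_rows * num_cols)]


def sincos_1d(position, dim):
    """Sinusoidal encoding of one scalar position into `dim` values (dim even)."""
    half = dim // 2
    freqs = [1.0 / (10000 ** (k / half)) for k in range(half)]
    sines = [math.sin(position * f) for f in freqs]
    cosines = [math.cos(position * f) for f in freqs]
    return sines + cosines


def sincos_2d(row, col, dim):
    """Hypothetical 2D encoding: first half encodes the row, second half the column."""
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    return sincos_1d(row, dim // 2) + sincos_1d(col, dim // 2)


# In a 4x4 grid, flattened indices 3 and 4 are neighbours in the 1D sequence
# but sit at opposite ends of adjacent rows; indices 0 and 4 are four steps
# apart in 1D but vertically adjacent in the image.
coords = patch_grid_coords(4, 4)
enc_top = sincos_2d(*coords[0], dim=8)    # patch at (0, 0)
enc_below = sincos_2d(*coords[4], dim=8)  # patch at (1, 0)

# The column half of the two encodings is identical, so "same column,
# next row" stays visible to the model even after flattening.
same_column = enc_top[4:] == enc_below[4:]
```

A purely 1D position index would give patches 0 and 4 unrelated encodings, whereas the factored row/column form lets them share the column component. Learned 2D embeddings or rotary variants are other possible choices; this sketch does not claim to match the paper's.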