Geometry without Position? When Positional Embeddings Help and Hurt Spatial Reasoning

📅 2026-01-29

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This study examines the role of positional embeddings (PEs) in Vision Transformers (ViTs) with respect to spatial and geometric reasoning. To address the limited understanding of the geometric significance of PEs in the existing literature, the authors propose a token-level diagnostic framework for multi-view geometric consistency and use it to show that PEs act as a causal factor in shaping the spatial structure of ViT representations. Ablation and probing experiments across 14 foundation ViT models indicate that PEs guide both local structural coherence and global layout organization, supporting their role as a geometric prior in ViT architectures.


📝 Abstract

This paper revisits the role of positional embeddings (PEs) within vision transformers (ViTs) from a geometric perspective. We show that PEs are not mere token indices but effectively function as geometric priors that shape the spatial structure of the representation. We introduce token-level diagnostics that measure how multi-view geometric consistency in ViT representations depends on consistent PEs. Through extensive experiments on 14 foundation ViT models, we reveal how PEs influence multi-view geometry and spatial reasoning. Our findings clarify the role of PEs as a causal mechanism that governs spatial structure in ViT representations. Our code is available at https://github.com/shijianjian/vit-geometry-probes
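To make the idea of a token-level consistency diagnostic concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy setting in which each token is the sum of a content vector and a positional embedding, with random arrays standing in for real ViT features. The function `token_consistency`, the variable names, and the use of mean cosine similarity over matched tokens are all hypothetical choices made for illustration; the metric actually used in the paper may differ.

```python
import numpy as np

def token_consistency(feats_a, feats_b, matches):
    """Mean cosine similarity between matched tokens of two views.

    feats_a, feats_b: (N, D) token features for view A and view B.
    matches: (M, 2) integer pairs (index in A, index in B) that are
             assumed to observe the same scene point.
    """
    a = feats_a[matches[:, 0]]
    b = feats_b[matches[:, 1]]
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))

# Toy data (assumption): random vectors stand in for ViT token features.
rng = np.random.default_rng(0)
n_tokens, dim = 16, 32
content = rng.normal(size=(n_tokens, dim))   # per-token appearance
pos = rng.normal(size=(n_tokens, dim))       # per-token positional embedding
noise = 0.1 * rng.normal(size=(n_tokens, dim))  # small appearance change in view B

# Token i in view A is assumed to correspond to token i in view B.
matches = np.stack([np.arange(n_tokens), np.arange(n_tokens)], axis=1)

view_a = content + pos
# "Consistent" PEs: matched tokens receive the same positional embedding.
view_b_consistent = content + noise + pos
# "Inconsistent" PEs: positional embeddings are shifted by one slot.
view_b_shifted = content + noise + np.roll(pos, 1, axis=0)

score_consistent = token_consistency(view_a, view_b_consistent, matches)
score_shifted = token_consistency(view_a, view_b_shifted, matches)
print(f"consistent PEs: {score_consistent:.3f}")
print(f"shifted PEs:    {score_shifted:.3f}")
```

In this toy setting the score drops when the positional embeddings of matched tokens disagree, which is the kind of dependence on PE consistency the abstract describes. With a real model, the random arrays would be replaced by token features extracted from two views of the same scene.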

Problem

Research questions and friction points this paper is trying to address.

positional embeddings

vision transformers

spatial reasoning

multi-view geometry

geometric priors

Innovation

Methods, ideas, or system contributions that make the work stand out.

positional embeddings

vision transformers

geometric priors