🤖 AI Summary
This study investigates the role and underlying mechanism of positional encoding in Vision Transformers (ViTs) with respect to spatial geometric reasoning. To address the limited understanding of the geometric significance of positional encoding in existing literature, we propose a token-level multi-view geometric consistency diagnostic framework, which for the first time demonstrates that positional encoding acts as a causal factor in shaping the spatial structure of ViT representations. Through comprehensive ablation and probing experiments across 14 foundation ViT models, we validate that positional encoding guides both local structural coherence and global layout organization, establishing its role as a critical geometric prior in ViT architectures.
📝 Abstract
This paper revisits the role of positional embeddings (PEs) within vision transformers (ViTs) from a geometric perspective. We show that PEs are not mere token indices but effectively function as geometric priors that shape the spatial structure of the representation. We introduce token-level diagnostics that measure how multi-view geometric consistency in ViT representations depends on consistent PEs. Through extensive experiments on 14 foundation ViT models, we reveal how PEs influence multi-view geometry and spatial reasoning. Our findings clarify the role of PEs as a causal mechanism that governs spatial structure in ViT representations. Our code is available at https://github.com/shijianjian/vit-geometry-probes
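To make the idea of a token-level consistency diagnostic concrete, the following is a minimal toy sketch, not the authors' method and not code from the linked repository. It assumes a token is simply the sum of a content vector and a positional embedding (no attention layers), and it compares the mean cosine similarity of corresponding tokens across two synthetic "views" when both views share the same PE assignment versus when the second view's PEs are permuted. All names (`mean_token_cosine`, `pe_consistency_gap`) and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def mean_token_cosine(a, b):
    """Mean cosine similarity between corresponding rows (tokens) of a and b."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a_n * b_n, axis=1)))


def pe_consistency_gap(num_tokens=16, dim=32, noise=0.1, seed=0):
    """Toy diagnostic (assumption: token = content + PE, no transformer layers).

    Returns the cross-view token similarity with consistent PEs and with
    PEs permuted in the second view.
    """
    rng = np.random.default_rng(seed)
    content = rng.normal(size=(num_tokens, dim))  # shared patch content
    pe = rng.normal(size=(num_tokens, dim))       # stand-in PE table

    # View A: content plus PE at the matching position.
    view_a = content + pe
    # View B content: the same patches, slightly perturbed.
    content_b = content + noise * rng.normal(size=(num_tokens, dim))

    # Consistent PEs: view B uses the same position for each token.
    consistent = mean_token_cosine(view_a, content_b + pe)

    # Inconsistent PEs: view B receives a permuted PE assignment.
    perm = rng.permutation(num_tokens)
    shuffled = mean_token_cosine(view_a, content_b + pe[perm])
    return consistent, shuffled


consistent, shuffled = pe_consistency_gap()
print(f"consistent PEs: {consistent:.3f}, shuffled PEs: {shuffled:.3f}")
```

In this toy setting the similarity drops when the PE assignment differs between views, which only illustrates the kind of dependence the abstract describes; a real diagnostic would operate on the intermediate token representations of an actual ViT.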