Geometry without Position? When Positional Embeddings Help and Hurt Spatial Reasoning

📅 2026-01-29
🤖 AI Summary
This study investigates the role of positional embeddings in Vision Transformers (ViTs) for spatial and geometric reasoning. Because the geometric significance of positional embeddings remains poorly understood in the existing literature, the authors propose a token-level diagnostic framework for multi-view geometric consistency and use it to show that positional embeddings act as a causal factor in shaping the spatial structure of ViT representations. Ablation and probing experiments across 14 foundation ViT models indicate that positional embeddings guide both local structural coherence and global layout organization, supporting their role as a geometric prior in ViT architectures.

📝 Abstract
This paper revisits the role of positional embeddings (PEs) within vision transformers (ViTs) from a geometric perspective. We show that PEs are not mere token indices but effectively function as geometric priors that shape the spatial structure of the representation. We introduce token-level diagnostics that measure how multi-view geometric consistency in ViT representations depends on consistent PEs. Through extensive experiments on 14 foundation ViT models, we reveal how PEs influence multi-view geometry and spatial reasoning. Our findings clarify the role of PEs as a causal mechanism that governs spatial structure in ViT representations. Our code is available at https://github.com/shijianjian/vit-geometry-probes
Problem

Research questions and friction points this paper is trying to address.

positional embeddings
vision transformers
spatial reasoning
multi-view geometry
geometric priors
Innovation

Methods, ideas, or system contributions that make the work stand out.

positional embeddings
vision transformers
geometric priors
spatial reasoning
multi-view consistency
Jian Shi
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Michael Birsak
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Wenqing Cui
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Zhenyu Li
PhD student at KAUST
Computer Vision
Peter Wonka
King Abdullah University of Science and Technology (KAUST)
Deep Learning, Computer Vision, Computer Graphics, Machine Learning, Remote Sensing