URoPE: Universal Relative Position Embedding across Geometric Spaces

📅 2026-04-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing relative position encoding methods are tied to fixed geometric structures, such as 1D sequences or regular grids, and struggle to support vision tasks that require cross-view or cross-dimensional reasoning. This work proposes URoPE, a universal, parameter-free, intrinsics-aware extension of rotary position encoding (RoPE) that is invariant to the choice of global coordinate system. URoPE unifies relative positional modeling across 2D-2D, 2D-3D, and temporal settings by sampling 3D points along the camera ray of each key/value patch at predefined depth anchors and projecting them into the query image plane, where standard 2D RoPE is applied. It remains compatible with existing RoPE attention kernels and consistently improves Transformer performance on novel view synthesis, 3D object detection, object tracking, and depth estimation.

📝 Abstract
Relative position embedding has become a standard mechanism for encoding positional information in Transformers. However, existing formulations are typically limited to a fixed geometric space, namely 1D sequences or regular 2D/3D grids, which restricts their applicability to many computer vision tasks that require geometric reasoning across camera views or between 2D and 3D spaces. To address this limitation, we propose URoPE, a universal extension of Rotary Position Embedding (RoPE) to cross-view or cross-dimensional geometric spaces. For each key/value image patch, URoPE samples 3D points along the corresponding camera ray at predefined depth anchors and projects them into the query image plane. Standard 2D RoPE can then be applied using the projected pixel coordinates. URoPE is a parameter-free and intrinsics-aware relative position embedding that is invariant to the choice of global coordinate systems, while remaining fully compatible with existing RoPE-optimized attention kernels. We evaluate URoPE as a plug-in positional encoding for transformer architectures across a diverse set of tasks, including novel view synthesis, 3D object detection, object tracking, and depth estimation, covering 2D-2D, 2D-3D, and temporal scenarios. Experiments show that URoPE consistently improves the performance of transformer-based models across all tasks, demonstrating its effectiveness and generality for geometric reasoning. Our project website is: https://urope-pe.github.io/.
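The projection step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`project_to_query`, `rope_1d`, `rope_2d`), the pinhole camera model, the frequency schedule, and the choice to rotate half of the channels by each pixel axis are all assumptions made for illustration. The abstract does not say how the projections from different depth anchors are combined inside attention, so the sketch stops at producing one set of query-plane coordinates per anchor.

```python
import numpy as np

def project_to_query(uv_key, K_key, K_query, T_query_from_key, depths):
    """Project key/value pixel centers into the query image at each depth anchor.

    uv_key:            (N, 2) pixel coordinates in the key/value image
    K_key, K_query:    (3, 3) pinhole intrinsics (assumed camera model)
    T_query_from_key:  (4, 4) rigid transform from key camera frame to query camera frame
    depths:            (D,) predefined depth anchors (z-depth along each ray)
    returns:           (N, D, 2) pixel coordinates in the query image
    """
    n = uv_key.shape[0]
    pix_h = np.concatenate([uv_key, np.ones((n, 1))], axis=1)  # homogeneous pixels, (N, 3)
    rays = pix_h @ np.linalg.inv(K_key).T                      # ray directions with z = 1
    pts_key = rays[:, None, :] * depths[None, :, None]         # 3D samples, (N, D, 3)
    R, t = T_query_from_key[:3, :3], T_query_from_key[:3, 3]
    pts_query = pts_key @ R.T + t                              # samples in the query camera frame
    proj = pts_query @ K_query.T
    # Perspective divide; points at or behind the query camera are not handled here.
    return proj[..., :2] / proj[..., 2:3]

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of x by angles pos * freq (standard RoPE form)."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # assumed frequency schedule
    ang = pos[..., None] * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, uv):
    """Apply RoPE with the u coordinate on the first half of the channels and v on the second."""
    half = x.shape[-1] // 2
    return np.concatenate(
        [rope_1d(x[..., :half], uv[..., 0]), rope_1d(x[..., half:], uv[..., 1])],
        axis=-1,
    )

# Toy usage: two key patches, three depth anchors, query camera shifted along x.
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
uv = np.array([[10.0, 20.0], [30.0, 40.0]])
anchors = np.array([1.0, 2.0, 5.0])
T = np.eye(4)
T[0, 3] = 1.0
uv_in_query = project_to_query(uv, K, K, T, anchors)           # (2, 3, 2)
keys = np.random.default_rng(0).normal(size=(2, 8))
keys_rot = rope_2d(keys, uv_in_query[:, 0])                    # rotate keys using the first anchor
```

Only the relative pose between the two cameras enters through `T_query_from_key`, which is consistent with the abstract's claim of invariance to the global coordinate system. With identical cameras (identity transform, same intrinsics) every depth anchor projects back to the original pixel, so the sketch reduces to ordinary 2D RoPE on the image grid.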
Problem

Research questions and friction points this paper is trying to address.

relative position embedding
geometric reasoning
cross-view
cross-dimensional
Transformers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal Relative Position Embedding
Rotary Position Embedding
Cross-view Geometry
3D-aware Attention
Parameter-free Position Encoding