RayRoPE: Projective Ray Positional Encoding for Multi-view Attention

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing positional encoding methods struggle to simultaneously satisfy uniqueness, SE(3) invariance, multi-frequency similarity, and geometric adaptability in multi-view Transformers. This work proposes RayRoPE, the first approach that leverages predicted 3D points along rays to construct geometry-aware positional encodings. By formulating positional embeddings in the query coordinate frame using projected coordinates, RayRoPE establishes an SE(3)-invariant multi-frequency attention mechanism. The method enables analytically computing expected encodings under depth uncertainty and seamlessly integrates RGB-D inputs. Evaluated on the CO3D dataset, RayRoPE significantly outperforms existing approaches in both novel view synthesis and stereo depth estimation, achieving a 15% relative improvement in the LPIPS metric.

📝 Abstract
We study positional encodings for multi-view transformers that process tokens from a set of posed input images, and seek a mechanism that encodes patches uniquely, allows SE(3)-invariant attention with multi-frequency similarity, and can be adaptive to the geometry of the underlying scene. We find that prior (absolute or relative) encoding schemes for multi-view attention do not meet the above desiderata, and present RayRoPE to address this gap. RayRoPE represents patch positions based on associated rays but leverages a predicted point along the ray instead of the direction for a geometry-aware encoding. To achieve SE(3) invariance, RayRoPE computes query-frame projective coordinates for computing multi-frequency similarity. Lastly, as the 'predicted' 3D point along a ray may not be precise, RayRoPE presents a mechanism to analytically compute the expected position encoding under uncertainty. We validate RayRoPE on the tasks of novel-view synthesis and stereo depth estimation and show that it consistently improves over alternate position encoding schemes (e.g. 15% relative improvement on LPIPS in CO3D). We also show that RayRoPE can seamlessly incorporate RGB-D input, resulting in even larger gains over alternatives that cannot positionally encode this information.
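To make the core idea concrete, the sketch below illustrates the general pattern the abstract describes: express each patch's predicted 3D point in the query camera's frame, then apply a RoPE-style multi-frequency rotation to the token features so that attention scores depend on relative geometry. This is a minimal illustrative sketch, not the paper's exact formulation; the function names, the frequency schedule, and the choice of plain (rather than projective) query-frame coordinates are all assumptions for illustration.

```python
import numpy as np

def rotary_angles(coords, num_freqs=4, base=100.0):
    # coords: (N, 3) 3D points expressed in the query camera frame.
    # Returns per-point rotation angles at multiple frequencies,
    # shape (N, 3 * num_freqs). The geometric base is an assumption.
    freqs = base ** (-np.arange(num_freqs) / num_freqs)          # (F,)
    return (coords[:, :, None] * freqs[None, None, :]).reshape(len(coords), -1)

def apply_rope(x, angles):
    # Rotate consecutive feature pairs of x by the given angles (RoPE-style).
    # x: (N, D) with D == 2 * angles.shape[1].
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def ray_rope_encode(points_world, R_qw, t_qw, feats):
    # Transform predicted world-space points into the query frame
    # (R_qw, t_qw map world -> query camera), then rotate features
    # by multi-frequency angles of those query-frame coordinates.
    pts_q = points_world @ R_qw.T + t_qw                         # (N, 3)
    return apply_rope(feats, rotary_angles(pts_q))
```

Because each feature pair of query and key is rotated by the same per-position angle, their inner product depends only on angle differences, i.e. on relative positions in the query frame, which is what makes a rotary scheme attractive for SE(3)-invariant attention.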
Problem

Research questions and friction points this paper is trying to address.

positional encoding
multi-view attention
SE(3) invariance
geometry-aware
transformer
Innovation

Methods, ideas, or system contributions that make the work stand out.

RayRoPE
multi-view attention
SE(3) invariance
projective positional encoding
geometry-aware encoding