Cameras as Relative Positional Encoding

📅 2025-07-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Limited 3D perception in multi-view vision tasks often stems from insufficient modeling of camera geometry. The paper compares techniques for conditioning transformers on cameras — token-level raymap encodings and attention-level relative pose encodings — and proposes Projective Positional Encoding (PRoPE), a new relative encoding that captures the complete camera frustum, both intrinsics and extrinsics, as a relative positional encoding in self-attention. Because the conditioning is relative rather than absolute, PRoPE generalizes to out-of-distribution (OOD) sequence lengths and camera intrinsics. Experiments on feedforward novel view synthesis show that relative camera conditioning improves performance, with further gains from PRoPE across settings: shared and varying intrinsics, combined token- and attention-level conditioning, and OOD inputs. These benefits persist for stereo depth estimation, discriminative spatial cognition, and larger model sizes.

📝 Abstract
Transformers are increasingly prevalent for multi-view computer vision tasks, where geometric relationships between viewpoints are critical for 3D perception. To leverage these relationships, multi-view transformers must use camera geometry to ground visual tokens in 3D space. In this work, we compare techniques for conditioning transformers on cameras: token-level raymap encodings, attention-level relative pose encodings, and a new relative encoding we propose -- Projective Positional Encoding (PRoPE) -- that captures complete camera frustums, both intrinsics and extrinsics, as a relative positional encoding. Our experiments begin by showing how relative camera conditioning improves performance in feedforward novel view synthesis, with further gains from PRoPE. This holds across settings: scenes with both shared and varying intrinsics, when combining token- and attention-level conditioning, and for generalization to inputs with out-of-distribution sequence lengths and camera intrinsics. We then verify that these benefits persist for different tasks, stereo depth estimation and discriminative spatial cognition, as well as larger model sizes.
Problem

Research questions and friction points this paper is trying to address.

Enhancing multi-view transformers with camera geometry for 3D perception
Improving novel view synthesis via relative camera conditioning techniques
Validating camera encoding benefits across diverse tasks and model sizes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes Projective Positional Encoding (PRoPE)
Encodes camera frustums as relative positional encoding
Improves multi-view transformer performance across tasks
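The core idea above — encoding the relationship between two camera frustums, rather than each camera's absolute pose — can be sketched as computing a relative projective transform between pairs of views. The following is a minimal illustrative sketch, not the paper's implementation; the function names and the exact frustum parameterization are assumptions. The key property it demonstrates is that the relative transform is invariant to a global rigid transform of the world, which is what makes the encoding "relative":

```python
import numpy as np

def frustum_matrix(K, R, t):
    """Build a 4x4 projective matrix for a camera frustum from intrinsics
    K (3x3) and world-to-camera extrinsics R (3x3), t (3,).
    (Illustrative construction; PRoPE's exact parameterization may differ.)"""
    K4 = np.eye(4)
    K4[:3, :3] = K
    E = np.eye(4)           # world-to-camera rigid transform
    E[:3, :3] = R
    E[:3, 3] = t
    return K4 @ E           # maps world points to projective image coords

def relative_projective_transform(P_i, P_j):
    """Relative transform mapping view j's projective space into view i's.
    This is the pairwise quantity an attention-level relative encoding
    could condition on: it depends only on relative geometry, so applying
    the same rigid transform to every camera leaves it unchanged."""
    return P_i @ np.linalg.inv(P_j)
```

Because `P @ G_inv` for a shared world transform `G` cancels inside `P_i @ inv(P_j)`, all pairwise encodings stay fixed when the whole scene is rigidly moved — absolute pose or raymap encodings do not share this property, which is the motivation for relative conditioning.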