CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing camera-conditioned video generation methods struggle to model camera motion and scene geometry robustly under non-pinhole cameras such as wide-angle and fisheye lenses, which limits camera control under the Unified Camera Model. This work proposes CRePE, a positional encoding that represents each image token as a depth-aware positional distribution along its source ray and captures the projected-path geometry induced by non-pinhole cameras. CRePE is added to a frozen video DiT through a Geometric Attention Adapter that is stabilized with pseudo supervision from a monocular geometry foundation model, and Radial MixForcing extends the same pathway to external radial-map control for geometry-conditioned generation. Experiments report more stable camera control and gains on several geometry-aware and perceptual-quality metrics while remaining competitive on video-quality metrics; in controlled ablations, CRePE achieves a better overall average rank than a RayRoPE-style endpoint PE baseline.

📝 Abstract

Camera-conditioned video generation requires positional encoding that remains reliable under changes in camera motion, lens configuration, and scene structure. However, existing attention-level camera encodings either provide ray-only camera signals or rely on pinhole camera geometry, limiting their applicability to general camera control under the Unified Camera Model (UCM), including wide-angle and fisheye lenses. To address this limitation, we propose Curved Ray Expectation Positional Encoding (CRePE). CRePE represents each image token as a depth-aware positional distribution along its source ray, providing a UCM-compatible positional encoding that captures the projected-path geometry induced by wide-angle and fisheye cameras. CRePE is implemented through a Geometric Attention Adapter added to frozen video DiTs, injecting token-wise scene-distance information into selected attention layers and stabilizing it with pseudo supervision from a monocular geometry foundation model. This design leads to more stable camera control and improves several geometry-aware and perceptual-quality metrics, while remaining competitive on video-quality metrics. In controlled positional-encoding ablations, CRePE achieves a better overall average rank than a RayRoPE-style endpoint PE baseline, supporting the effectiveness of UCM-aware projected-path integration across diverse camera models. Furthermore, by extending the same positional-encoding pathway to external geometry control through Radial MixForcing, CRePE supports external radial-map control for scene-geometry-conditioned generation and source-video motion transfer beyond camera control.
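
To make the projected-path idea concrete, here is a minimal sketch. It is not the paper's implementation. It assumes the common single-parameter form of the Unified Camera Model (parameter xi, where xi = 0 gives a pinhole camera) with 0 <= xi <= 1. The function names, the camera dictionaries, and the discrete distance weights are hypothetical choices for this example. The code unprojects a source pixel to a unit ray, samples points at several candidate scene distances along that ray, projects them into a second camera, and averages the resulting pixel positions under the weights.

```python
import math


def ucm_project(point, fx, fy, cx, cy, xi):
    """Project a camera-frame 3D point to pixel coordinates (assumed UCM form)."""
    x, y, z = point
    d = math.sqrt(x * x + y * y + z * z)
    denom = z + xi * d  # with xi = 0 this is the pinhole denominator z
    if denom <= 1e-9:
        raise ValueError("point lies outside the valid projection region")
    return (fx * x / denom + cx, fy * y / denom + cy)


def ucm_unproject(pixel, fx, fy, cx, cy, xi):
    """Map a pixel to a unit-length ray direction in the camera frame."""
    u, v = pixel
    mx = (u - cx) / fx
    my = (v - cy) / fy
    r2 = mx * mx + my * my
    disc = 1.0 + (1.0 - xi * xi) * r2
    if disc < 0.0:
        raise ValueError("pixel has no valid ray for this xi")
    factor = (xi + math.sqrt(disc)) / (1.0 + r2)
    return (factor * mx, factor * my, factor - xi)


def expected_projected_pixel(pixel, distances, weights, src_cam, tgt_cam, R, t):
    """Weighted mean of the target-view pixels of points sampled along a source ray.

    `distances` are candidate scene distances along the ray and `weights` their
    (unnormalized) probabilities; both are illustrative inputs, not paper values.
    R (3x3 nested lists) and t (length 3) map source-frame points to the target frame.
    """
    ray = ucm_unproject(pixel, **src_cam)
    total = float(sum(weights))
    exp_u = exp_v = 0.0
    for dist, w in zip(distances, weights):
        p_src = [dist * c for c in ray]
        p_tgt = [sum(R[i][j] * p_src[j] for j in range(3)) + t[i] for i in range(3)]
        u, v = ucm_project(p_tgt, **tgt_cam)
        exp_u += w * u
        exp_v += w * v
    return (exp_u / total, exp_v / total)


# Hypothetical fisheye-like camera and a small sideways translation.
cam = {"fx": 300.0, "fy": 300.0, "cx": 320.0, "cy": 240.0, "xi": 0.8}
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
source_pixel = (400.0, 300.0)
distances = [1.0, 2.0, 4.0, 8.0]
weights = [0.1, 0.4, 0.4, 0.1]

same_view = expected_projected_pixel(
    source_pixel, distances, weights, cam, cam, identity, [0.0, 0.0, 0.0]
)
shifted_view = expected_projected_pixel(
    source_pixel, distances, weights, cam, cam, identity, [0.5, 0.0, 0.0]
)
```

With an identity relative pose, every sample along the ray lands on the source pixel, so the expectation adds nothing. Once the target camera moves, the samples spread out along a path in the target image, and under a non-pinhole model that path is generally not a straight line. This is why an encoding built only from ray endpoints can differ from one that integrates over the path.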

Problem

Research questions and friction points this paper is trying to address.

camera-conditioned video generation

positional encoding

Unified Camera Model

wide-angle lenses

fisheye lenses

Innovation

Methods, ideas, or system contributions that make the work stand out.

Curved Ray Expectation Positional Encoding

Unified Camera Model

Geometric Attention Adapter