Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of pipelines that decouple camera parameter estimation from novel-view video synthesis: they often struggle when image observations are sparse or poses are ambiguous. The authors propose a video diffusion model that learns a joint distribution over video frames and camera trajectories, representing each camera as dense ray pixels (raxels) and denoising them jointly with the video frames. Joint denoising is carried out through a Decoupled Self-Cross Attention mechanism, and the authors observe that trajectory prediction requires far fewer denoising steps than video generation. Because the model can predict a trajectory from a video and also generate views conditioned on its own prediction, it supports a closed-loop self-consistency test, which shows that forward and inverse predictions agree. Results are reported on camera pose estimation and camera-controlled video generation.

📝 Abstract

Recovering camera parameters from images and rendering scenes from novel viewpoints have long been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task needs what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. We represent each camera as dense ray pixels (raxels) and denoise them jointly with video frames through a Decoupled Self-Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, jointly generating video and camera trajectory from input images, and generating video from input images along a target camera trajectory. Because the model can both predict trajectories from a video and generate views conditioned on its own predictions, we evaluate it through a closed-loop self-consistency test, demonstrating that its forward and inverse predictions agree. Notably, trajectory prediction requires far fewer denoising steps than video generation: even a few denoising steps suffice for self-consistency. We report results on pose estimation and camera-controlled video generation.
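
The abstract does not spell out how a raxel is computed. As a minimal sketch, assuming a pinhole camera and a 6-channel encoding (ray origin plus unit direction per pixel), a camera could be rasterized into an image-shaped ray map roughly as follows. The function name `camera_to_ray_map`, the 6-channel layout, and the world-to-camera convention are illustrative assumptions, not the paper's definition.

```python
import numpy as np


def camera_to_ray_map(K, R, t, height, width):
    """Rasterize one pinhole camera into a per-pixel ray map.

    Assumed conventions (hypothetical, not taken from the paper):
      K: (3, 3) intrinsics matrix.
      R: (3, 3), t: (3,) world-to-camera extrinsics, x_cam = R @ x_world + t.
    Returns an array of shape (height, width, 6) holding, per pixel,
    the ray origin (3) and the unit ray direction (3) in world coordinates.
    """
    # Pixel centers in homogeneous image coordinates, shape (H, W, 3).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)

    # Back-project to camera-frame directions: d_cam = K^{-1} p.
    dirs_cam = pixels @ np.linalg.inv(K).T

    # Rotate into the world frame: d_world = R^T d_cam (row-vector form).
    dirs_world = dirs_cam @ R
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Camera center in world coordinates, shared by every pixel.
    origin = -R.T @ t
    origins = np.broadcast_to(origin, dirs_world.shape)

    return np.concatenate([origins, dirs_world], axis=-1)
```

Stacking one such map per frame gives a camera trajectory the same spatial layout as the video, which is what would allow it to be denoised alongside the frames. Other ray encodings (for example Plücker coordinates) would fit the same pattern.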

Problem

Research questions and friction points this paper is trying to address.

camera trajectory estimation

novel view synthesis

video generation

pose ambiguity

sparse image coverage

Innovation

Methods, ideas, or system contributions that make the work stand out.

joint distribution

video diffusion model

camera trajectory