CamPilot: Improving Camera Control in Video Diffusion Model with Efficient Camera Reward Feedback

📅 2026-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video diffusion models suffer from inadequate camera alignment, high computational cost in reward calculation, and neglect of 3D geometric information. To address these limitations, this work proposes a camera-aware 3D Gaussian decoding mechanism that jointly decodes video latent representations and camera poses into a 3D Gaussian representation. An efficient reward signal is constructed through pixel-wise consistency between rendered and ground-truth views. The method further introduces a geometric warping metric to assess alignment quality and incorporates visibility-aware supervision to enhance reward accuracy. Evaluated on the RealEstate10K and WorldScore benchmarks, the proposed approach significantly improves alignment accuracy between generated videos and camera trajectories, demonstrating both effectiveness and computational efficiency.
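The visibility-aware supervision mentioned above rests on a geometric warp that marks which pixels are deterministically predictable across views. The paper's exact formulation is not reproduced on this page; the following is a minimal NumPy sketch under standard pinhole-camera assumptions, and all names (`visibility_mask`, `depth_src`, and the toy intrinsics) are illustrative:

```python
import numpy as np

def visibility_mask(depth_src, K, R, t):
    """Warp source-view pixels into a target view and mark the ones that
    land inside the target image with positive depth (deterministic regions).

    depth_src: (H, W) depth map of the source view
    K:         (3, 3) camera intrinsics (assumed shared by both views)
    R, t:      relative rotation (3, 3) and translation (3,) source -> target
    """
    H, W = depth_src.shape
    # Pixel grid in homogeneous coordinates, shape (3, H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    # Back-project to 3D points in the source camera frame.
    pts = np.linalg.inv(K) @ pix * depth_src.reshape(1, -1)
    # Rigid transform into the target camera frame, then project.
    pts_tgt = R @ pts + t.reshape(3, 1)
    proj = K @ pts_tgt
    z = proj[2]
    uv = proj[:2] / np.clip(z, 1e-6, None)
    # Visible = projects inside the target image with positive depth.
    inside = (uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H) & (z > 0)
    return inside.reshape(H, W)
```

Under the identity pose the warp is a no-op and every pixel is visible, while a large lateral translation pushes all reprojections out of frame; only pixels surviving this test would contribute to the reward.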

📝 Abstract
Recent advances in camera-controlled video diffusion models have significantly improved video-camera alignment. However, camera controllability remains limited. In this work, we build upon Reward Feedback Learning (ReFL) to further improve camera controllability. Directly borrowing existing ReFL approaches, however, faces several challenges. First, current reward models lack the capacity to assess video-camera alignment. Second, decoding latents into RGB videos for reward computation introduces substantial computational overhead. Third, 3D geometric information is typically neglected during video decoding. To address these limitations, we introduce an efficient camera-aware 3D decoder that decodes the video latent into a 3D representation for reward quantification. Specifically, the video latent and the camera pose are decoded into 3D Gaussians. In this process, the camera pose acts not only as an input but also as a projection parameter. Misalignment between the video latent and the camera pose causes geometric distortions in the 3D structure, resulting in blurry renderings. Based on this property, we explicitly optimize pixel-level consistency between rendered novel views and ground-truth ones as the reward. To accommodate the stochastic nature of video generation, we further introduce a visibility term that supervises only the deterministic regions derived via geometric warping. Extensive experiments on the RealEstate10K and WorldScore benchmarks demonstrate the effectiveness of our proposed method. Project page: \href{https://a-bigbao.github.io/CamPilot/}{CamPilot Page}.
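The pixel-level consistency reward described in the abstract can be sketched as a visibility-masked photometric error between a rendered novel view and its ground-truth counterpart. This is an illustrative NumPy sketch, not the authors' implementation: it assumes RGB arrays, a precomputed boolean visibility mask, and uses negative masked MSE as the reward:

```python
import numpy as np

def camera_reward(rendered, target, vis_mask):
    """Negative mean squared error between a rendered novel view and the
    ground-truth view, computed only over pixels the visibility mask marks
    as deterministic. Higher (closer to 0) means better alignment.

    rendered, target: (H, W, 3) images
    vis_mask:         (H, W) boolean visibility mask
    """
    mask = vis_mask.astype(bool)
    if not mask.any():
        return 0.0  # no deterministic pixels -> no supervision signal
    err = (rendered[mask] - target[mask]) ** 2
    return float(-err.mean())
```

Since a misaligned camera pose would distort the decoded 3D Gaussians and blur the renderings, this masked photometric term penalizes misalignment while leaving stochastic (occluded or out-of-view) regions unsupervised.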
Problem

Research questions and friction points this paper is trying to address.

camera controllability
video diffusion model
reward feedback
3D geometry
video-camera alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

camera control
video diffusion model
reward feedback learning
3D Gaussian rendering
geometric consistency