Taming Camera-Controlled Video Generation with Verifiable Geometry Reward

📅 2025-12-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing supervised fine-tuning (SFT) methods for camera-controlled video generation suffer from insufficient trajectory accuracy, while online reinforcement learning (RL) post-training remains unexplored in this domain. Method: We propose the first online RL post-training framework tailored for camera control. Its core innovations are: (1) a verifiable geometric reward function that leverages estimated 3D camera trajectories and segment-level relative pose alignment to provide dense, interpretable frame-wise feedback; and (2) a high-quality camera-control video dataset spanning broad motion ranges and complex dynamic scenes. Contribution/Results: Without modifying the model architecture, our method applies RL to optimize pretrained video diffusion models. It significantly outperforms SFT baselines in camera trajectory accuracy, geometric consistency, and visual quality—demonstrating the effectiveness and generalizability of geometry-guided RL for controllable video generation.

📝 Abstract
Recent advances in video diffusion models have remarkably improved camera-controlled video generation, but most methods rely solely on supervised fine-tuning (SFT), leaving online reinforcement learning (RL) post-training largely underexplored. In this work, we introduce an online RL post-training framework that optimizes a pretrained video generator for precise camera control. To make RL effective in this setting, we design a verifiable geometry reward that delivers dense segment-level feedback to guide model optimization. Specifically, we estimate the 3D camera trajectories for both generated and reference videos, divide each trajectory into short segments, and compute segment-wise relative poses. The reward function then compares each generated-reference segment pair and assigns an alignment score as the reward signal, which helps alleviate reward sparsity and improve optimization efficiency. Moreover, we construct a comprehensive dataset featuring diverse large-amplitude camera motions and scenes with varied subject dynamics. Extensive experiments show that our online RL post-training clearly outperforms SFT baselines across multiple aspects, including camera-control accuracy, geometric consistency, and visual quality, demonstrating its superiority in advancing camera-controlled video generation.
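The abstract's reward pipeline (estimate trajectories, split into short segments, compare segment-wise relative poses, score alignment) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the segment length, the error weights `alpha`/`beta`, and the exponential error-to-reward mapping are all assumptions, and translation is compared by direction only to sidestep the scale ambiguity of monocular trajectory estimation.

```python
import numpy as np

def relative_pose(T_a, T_b):
    # Relative transform from pose a to pose b (4x4 homogeneous matrices).
    return np.linalg.inv(T_a) @ T_b

def rotation_angle(R):
    # Geodesic angle (radians) of a rotation matrix.
    cos = (np.trace(R) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

def segment_rewards(gen_poses, ref_poses, seg_len=4, alpha=1.0, beta=1.0):
    """Dense segment-level alignment reward (sketch).

    gen_poses, ref_poses: equal-length lists of 4x4 camera poses estimated
    from the generated and reference videos. Returns one score per segment,
    giving frame-region-wise feedback rather than a single sparse scalar.
    """
    rewards = []
    for s in range(0, len(gen_poses) - seg_len, seg_len):
        # Relative pose across this segment, for each trajectory.
        Tg = relative_pose(gen_poses[s], gen_poses[s + seg_len])
        Tr = relative_pose(ref_poses[s], ref_poses[s + seg_len])
        # Rotation misalignment between the two segment-wise relative poses.
        rot_err = rotation_angle(Tg[:3, :3].T @ Tr[:3, :3])
        # Compare translation *directions* only (monocular scale is ambiguous).
        tg, tr = Tg[:3, 3], Tr[:3, 3]
        ng, nr = np.linalg.norm(tg), np.linalg.norm(tr)
        if ng > 1e-8 and nr > 1e-8:
            trans_err = np.arccos(np.clip(tg @ tr / (ng * nr), -1.0, 1.0))
        else:
            trans_err = 0.0
        # Map combined error to a bounded reward in (0, 1].
        rewards.append(float(np.exp(-alpha * rot_err - beta * trans_err)))
    return rewards
```

A perfectly aligned trajectory pair scores 1.0 on every segment; per-segment scores like these are what lets the RL objective avoid the sparsity of a single end-of-trajectory reward.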
Problem

Research questions and friction points this paper is trying to address.

Supervised fine-tuning alone yields insufficient camera-trajectory accuracy
Online RL post-training is unexplored for camera-controlled video generation
Sparse reward signals make RL optimization of video generators inefficient
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online RL post-training framework that optimizes a pretrained video diffusion model without architectural changes
Verifiable geometry reward from segment-wise relative camera poses provides dense, interpretable feedback
High-quality dataset spanning large-amplitude camera motions and scenes with varied subject dynamics