Taming Camera-Controlled Video Generation with Verifiable Geometry Reward

📅 2025-12-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing supervised fine-tuning (SFT) methods for camera-controlled video generation suffer from insufficient trajectory accuracy, while online reinforcement learning (RL) post-training remains unexplored in this domain. Method: We propose the first online RL post-training framework tailored for camera control. Its core innovations are: (1) a verifiable geometric reward function that leverages estimated 3D camera trajectories and segment-level relative pose alignment to provide dense, interpretable frame-wise feedback; and (2) a high-quality camera-control video dataset spanning broad motion ranges and complex dynamic scenes. Contribution/Results: Without modifying the model architecture, our method applies RL to optimize pretrained video diffusion models. It significantly outperforms SFT baselines in camera trajectory accuracy, geometric consistency, and visual quality—demonstrating the effectiveness and generalizability of geometry-guided RL for controllable video generation.

📝 Abstract
Recent advances in video diffusion models have remarkably improved camera-controlled video generation, but most methods rely solely on supervised fine-tuning (SFT), leaving online reinforcement learning (RL) post-training largely underexplored. In this work, we introduce an online RL post-training framework that optimizes a pretrained video generator for precise camera control. To make RL effective in this setting, we design a verifiable geometry reward that delivers dense segment-level feedback to guide model optimization. Specifically, we estimate the 3D camera trajectories for both generated and reference videos, divide each trajectory into short segments, and compute segment-wise relative poses. The reward function then compares each generated-reference segment pair and assigns an alignment score as the reward signal, which helps alleviate reward sparsity and improve optimization efficiency. Moreover, we construct a comprehensive dataset featuring diverse large-amplitude camera motions and scenes with varied subject dynamics. Extensive experiments show that our online RL post-training clearly outperforms SFT baselines across multiple aspects, including camera-control accuracy, geometric consistency, and visual quality, demonstrating its superiority in advancing camera-controlled video generation.
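The abstract's reward pipeline (estimate trajectories, split into short segments, compare segment-wise relative poses, score alignment) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the segment length, the error weights `alpha`/`beta`, and the exponential error-to-reward mapping are all assumptions, and translation is compared by direction only to sidestep the scale ambiguity of monocular trajectory estimation.

```python
import numpy as np

def relative_pose(T_a, T_b):
    # Relative transform from pose a to pose b (4x4 homogeneous matrices).
    return np.linalg.inv(T_a) @ T_b

def rotation_angle(R):
    # Geodesic angle (radians) of a rotation matrix.
    cos = (np.trace(R) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

def segment_rewards(gen_poses, ref_poses, seg_len=4, alpha=1.0, beta=1.0):
    """Dense segment-level alignment reward (sketch).

    gen_poses, ref_poses: equal-length lists of 4x4 camera poses estimated
    from the generated and reference videos. Returns one score per segment,
    giving frame-region-wise feedback rather than a single sparse scalar.
    """
    rewards = []
    for s in range(0, len(gen_poses) - seg_len, seg_len):
        # Relative pose across this segment, for each trajectory.
        Tg = relative_pose(gen_poses[s], gen_poses[s + seg_len])
        Tr = relative_pose(ref_poses[s], ref_poses[s + seg_len])
        # Rotation misalignment between the two segment-wise relative poses.
        rot_err = rotation_angle(Tg[:3, :3].T @ Tr[:3, :3])
        # Compare translation *directions* only (monocular scale is ambiguous).
        tg, tr = Tg[:3, 3], Tr[:3, 3]
        ng, nr = np.linalg.norm(tg), np.linalg.norm(tr)
        if ng > 1e-8 and nr > 1e-8:
            trans_err = np.arccos(np.clip(tg @ tr / (ng * nr), -1.0, 1.0))
        else:
            trans_err = 0.0
        # Map combined error to a bounded reward in (0, 1].
        rewards.append(float(np.exp(-alpha * rot_err - beta * trans_err)))
    return rewards
```

A perfectly aligned trajectory pair scores 1.0 on every segment; per-segment scores like these are what lets the RL objective avoid the sparsity of a single end-of-trajectory reward.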
Problem

Research questions and friction points this paper is trying to address.

Supervised fine-tuning alone yields insufficient camera-trajectory accuracy
Online RL post-training is unexplored for camera-controlled video generation
Sparse reward signals make RL optimization of video generators inefficient
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online RL post-training framework that optimizes a pretrained video diffusion model without architectural changes
Verifiable geometry reward from segment-wise relative camera poses provides dense, interpretable feedback
High-quality dataset spanning large-amplitude camera motions and scenes with varied subject dynamics