Geo-Align: Video Generation Alignment via Metric Geometry Reward

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video re-rendering methods rely on supervised fine-tuning with synthetic data, so they often fail to adhere to physical scale and camera trajectories in real-world scenes, which limits their generalization. This work is the first to introduce reinforcement learning into camera-controlled video re-rendering. It proposes an unpaired training framework, built on a pre-trained video generation model, that trains with real videos as conditioning and target camera trajectories taken from synthetic data. A geometry-aware reward explicitly enforces 3D scale consistency and physically plausible camera motion. Experiments show that the approach consistently outperforms existing supervised learning baselines in both camera control accuracy and visual fidelity, without requiring synchronized multi-view real video data.

📝 Abstract

Camera-controlled video generation has achieved remarkable progress in recent years. However, existing video-to-video re-rendering methods primarily rely on supervised fine-tuning with synthetic datasets, because synchronized, multi-view real-world video data are extremely scarce. Consequently, the prevailing paradigm often generalizes poorly to out-of-distribution real-world videos, with models struggling to adhere to physical scales and camera trajectories. To bridge this gap, we propose Geo-Align, the first reinforcement learning framework specifically designed for camera-controlled video re-rendering. Starting from a pretrained model, we optimize it with a scale-aware perceptual reward. Specifically, we introduce a metric 3D estimator that extracts camera trajectories from generated videos, and we explicitly penalize deviations in rotation and translation from the target trajectory. Furthermore, we design a data pipeline that combines real-world conditioning videos with target camera trajectories derived from synthetic data, eliminating the reliance on paired data. Extensive experiments demonstrate that Geo-Align consistently outperforms existing supervised learning baselines in both camera controllability and visual fidelity.
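
The abstract does not give the exact form of the reward, so the following is only a minimal sketch of how a penalty on rotation and translation deviations could be computed. It assumes the metric 3D estimator has already produced per-frame camera poses as 4x4 camera-to-world matrices in metric units. The function names, the weights `w_rot` and `w_trans`, and the choice of geodesic angle plus Euclidean distance are all illustrative assumptions, not the authors' formulation.

```python
import numpy as np


def rotation_error_rad(R_pred: np.ndarray, R_tgt: np.ndarray) -> float:
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    R_rel = R_pred.T @ R_tgt
    cos_theta = (np.trace(R_rel) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))


def trajectory_reward(
    pred_poses: np.ndarray,
    tgt_poses: np.ndarray,
    w_rot: float = 1.0,    # hypothetical weight on rotation error
    w_trans: float = 1.0,  # hypothetical weight on translation error
) -> float:
    """Negative weighted pose error between two trajectories.

    pred_poses, tgt_poses: arrays of shape (T, 4, 4), assumed to be
    camera-to-world matrices expressed in the same metric frame.
    """
    if pred_poses.shape != tgt_poses.shape:
        raise ValueError("trajectories must have the same shape")

    # Mean per-frame rotation deviation, in radians.
    rot_err = np.mean(
        [
            rotation_error_rad(p[:3, :3], q[:3, :3])
            for p, q in zip(pred_poses, tgt_poses)
        ]
    )
    # Mean per-frame translation deviation, in the poses' metric unit.
    trans_err = np.mean(
        np.linalg.norm(pred_poses[:, :3, 3] - tgt_poses[:, :3, 3], axis=1)
    )
    # Higher reward means the generated trajectory is closer to the target.
    return -float(w_rot * rot_err + w_trans * trans_err)
```

Because the translation term is measured in metric units rather than up to an unknown scale, a video whose camera moves in the right direction but by the wrong distance would still be penalized under this sketch, which is the kind of scale sensitivity the abstract describes.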

Problem

Research questions and friction points this paper is trying to address.

camera-controlled video generation

video-to-video re-rendering

real-world video generalization

physical scale adherence

camera trajectory accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning

Camera-Controlled Video Generation

Metric Geometry Reward