🤖 AI Summary
This work addresses imprecise motion modeling and unnatural object motion in camera-controllable video generation. We propose FloVD, a flow-guided video diffusion model that explicitly integrates optical flow as a motion prior into the video diffusion framework: (1) a first stage generates structurally coherent optical flow fields; (2) a second stage synthesizes videos conditioned on these flows. Because background flow implicitly encodes 3D correlations across viewpoints, the model supports arbitrary 6-DoF camera control without requiring ground-truth camera parameters or paired annotations. Experiments show that FloVD achieves lower camera-trajectory tracking error and higher motion-naturalness scores than state-of-the-art methods, while preserving object-motion consistency and fine-grained camera controllability without per-scene supervision or explicit 3D priors.
📝 Abstract
This paper presents FloVD, a novel optical-flow-based video diffusion model for camera-controllable video generation. FloVD leverages optical flow maps to represent the motion of both the camera and moving objects, which offers two key benefits. First, since optical flow can be estimated directly from videos, our approach can use arbitrary training videos without ground-truth camera parameters. Second, since background optical flow encodes 3D correlation across different viewpoints, our method enables detailed camera control by leveraging the background motion. To synthesize natural object motion while supporting detailed camera control, our framework adopts a two-stage video synthesis pipeline consisting of optical flow generation followed by flow-conditioned video synthesis. Extensive experiments demonstrate the superiority of our method over previous approaches in terms of accurate camera control and natural object motion synthesis.
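The claim that background optical flow encodes 3D correlation across viewpoints follows from standard multi-view geometry: given per-pixel depth, camera intrinsics, and a relative camera pose, the camera-induced background flow is fully determined by back-projecting each pixel, transforming it to the new view, and re-projecting. The sketch below illustrates this relationship with NumPy under a pinhole camera model; the function name and setup are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def camera_induced_flow(depth, K, R, t):
    """Optical flow of static background induced by a camera move.

    depth : (H, W) per-pixel depth in the source view
    K     : (3, 3) pinhole intrinsics
    R, t  : relative rotation (3, 3) and translation (3,) to the target view
    Returns a (H, W, 2) flow field (illustrative sketch, not FloVD's code).
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Homogeneous pixel coordinates, shape (3, N)
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T.astype(float)
    rays = np.linalg.inv(K) @ pix          # back-project to normalized rays
    pts = rays * depth.reshape(1, -1)      # 3D points in the source camera frame
    pts2 = R @ pts + t.reshape(3, 1)       # transform into the target camera frame
    proj = K @ pts2
    uv = proj[:2] / proj[2:]               # perspective divide
    return (uv - pix[:2]).T.reshape(h, w, 2)  # 2D displacement per pixel
```

For a pure x-translation `t = (tx, 0, 0)` with constant depth `d`, this yields a uniform flow of `fx * tx / d` pixels, showing how any 6-DoF trajectory maps to a dense background-flow condition.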