🤖 AI Summary
This work addresses the challenge of jointly controlling camera trajectories and preserving object motion consistency in video generation. We propose an implicit pixel-wise motion flow modeling framework that unifies camera and object motions into a single joint motion flow representation. Our method leverages reference motion maps for guidance and incorporates semantic object priors as constraints to jointly optimize motion coherence and cross-scene generalizability. Built upon the Stable Diffusion architecture, the model integrates an image-to-video generation network with a semantic prior module and is trained end-to-end. Evaluated across diverse complex camera motions, including orbiting, pitching, and zooming, our approach surpasses state-of-the-art methods in motion fidelity, trajectory tracking accuracy, and object motion stability. The proposed paradigm establishes a scalable, controllable motion modeling framework for video generation.
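The core idea of unifying camera and object motion into one pixel-wise flow field can be illustrated with a minimal sketch. The function names and the affine camera model below are illustrative assumptions, not the paper's actual formulation: a global 2×3 affine stands in for the camera-induced motion, and the joint flow is simply its composition with a per-pixel object flow.

```python
import numpy as np

def camera_flow(h, w, affine):
    """Per-pixel displacement induced by a global 2x3 affine camera motion.

    `affine` maps homogeneous pixel coordinates [x, y, 1] to new positions;
    the flow is the difference between moved and original coordinates.
    """
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    coords = np.stack([xs, ys, np.ones_like(xs)], axis=-1)   # (h, w, 3)
    moved = coords @ affine.T                                 # (h, w, 2)
    return moved - np.stack([xs, ys], axis=-1)

def joint_flow(cam_flow, obj_flow):
    """Unify camera- and object-induced motion into one pixel-wise flow field."""
    return cam_flow + obj_flow
```

Modeling both motions in this single representation is what avoids the camera/object ambiguity: the network only ever has to predict one flow field per pixel.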
📝 Abstract
Generating videos guided by camera trajectories poses significant challenges in achieving consistency and generalizability, particularly when both camera and object motions are present. Existing approaches often attempt to learn these motions separately, which can confuse the relative motion between the camera and the objects. To address this challenge, we propose a novel approach that integrates both camera and object motions by converting them into the motion of the corresponding pixels. Using a Stable Diffusion network, we learn reference motion maps conditioned on the specified camera trajectory. These maps, together with an extracted semantic object prior, are then fed into an image-to-video network to generate a video that accurately follows the designated camera trajectory while maintaining consistent object motions. Extensive experiments verify that our model outperforms state-of-the-art methods by a large margin.
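The two-stage data flow described above can be sketched as plain stubs. All function names and tensor shapes here are hypothetical placeholders for the paper's networks (a diffusion model for motion maps, a semantic extractor, and an image-to-video generator); the stand-in bodies only make the wiring between stages concrete.

```python
import numpy as np

def predict_motion_maps(first_frame, camera_trajectory):
    # Stage 1 stand-in: a diffusion network would predict per-frame
    # reference motion maps from the trajectory; here we return zero flow.
    t = len(camera_trajectory)
    h, w, _ = first_frame.shape
    return np.zeros((t, h, w, 2))

def extract_semantic_prior(first_frame):
    # Stand-in for the semantic object prior (e.g. a soft object mask).
    return (first_frame.mean(axis=-1) > 0.5).astype(np.float32)

def image_to_video(first_frame, motion_maps, prior):
    # Stage 2 stand-in: the real network synthesizes frames conditioned on
    # the motion maps and the prior; here we just repeat the first frame.
    return np.stack([first_frame] * len(motion_maps))

def generate(first_frame, camera_trajectory):
    maps = predict_motion_maps(first_frame, camera_trajectory)
    prior = extract_semantic_prior(first_frame)
    return image_to_video(first_frame, maps, prior)
```

The point of the sketch is the interface: the camera trajectory only enters through the motion maps, so the video generator never reasons about camera and object motion separately.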