🤖 AI Summary
Existing video motion editing methods struggle to control camera motion and non-rigid object deformation precisely at the same time, and they lack full-scene spatiotemporal consistency and fine-grained controllability. To address this, we propose a video motion editing framework built on 3D point trajectories. For the first time, sparse 3D point trajectories serve as explicit control signals; their explicit depth cues help the model resolve occlusions and depth ordering and capture complex motion dependencies. The video generation model is trained in two stages, on synthetic and then real-world data, to jointly handle camera pose and object deformation while establishing spatiotemporal correspondence between source and target motions. The method enables joint camera-object editing, motion transfer, and non-rigid deformation editing. Extensive experiments show clear gains in editing accuracy and visual coherence, especially in challenging, cluttered scenes, while preserving global spatiotemporal consistency and enabling precise, localized motion control.
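To make "sparse 3D point trajectories as explicit control signals" concrete, here is a minimal Python/PyTorch sketch of how paired source/target tracks could be turned into conditioning tokens for a video generation backbone. The `TrackConditioner` module, its tensor shapes, and the idea of injecting the tokens via cross-attention are our illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class TrackConditioner(nn.Module):
    """Hypothetical encoder for paired 3D point tracks.

    Source and target tracks have shape (B, T, N, 3): batch, frames,
    tracked points, and xyz position (depth is explicit in z).
    """

    def __init__(self, d_model: int = 256):
        super().__init__()
        # Embed each (source_xyz, target_xyz) pair per point per frame.
        self.point_mlp = nn.Sequential(
            nn.Linear(6, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, src_tracks: torch.Tensor, tgt_tracks: torch.Tensor):
        # Concatenate matching source/target positions so the model sees
        # explicit motion correspondences, including depth ordering.
        pairs = torch.cat([src_tracks, tgt_tracks], dim=-1)  # (B, T, N, 6)
        tokens = self.point_mlp(pairs)                       # (B, T, N, d)
        return tokens.flatten(1, 2)                          # (B, T*N, d)


if __name__ == "__main__":
    B, T, N = 1, 16, 64
    cond = TrackConditioner()
    src = torch.randn(B, T, N, 3)   # tracks lifted from the source video
    tgt = torch.randn(B, T, N, 3)   # user-edited target trajectories
    print(cond(src, tgt).shape)     # torch.Size([1, 1024, 256])
```

In a full model, tokens like these would typically condition the generator alongside the source video, e.g. as keys/values in cross-attention layers; the exact injection mechanism here is an assumption.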
📝 Abstract
Camera and object motions are central to a video's narrative. However, precisely editing these captured motions remains a significant challenge, especially under complex object movements. Current motion-controlled image-to-video (I2V) approaches often lack full-scene context for consistent video editing, while video-to-video (V2V) methods provide viewpoint changes or basic object translation but offer only limited control over fine-grained object motion. We present a track-conditioned V2V framework that enables joint editing of camera and object motion. We achieve this by conditioning a video generation model on a source video and paired 3D point tracks representing source and target motions. These 3D tracks establish sparse correspondences that transfer rich context from the source video to new motions while preserving spatiotemporal coherence. Crucially, compared to 2D tracks, 3D tracks provide explicit depth cues, allowing the model to resolve depth order and handle occlusions for precise motion editing. Trained in two stages on synthetic and real data, our model supports diverse motion edits, including joint camera/object manipulation, motion transfer, and non-rigid deformation, unlocking new creative potential in video editing.
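To illustrate why explicit depth matters for this kind of edit, the toy NumPy sketch below constructs target 3D tracks from source tracks: a rigid transform emulates a camera edit, and a per-frame offset on one object's points emulates an object motion edit on top of it. The function `edit_tracks`, its signature, and all numerical values are hypothetical, not the paper's pipeline.

```python
import numpy as np

def edit_tracks(src_tracks, R, t, object_mask=None, object_offset=None):
    """Toy construction of target 3D tracks from source tracks.

    src_tracks:    (T, N, 3) 3D point trajectories in camera coordinates.
    R, t:          new camera rotation (3, 3) and translation (3,) applied
                   to every point, emulating a camera motion edit.
    object_mask:   optional (N,) bool mask selecting one object's points.
    object_offset: optional (T, 3) per-frame displacement for that object,
                   emulating an object motion edit on top of the camera edit.

    Because the tracks are 3D, depth order after the edit stays unambiguous,
    which a purely 2D-track representation could not guarantee.
    """
    tgt = src_tracks @ R.T + t  # rigid camera transform
    if object_mask is not None and object_offset is not None:
        tgt[:, object_mask] += object_offset[:, None, :]
    return tgt

T, N = 16, 64
src = np.random.rand(T, N, 3)
theta = np.deg2rad(10.0)
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])  # small camera yaw
t = np.array([0.0, 0.0, 0.1])
mask = np.zeros(N, dtype=bool); mask[:8] = True     # points on one object
lift = np.linspace([0, 0, 0], [0, 0.2, 0], T)       # raise that object
tgt = edit_tracks(src, R, t, mask, lift)            # (T, N, 3) target tracks
```

In the framework described above, the pair (source tracks, edited target tracks) would then be fed to the generator as the motion-editing condition.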