🤖 AI Summary
Existing video diffusion models exhibit latent motion awareness, yet lack effective mechanisms to harness it. This work proposes Moaw, a framework that explicitly unlocks their motion understanding through supervised training, repurposing video diffusion models from generation to dense motion tracking. By constructing a high-quality motion-annotated dataset, the method extracts highly discriminative motion features and injects them into a structurally identical generative model. Notably, Moaw enables zero-shot cross-model motion transfer without requiring additional adapters or fine-tuning. This study establishes a novel paradigm bridging generative modeling and motion understanding, achieving high-quality, controllable video motion transfer in a plug-and-play manner.
📝 Abstract
Video diffusion models, trained on large-scale datasets, naturally capture correspondences of shared features across frames. Recent works have exploited this property for tasks such as optical flow prediction and tracking in a zero-shot setting. Motivated by these findings, we investigate whether supervised training can more fully harness the tracking capability of video diffusion models. To this end, we propose Moaw, a framework that unleashes the motion awareness of video diffusion models and leverages it to facilitate motion transfer. Specifically, we train a diffusion model for motion perception, shifting its task from image-to-video generation to video-to-dense-tracking. We then construct a motion-labeled dataset to identify the features that encode the strongest motion information, and inject them into a structurally identical video generation model. Owing to the homogeneity of the two networks, these features can be adopted naturally in a zero-shot manner, enabling motion transfer without additional adapters. Our work provides a new paradigm for bridging generative modeling and motion understanding, paving the way for more unified and controllable video learning frameworks.
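The core mechanism described above, recording intermediate features from a motion-trained network and injecting them at the matching layer of a structurally identical generator, can be illustrated with a minimal sketch. This is not the paper's implementation: the networks are stand-in random linear blocks, and all names (`make_layers`, `forward`, `inject_layer`) are hypothetical; only the tap-and-inject pattern reflects the idea in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layers(n_layers, dim):
    """Random linear blocks standing in for two identical network architectures."""
    return [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(n_layers)]

def forward(layers, x, taps=None, inject=None):
    """Run x through layers; record features at `taps`, overwrite them at `inject`."""
    feats = {}
    for i, w in enumerate(layers):
        x = np.tanh(x @ w)
        if inject is not None and i in inject:
            x = inject[i]          # zero-shot injection: replace generator features
        if taps is not None and i in taps:
            feats[i] = x           # record motion features from the tracking model
    return x, feats

dim, n_layers, inject_layer = 16, 4, 2
motion_net = make_layers(n_layers, dim)   # trained for video-to-dense-tracking (here: random)
gen_net = make_layers(n_layers, dim)      # structurally identical generation model

video = rng.standard_normal((8, dim))     # toy per-frame features for 8 frames
_, motion_feats = forward(motion_net, video, taps={inject_layer})
out, _ = forward(gen_net, video, inject={inject_layer: motion_feats[inject_layer]})
```

Because both stacks have the same layer shapes, the recorded feature drops into the generator without any adapter, which is the homogeneity argument the abstract makes.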