🤖 AI Summary
This work proposes a training-free motion-control method for latent video diffusion models, avoiding the additional training and computational overhead that existing conditioning methods often require. It injects low-frequency phase information from a reference video directly into the diffusion noise latents, which transfers motion cues without modifying the model architecture or the inference pipeline. The core idea is a frequency-domain, phase-based manipulation of the noise, which keeps the control pipeline simple and flexible. Across several applications, the method gives effective control over both the appearance and the dynamics of the generated video, with results that are competitive with or superior to those of more complex conditioning approaches.
📝 Abstract
Latent video diffusion models generate videos by progressively transforming Gaussian noise into realistic samples conditioned on text or visual inputs. However, existing conditioning methods often require additional training and computational overhead. Motivated by recent findings on the importance of frequency components in generative models, we propose a simple, training-free approach for motion-conditioned video generation by injecting low-frequency phase information from a reference video directly into the diffusion noise latents. Our method transfers motion cues without modifying the model architecture or inference pipeline. Across several applications, we demonstrate effective control over both appearance and dynamics in generated videos, while achieving competitive or superior results compared to more complex conditioning approaches.
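To make the idea of injecting low-frequency phase into the noise latents concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the transform axes, the frequency cutoff, or how magnitudes are handled. The sketch assumes latents of shape `(C, T, H, W)`, a 3D FFT over time and space, a radial cutoff in normalized frequency, and that the noise keeps its own magnitude spectrum. The names `low_frequency_mask`, `inject_low_frequency_phase`, and `cutoff` are hypothetical.

```python
import numpy as np


def low_frequency_mask(shape, cutoff):
    """Boolean mask over an FFT grid, True where the normalized
    frequency radius is at most `cutoff` (assumed radial cutoff)."""
    grids = np.meshgrid(*[np.fft.fftfreq(n) for n in shape], indexing="ij")
    radius = np.sqrt(sum(g ** 2 for g in grids))
    return radius <= cutoff


def inject_low_frequency_phase(noise, reference, cutoff=0.1):
    """Replace the low-frequency phase of `noise` with that of `reference`.

    Both arrays are assumed to be real-valued latents of shape (C, T, H, W).
    The magnitude spectrum of the noise is kept unchanged (an assumption).
    """
    axes = (-3, -2, -1)  # FFT over time, height, width; channels untouched
    noise_f = np.fft.fftn(noise, axes=axes)
    ref_f = np.fft.fftn(reference, axes=axes)

    # Mask has shape (T, H, W) and broadcasts over the channel axis.
    mask = low_frequency_mask(noise.shape[-3:], cutoff)

    # Reference phase inside the low-frequency band, noise phase elsewhere.
    phase = np.where(mask, np.angle(ref_f), np.angle(noise_f))
    mixed = np.abs(noise_f) * np.exp(1j * phase)

    # Inputs are real and the mask is symmetric, so the spectrum stays
    # Hermitian; the imaginary part is only floating-point residue.
    return np.fft.ifftn(mixed, axes=axes).real
```

In this sketch the returned array would stand in for the initial Gaussian noise passed to the sampler, with `reference` being the encoded latents of the reference video. Keeping the noise magnitudes leaves the per-frequency energy of the latent unchanged, which is one plausible way to stay close to the noise statistics the model expects; the actual method may differ.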