🤖 AI Summary
Existing image-to-video (I2V) generation models often produce too little motion because the reference frame dominates generation. This work traces the problem to the self-attention mechanism, where non-reference frames attend excessively to the reference frame's key tokens. To address this, the authors propose DyMoS (Dynamic Motion Slider), a training-free, model-agnostic method. During the initial denoising steps of the diffusion model, DyMoS rebalances the attention pathway from generated frames to the reference frame. A single scalar parameter gives continuous control over motion strength, and neither the input image nor the model weights are altered. Experiments across multiple state-of-the-art I2V backbones show that DyMoS consistently improves motion dynamics while maintaining visual quality and fidelity to the reference image.
📝 Abstract
Image-to-video (I2V) models often generate videos that are overly static compared with those from text-to-video models. Prior approaches mitigate this issue by weakening or modifying the image-conditioning signal, but they often require additional training or sacrifice fidelity to the reference image. In this work, we identify reference-frame dominance as a key mechanism behind motion suppression. We observe that non-reference frames in I2V models allocate excessive self-attention to reference-frame key tokens, which over-propagates reference information across time and suppresses inter-frame dynamics. Based on this finding, we propose DyMoS (Dynamic Motion Slider), a training-free and model-agnostic method that rebalances the attention pathway from generated frames to the reference frame during the initial denoising steps. DyMoS leaves both the input image and the model weights unchanged and introduces a single scalar parameter for continuous control over motion strength. Experiments across multiple state-of-the-art I2V backbones demonstrate that DyMoS consistently improves motion dynamics while maintaining visual quality and fidelity to the reference image.
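The abstract does not give the exact formulation, so the following is only a minimal sketch of what rebalancing the attention pathway from generated frames to the reference frame could look like. It assumes the reference frame occupies the first `tokens_per_frame` tokens of a flattened token sequence, and that the scalar acts as a multiplicative weight on the pre-normalization attention from non-reference queries to reference keys. The names `reweighted_self_attention`, `motion_scale`, `scale`, and `early_steps` are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def reweighted_self_attention(q, k, v, tokens_per_frame, scale=1.0):
    """Single-head self-attention with the reference-frame pathway rescaled.

    q, k, v: arrays of shape (num_tokens, dim). ASSUMPTION: the first
    `tokens_per_frame` tokens belong to the reference frame.
    scale: hypothetical positive scalar. 1.0 gives standard attention;
    values below 1.0 reduce how much non-reference queries attend to
    reference-frame keys.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)

    # Adding log(scale) to a logit multiplies its unnormalized weight by
    # `scale`; the softmax then renormalizes each row.
    bias = np.zeros((n, n))
    bias[tokens_per_frame:, :tokens_per_frame] = np.log(scale)

    weights = softmax(logits + bias, axis=-1)
    return weights @ v, weights


def motion_scale(step, early_steps, scale):
    # ASSUMPTION: the rescaling is applied only during the first
    # `early_steps` denoising steps; later steps use standard attention.
    return scale if step < early_steps else 1.0
```

In this sketch, rows belonging to reference-frame queries are left untouched, so only the pathway from generated frames to the reference frame changes. Calling `motion_scale(step, early_steps, scale)` inside a sampling loop and passing the result as `scale` would confine the change to the early steps, in line with the abstract's description; the actual schedule and weighting used by DyMoS may differ.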