Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing image-to-video generation models often produce insufficient motion because the reference frame dominates generation. This work identifies, for the first time, that the issue stems from non-reference frames attending excessively to the keys and values of the reference frame within the self-attention mechanism. To address this, the authors propose DyMoS, a training-free, model-agnostic dynamic motion modulation method. DyMoS reweights cross-frame attention pathways during the initial denoising stage of diffusion models using a single scalar parameter, enabling continuous control over motion intensity without altering the input image or model weights. Experiments show that DyMoS enhances temporal dynamics across multiple state-of-the-art image-to-video models while preserving visual quality and fidelity to the reference image.

📝 Abstract

Image-to-video (I2V) models often generate videos that remain overly static compared with those from text-to-video models. While prior approaches mitigate this issue by weakening or modifying the image-conditioning signal, they often require additional training or sacrifice fidelity to the reference image. In this work, we identify reference-frame dominance as a key mechanism behind motion suppression. We observe that non-reference frames in I2V models allocate excessive self-attention to reference-frame key tokens, causing reference information to be over-propagated across time and suppressing inter-frame dynamics. Based on this finding, we propose DyMoS (Dynamic Motion Slider), a training-free and model-agnostic method that rebalances the attention pathway from generated frames to the reference frame during initial denoising steps. DyMoS leaves both the input image and model weights unchanged and introduces a single scalar parameter for continuous control over motion strength. Experiments across multiple state-of-the-art I2V backbones demonstrate that DyMoS consistently improves motion dynamics while maintaining visual quality and fidelity to the reference image.
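
The attention rebalancing described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the exact reweighting rule, so the sketch assumes one plausible form, in which the unnormalized attention weight from every non-reference query to every reference-frame key is multiplied by a scalar before renormalization. The names `rebalanced_attention`, `scale_for_step`, `ref_len`, `scale`, and `num_modulated_steps` are hypothetical, and the token layout (reference-frame tokens first) is an assumption.

```python
import numpy as np


def rebalanced_attention(q, k, v, ref_len, scale):
    """Single-head self-attention with a scaled generated-to-reference pathway.

    q, k, v: arrays of shape (tokens, dim). The first `ref_len` tokens are
    assumed to belong to the reference frame (an illustrative layout).
    scale: hypothetical motion knob, must be > 0. A value of 1.0 gives
    standard attention; values below 1.0 reduce attention to the reference.
    """
    dim = q.shape[-1]
    logits = q @ k.T / np.sqrt(dim)

    # Adding log(scale) to a logit multiplies its unnormalized softmax
    # weight by `scale`. Only rows for non-reference queries and columns
    # for reference keys are touched.
    bias = np.zeros_like(logits)
    bias[ref_len:, :ref_len] = np.log(scale)
    logits = logits + bias

    # Numerically stable softmax over the key axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


def scale_for_step(step, num_modulated_steps, scale):
    """Use `scale` only for the first denoising steps, then fall back to 1.0."""
    return scale if step < num_modulated_steps else 1.0


# Toy data: 4 reference-frame tokens followed by 8 generated-frame tokens.
rng = np.random.default_rng(0)
ref_len, total, dim = 4, 12, 8
q = rng.normal(size=(total, dim))
k = rng.normal(size=(total, dim))
v = rng.normal(size=(total, dim))

_, base_w = rebalanced_attention(q, k, v, ref_len, scale=1.0)
_, mod_w = rebalanced_attention(q, k, v, ref_len, scale=0.5)

# Average attention mass that generated tokens place on the reference frame.
print("baseline :", base_w[ref_len:, :ref_len].sum(axis=-1).mean())
print("modulated:", mod_w[ref_len:, :ref_len].sum(axis=-1).mean())
```

In a real I2V backbone the same idea would be applied inside the model's self-attention layers, with `scale_for_step` choosing the scalar per denoising step so that later steps run unmodified. Which layers and how many steps to modulate are left open here because the abstract does not specify them.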

Problem

Research questions and friction points this paper is trying to address.

image-to-video

motion suppression

reference-frame dominance

video generation

temporal dynamics

Innovation

Methods, ideas, or system contributions that make the work stand out.

reference-frame dominance

image-to-video

attention rebalancing

motion enhancement

training-free method

🔎 Similar Papers

Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion

2024-08-01, arXiv.org, Citations: 4

Generalizable Implicit Motion Modeling for Video Frame Interpolation

2024-07-11, Neural Information Processing Systems, Citations: 0

ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way

2024-10-08, Citations: 0