🤖 AI Summary
Pretrained text-to-video (T2V) models often exhibit motion suppression when adapted to image-to-video (I2V) generation, as high-frequency details in the conditioning image interfere with dynamic modeling, yielding overly static outputs. To address this, we propose Adaptive Low-pass Guidance (ALG), a plug-and-play mechanism that applies frequency-domain adaptive low-pass filtering to the conditioning image during early diffusion sampling steps. ALG selectively attenuates noise and redundant high-frequency components while preserving semantically critical structures—enabling joint optimization of motion coherence and frame fidelity without architectural modification or model retraining. To our knowledge, this is the first work to mitigate I2V static overfitting from a frequency-domain perspective. On the VBench-I2V benchmark, ALG improves motion dynamics by 36% on average, with no statistically significant degradation in image fidelity or text alignment, and maintains stable overall video quality.
📝 Abstract
Recent text-to-video (T2V) models have demonstrated strong capabilities in producing high-quality, dynamic videos. To improve visual controllability, recent works have considered fine-tuning pre-trained T2V models to support image-to-video (I2V) generation. However, such adaptation frequently suppresses the motion dynamics of generated outputs, resulting in more static videos compared to their T2V counterparts. In this work, we analyze this phenomenon and identify that it stems from premature exposure to high-frequency details in the input image, which biases the sampling process toward a shortcut trajectory that overfits to the static appearance of the reference image. To address this, we propose adaptive low-pass guidance (ALG), a simple fix to the I2V model sampling procedure that generates more dynamic videos without compromising per-frame image quality. Specifically, ALG adaptively modulates the frequency content of the conditioning image by applying low-pass filtering at the early stage of denoising. Extensive experiments demonstrate that ALG significantly improves the temporal dynamics of generated videos while preserving image fidelity and text alignment. In particular, on the VBench-I2V test suite, ALG achieves an average improvement of 36% in dynamic degree without a significant drop in video quality or image fidelity.
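The core idea, as described above, is to low-pass filter the conditioning image only during the early denoising steps, then condition on the full-detail image afterward. A minimal sketch of that schedule is below; the function names (`low_pass`, `alg_condition`) and the specific cutoff schedule and switching point (`t_switch`, `base_cutoff`) are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def low_pass(image, cutoff_ratio):
    """Frequency-domain low-pass filter: keep only spatial frequencies
    within a circular mask whose radius is cutoff_ratio of the Nyquist band."""
    f = np.fft.fftshift(np.fft.fft2(image, axes=(0, 1)), axes=(0, 1))
    h, w = image.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    radius = cutoff_ratio * min(h, w) / 2
    mask = (dist <= radius).astype(float)
    if image.ndim == 3:          # broadcast mask over channels
        mask = mask[..., None]
    filtered = np.fft.ifft2(np.fft.ifftshift(f * mask, axes=(0, 1)), axes=(0, 1))
    return np.real(filtered)

def alg_condition(image, step, num_steps, t_switch=0.25, base_cutoff=0.1):
    """Adaptive low-pass guidance (sketch): blur the conditioning image
    during the first t_switch fraction of denoising steps, gradually
    relaxing the filter, then pass the image through unchanged.
    t_switch and base_cutoff are hypothetical hyperparameters."""
    progress = step / num_steps
    if progress >= t_switch:
        return image  # late steps: full-detail conditioning
    # cutoff grows from base_cutoff toward 1.0 as we approach t_switch,
    # so high-frequency detail is reintroduced progressively
    cutoff = base_cutoff + (1.0 - base_cutoff) * (progress / t_switch)
    return low_pass(image, cutoff)
```

In a sampler loop, `alg_condition(ref_image, step, num_steps)` would replace the raw reference image fed to the I2V model at each step; since no weights change, this stays plug-and-play with any pretrained model.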