🤖 AI Summary
Pretrained text-to-video (T2V) models often exhibit motion suppression when adapted to image-to-video (I2V) generation, as high-frequency details in the conditioning image interfere with dynamic modeling, yielding overly static outputs. To address this, we propose Adaptive Low-pass Guidance (ALG), a plug-and-play mechanism that applies frequency-domain adaptive low-pass filtering to the conditioning image during early diffusion sampling steps. ALG selectively attenuates noise and redundant high-frequency components while preserving semantically critical structures—enabling joint optimization of motion coherence and frame fidelity without architectural modification or model retraining. To our knowledge, this is the first work to mitigate I2V static overfitting from a frequency-domain perspective. On the VBench-I2V benchmark, ALG improves motion dynamics by 36% on average, with no statistically significant degradation in image fidelity or text alignment, and maintains stable overall video quality.
📝 Abstract
Recent text-to-video (T2V) models have demonstrated strong capabilities in producing high-quality, dynamic videos. To improve visual controllability, recent works have considered fine-tuning pre-trained T2V models to support image-to-video (I2V) generation. However, such adaptation frequently suppresses the motion dynamics of generated outputs, resulting in more static videos compared to their T2V counterparts. In this work, we analyze this phenomenon and identify that it stems from premature exposure to high-frequency details in the input image, which biases the sampling process toward a shortcut trajectory that overfits to the static appearance of the reference image. To address this, we propose adaptive low-pass guidance (ALG), a simple fix to the I2V model sampling procedure that generates more dynamic videos without compromising per-frame image quality. Specifically, ALG adaptively modulates the frequency content of the conditioning image by applying low-pass filtering at the early stage of denoising. Extensive experiments demonstrate that ALG significantly improves the temporal dynamics of generated videos while preserving image fidelity and text alignment. In particular, on the VBench-I2V test suite, ALG achieves an average improvement of 36% in dynamic degree without a significant drop in video quality or image fidelity.
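The core idea, as described above, is to low-pass filter the conditioning image only during the early denoising steps, then condition on the full-detail image afterward. A minimal sketch of that schedule is below; the function names (`low_pass`, `alg_condition`) and the specific cutoff schedule and switching point (`t_switch`, `base_cutoff`) are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def low_pass(image, cutoff_ratio):
    """Frequency-domain low-pass filter: keep only spatial frequencies
    within a circular mask whose radius is cutoff_ratio of the Nyquist band."""
    f = np.fft.fftshift(np.fft.fft2(image, axes=(0, 1)), axes=(0, 1))
    h, w = image.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    radius = cutoff_ratio * min(h, w) / 2
    mask = (dist <= radius).astype(float)
    if image.ndim == 3:          # broadcast mask over channels
        mask = mask[..., None]
    filtered = np.fft.ifft2(np.fft.ifftshift(f * mask, axes=(0, 1)), axes=(0, 1))
    return np.real(filtered)

def alg_condition(image, step, num_steps, t_switch=0.25, base_cutoff=0.1):
    """Adaptive low-pass guidance (sketch): blur the conditioning image
    during the first t_switch fraction of denoising steps, gradually
    relaxing the filter, then pass the image through unchanged.
    t_switch and base_cutoff are hypothetical hyperparameters."""
    progress = step / num_steps
    if progress >= t_switch:
        return image  # late steps: full-detail conditioning
    # cutoff grows from base_cutoff toward 1.0 as we approach t_switch,
    # so high-frequency detail is reintroduced progressively
    cutoff = base_cutoff + (1.0 - base_cutoff) * (progress / t_switch)
    return low_pass(image, cutoff)
```

In a sampler loop, `alg_condition(ref_image, step, num_steps)` would replace the raw reference image fed to the I2V model at each step; since no weights change, this stays plug-and-play with any pretrained model.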