FlashI2V: Fourier-Guided Latent Shifting Prevents Conditional Image Leakage in Image-to-Video Generation

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
In image-to-video (I2V) generation, explicit concatenation of the conditional image with noisy latents often causes "conditional image leakage," leading to motion stagnation, color distortion, and degraded out-of-distribution generalization. To address this, the paper proposes FlashI2V (Fourier-Guided Latent Shifting), which models conditional information implicitly from a distribution-matching perspective. Within a flow matching framework, FlashI2V shifts the source and target distributions by subtracting conditional image information from the noisy latents, avoiding direct concatenation of the conditional image, and uses high-frequency Fourier magnitude features as guidance to accelerate convergence and adjust the detail level of the generated video. The method introduces no additional parameters and achieves a state-of-the-art dynamic degree score of 53.01 on VBench-I2V at the 1.3B scale, outperforming several larger models. It significantly enhances video dynamism and cross-domain robustness, establishing an implicit-conditioning paradigm for I2V synthesis.

📝 Abstract
In Image-to-Video (I2V) generation, a video is created using an input image as the first-frame condition. Existing I2V methods concatenate the full information of the conditional image with noisy latents to achieve high fidelity. However, the denoisers in these methods tend to shortcut the conditional image, which is known as conditional image leakage, leading to performance degradation issues such as slow motion and color inconsistency. In this work, we further clarify that conditional image leakage leads to overfitting to in-domain data and decreases the performance in out-of-domain scenarios. Moreover, we introduce Fourier-Guided Latent Shifting I2V, named FlashI2V, to prevent conditional image leakage. Concretely, FlashI2V consists of: (1) Latent Shifting. We modify the source and target distributions of flow matching by subtracting the conditional image information from the noisy latents, thereby incorporating the condition implicitly. (2) Fourier Guidance. We use high-frequency magnitude features obtained by the Fourier Transform to accelerate convergence and enable the adjustment of detail levels in the generated video. Experimental results show that our method effectively overcomes conditional image leakage and achieves the best generalization and performance on out-of-domain data among various I2V paradigms. With only 1.3B parameters, FlashI2V achieves a dynamic degree score of 53.01 on VBench-I2V, surpassing CogVideoX1.5-5B-I2V and Wan2.1-I2V-14B-480P. GitHub page: https://pku-yuangroup.github.io/FlashI2V/
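The latent-shifting idea from the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the shift simply subtracts a scaled conditional latent (scale `gamma` is a hypothetical parameter) from both endpoints of the standard flow-matching interpolation, so the condition enters the trajectory implicitly rather than by concatenation.

```python
import numpy as np

def shifted_flow_matching_pair(noise, target_latent, cond_latent, t, gamma=0.5):
    """Hypothetical latent shifting: subtract scaled conditional-image
    information from both flow-matching endpoints, so the condition is
    carried implicitly in the trajectory instead of being concatenated.
    `gamma` is an assumed shift scale, not from the paper."""
    x0 = noise - gamma * cond_latent           # shifted source sample
    x1 = target_latent - gamma * cond_latent   # shifted target sample
    x_t = (1.0 - t) * x0 + t * x1              # standard FM interpolation
    v_target = x1 - x0                         # velocity regression target
    return x_t, v_target

rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 8, 8))
latent = rng.standard_normal((4, 8, 8))
cond = rng.standard_normal((4, 8, 8))
x_t, v = shifted_flow_matching_pair(noise, latent, cond, t=0.3)
# In this sketch the shift cancels in the velocity target (v == latent - noise),
# while the interpolated state x_t still depends on the condition.
assert np.allclose(v, latent - noise)
```

One consequence visible even in this toy version: the conditional term cancels out of the velocity target but still displaces every intermediate state, which matches the abstract's claim that the condition is incorporated implicitly rather than exposed directly to the denoiser.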
Problem

Research questions and friction points this paper is trying to address.

Prevents conditional image leakage in video generation
Addresses overfitting to in-domain data limitations
Improves generalization for out-of-domain scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fourier-guided latent shifting prevents conditional image leakage
Latent shifting implicitly incorporates condition by subtracting image information
Fourier guidance accelerates convergence and adjusts video detail levels
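The Fourier-guidance bullet can be illustrated with a simple high-pass magnitude extraction. This is a hedged sketch, not the paper's feature extractor: the circular low-frequency mask and the `cutoff` radius (as a fraction of the normalized frequency range) are assumptions for illustration.

```python
import numpy as np

def high_freq_magnitude(latent, cutoff=0.25):
    """Illustrative high-frequency feature: FFT the 2D latent, zero out
    the central low-frequency disk of radius `cutoff` (an assumed
    parameter), and return the remaining magnitude spectrum."""
    f = np.fft.fftshift(np.fft.fft2(latent), axes=(-2, -1))
    h, w = latent.shape[-2:]
    yy, xx = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w),
                         indexing="ij")
    low_pass = (yy**2 + xx**2) <= cutoff**2    # central low-frequency disk
    f = np.where(low_pass, 0.0, f)             # keep only high frequencies
    return np.abs(f)

img = np.zeros((16, 16))
img[8, :] = 1.0                                # a sharp edge: high-freq content
feat = high_freq_magnitude(img)
assert feat.shape == (16, 16)
assert feat[8, 8] == 0.0                       # DC region is masked out
```

A magnitude map like this discards phase, so it carries edge and texture strength without pinning down exact pixel placement, which is one plausible reading of how such features could guide detail level without re-leaking the full conditional image.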
Yunyang Ge
Peking University
Xinhua Cheng
Peking University
computer vision
Chengshu Zhao
Peking University, Shenzhen Graduate School
Xianyi He
Peking University, Shenzhen Graduate School
Shenghai Yuan
Peking University, Shenzhen Graduate School
Bin Lin
Peking University, Shenzhen Graduate School
Bin Zhu
Peking University, Shenzhen Graduate School
Li Yuan
Research Associate, University of Science & Technology of China (USTC)
Antibiotic resistance, Wastewater treatment, Environmental bioremediation, Anaerobic digestion, Fate of organic pollutants