FADE: Frequency-Aware Diffusion Model Factorization for Video Editing

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video diffusion models achieve high generation quality but incur substantial computational overhead and offer limited support for efficient temporal editing (e.g., motion retargeting); conversely, image-editing methods lack explicit modeling of video dynamics. To address this, we propose a training-free, frequency-aware factorization framework for diffusion models. We first characterize the temporal priors of pre-trained video diffusion models in the frequency domain; guided by the attention spectrum, we then specialize each model component via frequency-aware factorization and couple it with spectrum-guided sampling modulation. Our method preserves spatiotemporal coherence while significantly improving both the flexibility and fidelity of text-driven video editing. It enables high-quality results on challenging tasks, including motion retargeting on real-world videos, outperforming state-of-the-art image-based transfer and fine-tuning approaches in both qualitative and quantitative evaluations.
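As a rough illustration of the frequency-domain characterization described above, the sketch below computes an average temporal magnitude spectrum of video-diffusion latents. The (B, C, T, H, W) layout and the `temporal_spectrum` helper are assumptions for exposition, not code from the paper.

```python
import torch

def temporal_spectrum(latents: torch.Tensor) -> torch.Tensor:
    """Average temporal magnitude spectrum of video latents.

    `latents` is assumed to be (B, C, T, H, W), the layout common to
    video diffusion backbones; this helper is illustrative only, not
    taken from the FADE codebase.
    """
    # Real FFT along the frame axis yields T // 2 + 1 frequency bins,
    # from static content (bin 0) up to the fastest temporal change.
    freq = torch.fft.rfft(latents.float(), dim=2)
    # Average the magnitude over batch, channel, and spatial positions
    # to obtain one energy value per temporal frequency bin.
    return freq.abs().mean(dim=(0, 1, 3, 4))
```

Comparing such profiles across denoising timesteps or attention components would indicate which parts of the model carry low-frequency structure versus high-frequency dynamics, which is the kind of evidence a factorization strategy can build on.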

📝 Abstract
Recent advancements in diffusion frameworks have significantly enhanced video editing, achieving high fidelity and strong alignment with textual prompts. However, conventional approaches using image diffusion models fall short in handling video dynamics, particularly for challenging temporal edits like motion adjustments. While current video diffusion models produce high-quality results, adapting them for efficient editing remains difficult: their heavy computational demands prevent the direct application of previous image-editing techniques. To overcome these limitations, we introduce FADE, a training-free yet highly effective video editing approach that fully leverages the inherent priors of pre-trained video diffusion models via frequency-aware factorization. Rather than simply using these models, we first analyze the attention patterns within the video model to reveal how video priors are distributed across its components. Building on these insights, we propose a factorization strategy that optimizes each component's specialized role. Furthermore, we devise spectrum-guided modulation to refine the sampling trajectory with frequency-domain cues, preventing information leakage and supporting efficient, versatile edits while preserving the basic spatial and temporal structure. Extensive experiments on real-world videos demonstrate that our method consistently delivers high-quality, realistic, and temporally coherent editing results, both qualitatively and quantitatively. Code is available at https://github.com/EternalEvan/FADE.
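To make the spectrum-guided modulation idea concrete, here is a minimal sketch, assuming latents shaped (B, C, T, H, W) and a hypothetical `cutoff` parameter. It blends a source trajectory's low temporal frequencies with an edited trajectory's high frequencies, which captures the general flavor of frequency-domain guidance rather than the paper's exact algorithm.

```python
import torch

def blend_temporal_spectra(edit: torch.Tensor,
                           source: torch.Tensor,
                           cutoff: int) -> torch.Tensor:
    """Replace the edit's lowest `cutoff` temporal-frequency bins with
    the source's, keeping the edit's higher frequencies.

    Tensors are assumed (B, C, T, H, W); `cutoff` is a hypothetical
    knob for how much coarse source motion/layout is retained.
    """
    T = edit.shape[2]
    f_edit = torch.fft.rfft(edit.float(), dim=2)
    f_src = torch.fft.rfft(source.float(), dim=2)
    # Boolean low-pass mask over the T // 2 + 1 temporal bins,
    # broadcast across batch, channel, and spatial dimensions.
    mask = torch.zeros(f_edit.shape[2], dtype=torch.bool, device=edit.device)
    mask[:cutoff] = True
    mask = mask.view(1, 1, -1, 1, 1)
    blended = torch.where(mask, f_src, f_edit)
    return torch.fft.irfft(blended, n=T, dim=2)
```

Applying such a blend at selected denoising steps would preserve coarse layout and motion from the source while letting the edit reshape appearance and detail, which is consistent with the abstract's goal of preventing information leakage.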
Problem

Research questions and friction points this paper is trying to address.

Handling video dynamics in diffusion-based editing
Reducing the computational cost that blocks efficient video edits
Preserving temporal coherence and spatial structure during editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Frequency-aware factorization tailored to video editing
Spectrum-guided sampling modulation in the frequency domain
Training-free use of priors from pre-trained video diffusion models
👥 Authors
Yixuan Zhu
Tsinghua University
Computer Vision

Haolin Wang
Ph.D. Student, Georgia Institute of Technology
infrastructure monitoring · asset management · AI · ML · computer vision

Shilin Ma
Tsinghua Shenzhen International Graduate School, Tsinghua University

Wenliang Zhao
Tsinghua University
Computer Vision · Generative Models

Yansong Tang
Tsinghua Shenzhen International Graduate School, Tsinghua University

Lei Chen
Department of Automation, Tsinghua University

Jie Zhou
Department of Automation, Tsinghua University