FADE: Frequency-Aware Diffusion Model Factorization for Video Editing

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video diffusion models achieve high generation quality but incur substantial computational overhead and offer limited support for efficient temporal editing (e.g., motion retargeting); conversely, image-editing methods lack explicit modeling of video dynamics. To address this, we propose a training-free, frequency-aware factorization framework for diffusion models. We first characterize the temporal priors of pre-trained video diffusion models in the frequency domain; guided by the attention spectrum, we then specialize each model component via frequency-aware factorization and couple it with spectrum-guided sampling modulation. Our method preserves spatiotemporal coherence while significantly improving both the flexibility and fidelity of text-driven video editing. It enables high-quality results on challenging tasks, including motion retargeting on real-world videos, outperforming state-of-the-art image-based transfer and fine-tuning approaches in both qualitative and quantitative evaluations.
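As a rough illustration of the frequency-domain characterization described above, the sketch below computes an average temporal magnitude spectrum of video-diffusion latents. The (B, C, T, H, W) layout and the `temporal_spectrum` helper are assumptions for exposition, not code from the paper.

```python
import torch

def temporal_spectrum(latents: torch.Tensor) -> torch.Tensor:
    """Average temporal magnitude spectrum of video latents.

    `latents` is assumed to be (B, C, T, H, W), the layout common to
    video diffusion backbones; this helper is illustrative only, not
    taken from the FADE codebase.
    """
    # Real FFT along the frame axis yields T // 2 + 1 frequency bins,
    # from static content (bin 0) up to the fastest temporal change.
    freq = torch.fft.rfft(latents.float(), dim=2)
    # Average the magnitude over batch, channel, and spatial positions
    # to obtain one energy value per temporal frequency bin.
    return freq.abs().mean(dim=(0, 1, 3, 4))
```

Comparing such profiles across denoising timesteps or attention components would indicate which parts of the model carry low-frequency structure versus high-frequency dynamics, which is the kind of evidence a factorization strategy can build on.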

📝 Abstract
Recent advancements in diffusion frameworks have significantly enhanced video editing, achieving high fidelity and strong alignment with textual prompts. However, conventional approaches using image diffusion models fall short in handling video dynamics, particularly for challenging temporal edits like motion adjustments. While current video diffusion models produce high-quality results, adapting them for efficient editing remains difficult: their heavy computational demands prevent the direct application of previous image-editing techniques. To overcome these limitations, we introduce FADE, a training-free yet highly effective video editing approach that fully leverages the inherent priors of pre-trained video diffusion models via frequency-aware factorization. Rather than simply using these models, we first analyze the attention patterns within the video model to reveal how video priors are distributed across its components. Building on these insights, we propose a factorization strategy that optimizes each component's specialized role. Furthermore, we devise spectrum-guided modulation to refine the sampling trajectory with frequency-domain cues, preventing information leakage and supporting efficient, versatile edits while preserving the basic spatial and temporal structure. Extensive experiments on real-world videos demonstrate that our method consistently delivers high-quality, realistic, and temporally coherent editing results, both qualitatively and quantitatively. Code is available at https://github.com/EternalEvan/FADE.
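To make the spectrum-guided modulation idea concrete, here is a minimal sketch, assuming latents shaped (B, C, T, H, W) and a hypothetical `cutoff` parameter. It blends a source trajectory's low temporal frequencies with an edited trajectory's high frequencies, which captures the general flavor of frequency-domain guidance rather than the paper's exact algorithm.

```python
import torch

def blend_temporal_spectra(edit: torch.Tensor,
                           source: torch.Tensor,
                           cutoff: int) -> torch.Tensor:
    """Replace the edit's lowest `cutoff` temporal-frequency bins with
    the source's, keeping the edit's higher frequencies.

    Tensors are assumed (B, C, T, H, W); `cutoff` is a hypothetical
    knob for how much coarse source motion/layout is retained.
    """
    T = edit.shape[2]
    f_edit = torch.fft.rfft(edit.float(), dim=2)
    f_src = torch.fft.rfft(source.float(), dim=2)
    # Boolean low-pass mask over the T // 2 + 1 temporal bins,
    # broadcast across batch, channel, and spatial dimensions.
    mask = torch.zeros(f_edit.shape[2], dtype=torch.bool, device=edit.device)
    mask[:cutoff] = True
    mask = mask.view(1, 1, -1, 1, 1)
    blended = torch.where(mask, f_src, f_edit)
    return torch.fft.irfft(blended, n=T, dim=2)
```

Applying such a blend at selected denoising steps would preserve coarse layout and motion from the source while letting the edit reshape appearance and detail, which is consistent with the abstract's goal of preventing information leakage.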
Problem

Research questions and friction points this paper is trying to address.

Handling video dynamics in diffusion-based editing
Reducing the computational cost that blocks efficient video edits
Preserving temporal coherence and spatial structure during editing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Frequency-aware factorization tailored to video editing
Spectrum-guided sampling modulation in the frequency domain
Training-free use of priors from pre-trained video diffusion models
👥 Authors
Yixuan Zhu
Tsinghua University
Computer Vision

Haolin Wang
Ph.D. Student, Georgia Institute of Technology
infrastructure monitoring · asset management · AI · ML · computer vision

Shilin Ma
Tsinghua Shenzhen International Graduate School, Tsinghua University

Wenliang Zhao
Tsinghua University
Computer Vision · Generative Models

Yansong Tang
Tsinghua Shenzhen International Graduate School, Tsinghua University

Lei Chen
Department of Automation, Tsinghua University

Jie Zhou
Department of Automation, Tsinghua University