🤖 AI Summary
To address severe frame flickering, spatiotemporal distortion, and high computational cost in long-video generation, this paper proposes a fine-tuning-free global-local collaborative diffusion framework. Methodologically, we design a frequency-aware noise reinitialization strategy—integrating local shuffling with frequency-domain fusion—and introduce a motion-consistency refinement module that jointly optimizes pixel-level and frequency-domain gradients to unify spatiotemporal denoising trajectories. Our core innovation lies in the first deep integration of frequency-domain modeling into both noise reinitialization and motion optimization, enabling synergistic enhancement of content consistency and inter-frame coherence. Experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches on both visual fidelity and temporal consistency metrics for videos extended to 3× and 6× their original length.
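The summary describes the frequency-aware noise reinitialization only at a high level. The sketch below is a minimal, hypothetical reading of it, assuming the initial noise is shuffled within local temporal windows and then fused with fresh noise in the frequency domain (low frequencies from the shuffled noise for global consistency, high frequencies from fresh noise for diversity). The function name, window size, and cutoff are illustrative assumptions, not the paper's implementation.

```python
import torch

def reinitialize_noise(base_noise: torch.Tensor, window: int = 16, cutoff: float = 0.25) -> torch.Tensor:
    """Hypothetical sketch of local shuffling + frequency-domain fusion.

    base_noise: (T, C, H, W) initial Gaussian noise for the long video.
    """
    T = base_noise.shape[0]

    # Local shuffling: permute frame order independently inside each temporal window
    shuffled = base_noise.clone()
    for start in range(0, T, window):
        end = min(start + window, T)
        perm = torch.randperm(end - start) + start
        shuffled[start:end] = base_noise[perm]

    # Frequency-domain fusion: keep low spatial frequencies of the shuffled noise
    # and high frequencies of freshly sampled noise
    fresh = torch.randn_like(base_noise)
    shuffled_f = torch.fft.fftn(shuffled, dim=(-2, -1))
    fresh_f = torch.fft.fftn(fresh, dim=(-2, -1))

    H, W = base_noise.shape[-2:]
    fy = torch.fft.fftfreq(H).abs().view(H, 1)
    fx = torch.fft.fftfreq(W).abs().view(1, W)
    low_pass = ((fy <= cutoff) & (fx <= cutoff)).to(base_noise.dtype)

    fused_f = shuffled_f * low_pass + fresh_f * (1.0 - low_pass)
    return torch.fft.ifftn(fused_f, dim=(-2, -1)).real
```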
📝 Abstract
Creating high-fidelity, coherent long videos is a long-sought goal. While recent video diffusion models have shown promising potential, they still grapple with spatiotemporal inconsistencies and high computational resource demands. We propose GLC-Diffusion, a tuning-free method for long video generation. It models the long video denoising process by establishing denoising trajectories through Global-Local Collaborative Denoising to ensure overall content consistency and temporal coherence between frames. Additionally, we introduce a Noise Reinitialization strategy that combines local noise shuffling with frequency fusion to improve global content consistency and visual diversity. Further, we propose a Video Motion Consistency Refinement (VMCR) module that computes the gradient of pixel-wise and frequency-wise losses to enhance visual consistency and temporal smoothness. Extensive experiments, including quantitative and qualitative evaluations on videos of varying lengths (e.g., 3× and 6× longer), demonstrate that our method effectively integrates with existing video diffusion models, producing coherent, high-fidelity long videos superior to previous approaches.
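The abstract states only that VMCR computes gradients of pixel-wise and frequency-wise losses. The following is a minimal, hypothetical sketch of such a refinement step, assuming a simple frame-difference pixel loss, a spectral-magnitude frequency loss, and plain gradient-based optimization; the loss weights, optimizer, and step count are assumptions, not the paper's settings.

```python
import torch

def vmcr_refine(frames: torch.Tensor, lambda_pix: float = 1.0,
                lambda_freq: float = 0.5, lr: float = 0.1, steps: int = 10) -> torch.Tensor:
    """Hypothetical VMCR-style refinement: gradient descent on a combined
    pixel-wise and frequency-wise consistency loss between adjacent frames.

    frames: (T, C, H, W) decoded or latent video frames.
    """
    x = frames.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)

    for _ in range(steps):
        opt.zero_grad()

        # Pixel-wise loss: penalize abrupt changes between consecutive frames
        pix_loss = (x[1:] - x[:-1]).pow(2).mean()

        # Frequency-wise loss: align the spectral magnitudes of consecutive frames
        spec = torch.fft.rfftn(x, dim=(-2, -1)).abs()
        freq_loss = (spec[1:] - spec[:-1]).pow(2).mean()

        loss = lambda_pix * pix_loss + lambda_freq * freq_loss
        loss.backward()
        opt.step()

    return x.detach()
```

In this reading, the pixel term smooths temporal transitions directly while the frequency term discourages flicker in texture statistics; the actual module presumably balances these against fidelity to the diffusion model's denoising trajectory.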