🤖 AI Summary
To address severe frame flickering, spatiotemporal distortion, and high computational cost in long-video generation, this paper proposes a fine-tuning-free global-local collaborative diffusion framework. Methodologically, we design a frequency-aware noise reinitialization strategy—integrating local shuffling with frequency-domain fusion—and introduce a motion-consistency refinement module that jointly optimizes pixel-level and frequency-domain gradients to unify spatiotemporal denoising trajectories. Our core innovation lies in the first deep integration of frequency-domain modeling into both noise reinitialization and motion optimization, enabling synergistic enhancement of content consistency and inter-frame coherence. Experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches on both visual fidelity and temporal consistency metrics for videos extended to 3× and 6× their original length.
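The summary describes the frequency-aware noise reinitialization only at a high level. The sketch below is a minimal, hypothetical reading of it, assuming the initial noise is shuffled within local temporal windows and then fused with fresh noise in the frequency domain (low frequencies from the shuffled noise for global consistency, high frequencies from fresh noise for diversity). The function name, window size, and cutoff are illustrative assumptions, not the paper's implementation.

```python
import torch

def reinitialize_noise(base_noise: torch.Tensor, window: int = 16, cutoff: float = 0.25) -> torch.Tensor:
    """Hypothetical sketch of local shuffling + frequency-domain fusion.

    base_noise: (T, C, H, W) initial Gaussian noise for the long video.
    """
    T = base_noise.shape[0]

    # Local shuffling: permute frame order independently inside each temporal window
    shuffled = base_noise.clone()
    for start in range(0, T, window):
        end = min(start + window, T)
        perm = torch.randperm(end - start) + start
        shuffled[start:end] = base_noise[perm]

    # Frequency-domain fusion: keep low spatial frequencies of the shuffled noise
    # and high frequencies of freshly sampled noise
    fresh = torch.randn_like(base_noise)
    shuffled_f = torch.fft.fftn(shuffled, dim=(-2, -1))
    fresh_f = torch.fft.fftn(fresh, dim=(-2, -1))

    H, W = base_noise.shape[-2:]
    fy = torch.fft.fftfreq(H).abs().view(H, 1)
    fx = torch.fft.fftfreq(W).abs().view(1, W)
    low_pass = ((fy <= cutoff) & (fx <= cutoff)).to(base_noise.dtype)

    fused_f = shuffled_f * low_pass + fresh_f * (1.0 - low_pass)
    return torch.fft.ifftn(fused_f, dim=(-2, -1)).real
```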
📝 Abstract
Creating high-fidelity, coherent long videos is a long-sought goal. While recent video diffusion models have shown promising potential, they still grapple with spatiotemporal inconsistencies and high computational resource demands. We propose GLC-Diffusion, a tuning-free method for long video generation. It models the long video denoising process by establishing denoising trajectories through Global-Local Collaborative Denoising to ensure overall content consistency and temporal coherence between frames. Additionally, we introduce a Noise Reinitialization strategy that combines local noise shuffling with frequency fusion to improve global content consistency and visual diversity. Further, we propose a Video Motion Consistency Refinement (VMCR) module that computes the gradient of pixel-wise and frequency-wise losses to enhance visual consistency and temporal smoothness. Extensive experiments, including quantitative and qualitative evaluations on videos of varying lengths (e.g., 3× and 6× longer), demonstrate that our method effectively integrates with existing video diffusion models, producing coherent, high-fidelity long videos superior to previous approaches.
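The abstract states only that VMCR computes gradients of pixel-wise and frequency-wise losses. The following is a minimal, hypothetical sketch of such a refinement step, assuming a simple frame-difference pixel loss, a spectral-magnitude frequency loss, and plain gradient-based optimization; the loss weights, optimizer, and step count are assumptions, not the paper's settings.

```python
import torch

def vmcr_refine(frames: torch.Tensor, lambda_pix: float = 1.0,
                lambda_freq: float = 0.5, lr: float = 0.1, steps: int = 10) -> torch.Tensor:
    """Hypothetical VMCR-style refinement: gradient descent on a combined
    pixel-wise and frequency-wise consistency loss between adjacent frames.

    frames: (T, C, H, W) decoded or latent video frames.
    """
    x = frames.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)

    for _ in range(steps):
        opt.zero_grad()

        # Pixel-wise loss: penalize abrupt changes between consecutive frames
        pix_loss = (x[1:] - x[:-1]).pow(2).mean()

        # Frequency-wise loss: align the spectral magnitudes of consecutive frames
        spec = torch.fft.rfftn(x, dim=(-2, -1)).abs()
        freq_loss = (spec[1:] - spec[:-1]).pow(2).mean()

        loss = lambda_pix * pix_loss + lambda_freq * freq_loss
        loss.backward()
        opt.step()

    return x.detach()
```

In this reading, the pixel term smooths temporal transitions directly while the frequency term discourages flicker in texture statistics; the actual module presumably balances these against fidelity to the diffusion model's denoising trajectory.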