🤖 AI Summary
Existing diffusion models face significant challenges in generating long videos: training for long sequences is computationally prohibitive, while training-free extensions of short-video models suffer from insufficient motion dynamics and degraded fidelity. To address these issues, the paper proposes Brick-Diffusion, a training-free approach for generating videos of arbitrary length. Its core is a brick-to-wall denoising strategy: the long-video latent is denoised in fixed-length segments ("bricks"), and a stride is applied in subsequent iterations so that segment boundaries are staggered across denoising steps, like the courses of a brick wall. This staggering lets information propagate between neighboring segments, overcoming the fixed-length limitation of pretrained short-video diffusion models and improving inter-frame consistency and motion dynamics. Quantitative and qualitative evaluations show that Brick-Diffusion outperforms existing baseline methods in generating high-fidelity long videos, all without retraining or architectural modification.
📝 Abstract
Recent advances in diffusion models have greatly improved text-driven video generation. However, training models for long video generation demands significant computational power and extensive data, so most video diffusion models are limited to a small number of frames. Existing training-free methods that attempt to generate long videos with pre-trained short-video diffusion models often struggle with insufficient motion dynamics and degraded video fidelity. In this paper, we present Brick-Diffusion, a novel, training-free approach capable of generating long videos of arbitrary length. Our method introduces a brick-to-wall denoising strategy: the latent is denoised in segments, with a stride applied in subsequent iterations. This process mimics the construction of a staggered brick wall, where each brick represents a denoised segment, enabling communication between frames and improving overall video quality. Through quantitative and qualitative evaluations, we demonstrate that Brick-Diffusion outperforms existing baseline methods in generating high-fidelity videos.
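The brick-to-wall schedule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `denoise_step` stands in for one denoising step of a pretrained short-video model, and the brick length and stride values are hypothetical placeholders.

```python
import numpy as np

def brick_to_wall_denoise(latent, denoise_step, num_steps,
                          brick_len=16, stride=8):
    """Sketch of brick-to-wall denoising (hypothetical API).

    latent: array of shape (T, C, H, W) holding T frames of video latent.
    denoise_step(segment, t): one denoising step of a short-video model
    applied to a brick of at most brick_len frames.

    On alternating iterations, the brick boundaries are shifted by
    `stride`, so brick edges from one step fall inside bricks of the
    next step -- like staggered courses in a brick wall -- letting
    information flow between neighboring segments.
    """
    T = latent.shape[0]
    for t in range(num_steps):
        # Stagger every other "row" of bricks by the stride.
        offset = stride if t % 2 == 1 else 0
        start = 0
        if offset:
            # Leading partial brick created by the offset.
            latent[:offset] = denoise_step(latent[:offset], t)
            start = offset
        while start < T:
            end = min(start + brick_len, T)
            latent[start:end] = denoise_step(latent[start:end], t)
            start = end
    return latent
```

Because the seams between bricks move from step to step, no frame boundary is frozen across the whole denoising trajectory, which is what allows a model trained on short clips to produce a temporally coherent long video.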