🤖 AI Summary
To address the high inference latency and excessive GPU memory consumption of DiT-based video diffusion models in long-video generation, this paper proposes a dual-parallel efficient inference framework that jointly exploits inter-frame and inter-layer distributed computation. Key innovations include block-wise iterative denoising, cross-GPU feature caching, and globally consistent noise initialization, which together overcome the limitations of conventional sequential denoising. The framework enables artifact-free, temporally coherent generation of arbitrarily long high-resolution videos. Evaluated on an 8×RTX 4090 system, it generates 1,025-frame videos with a 6.54× reduction in inference latency and a 1.48× decrease in GPU memory usage, while preserving state-of-the-art visual quality and temporal consistency. To the authors' knowledge, this is the first work to achieve minute-scale, high-definition video generation via efficient DiT inference, establishing a scalable, low-overhead paradigm for practical long-video synthesis.
📝 Abstract
Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs. However, a naive implementation of this division faces a key limitation: diffusion models require synchronized noise levels across frames, which serializes the intended parallelism. We handle this with a block-wise denoising scheme. Namely, we process a sequence of frame blocks through the pipeline with progressively decreasing noise levels. Each GPU handles a specific block and layer subset while passing previous results to the next GPU, enabling asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. First, a feature cache on each GPU stores and reuses features from the prior block as context, minimizing inter-GPU communication and redundant computation. Second, we employ a coordinated noise-initialization strategy that shares initial noise patterns across GPUs without extra resource costs, ensuring globally consistent temporal dynamics. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54$\times$ lower latency and 1.48$\times$ lower memory cost on 8$\times$RTX 4090 GPUs.
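The scheduling idea behind the block-wise pipeline can be illustrated with a toy model. The sketch below is not the paper's implementation; it is a hypothetical simulation (function name `pipeline_schedule` and its parameters are invented) of how frame blocks can flow through a chain of GPUs so that, after a short warm-up, several blocks at different noise levels are processed concurrently instead of serially.

```python
def pipeline_schedule(num_blocks: int, num_gpus: int):
    """Toy schedule for block-wise pipelined denoising.

    Block b enters the pipeline at tick b and visits GPU g at tick b + g,
    so at any tick up to ``num_gpus`` blocks are in flight simultaneously.
    Returns, per tick, the list of active (block, gpu) pairs.
    """
    total_ticks = num_blocks + num_gpus - 1  # warm-up + steady state + drain
    schedule = []
    for t in range(total_ticks):
        # A block is active on GPU (t - b) if that stage index is valid.
        active = [(b, t - b) for b in range(num_blocks) if 0 <= t - b < num_gpus]
        schedule.append(active)
    return schedule


if __name__ == "__main__":
    # 4 frame blocks over 3 pipeline stages: the serial cost would be
    # 4 * 3 = 12 stage-steps, but the pipeline finishes in 6 ticks.
    for t, active in enumerate(pipeline_schedule(num_blocks=4, num_gpus=3)):
        print(f"tick {t}: {active}")
```

At tick 2 the pipeline is full: blocks 0, 1, and 2 occupy GPUs 2, 1, and 0 at the same time, which is the asynchronous overlap the abstract describes.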