🤖 AI Summary
To address the high inference latency and excessive GPU memory consumption of DiT-based video diffusion models in long-video generation, this paper proposes a dual-parallel efficient inference framework that jointly exploits inter-frame and inter-layer distributed computation. Key innovations include block-wise iterative denoising, cross-GPU feature caching, and globally consistent noise initialization, which together overcome the limitations of conventional sequential denoising. The framework enables artifact-free, temporally coherent generation of arbitrarily long high-resolution videos. Evaluated on an 8×RTX 4090 system, it generates 1,025-frame videos with a 6.54× reduction in inference latency and a 1.48× decrease in GPU memory usage, while preserving state-of-the-art visual quality and temporal consistency. To the authors' knowledge, this is the first work to achieve minute-scale, high-definition video generation via efficient DiT inference, establishing a scalable, low-overhead paradigm for practical long-video synthesis.
📝 Abstract
Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs. However, a naive implementation of this division faces a key limitation: diffusion models require synchronized noise levels across frames, which serializes the intended parallelism. We handle this with a block-wise denoising scheme. Namely, we process a sequence of frame blocks through the pipeline with progressively decreasing noise levels. Each GPU handles a specific block and layer subset while passing previous results to the next GPU, enabling asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. First, a feature cache on each GPU stores and reuses features from the prior block as context, minimizing inter-GPU communication and redundant computation. Second, we employ a coordinated noise-initialization strategy that shares initial noise patterns across GPUs without extra resource costs, ensuring globally consistent temporal dynamics. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54$\times$ lower latency and 1.48$\times$ lower memory cost on 8$\times$RTX 4090 GPUs.
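The scheduling idea behind the block-wise pipeline can be illustrated with a toy model. The sketch below is not the paper's implementation; it is a hypothetical simulation (function name `pipeline_schedule` and its parameters are invented) of how frame blocks can flow through a chain of GPUs so that, after a short warm-up, several blocks at different noise levels are processed concurrently instead of serially.

```python
def pipeline_schedule(num_blocks: int, num_gpus: int):
    """Toy schedule for block-wise pipelined denoising.

    Block b enters the pipeline at tick b and visits GPU g at tick b + g,
    so at any tick up to ``num_gpus`` blocks are in flight simultaneously.
    Returns, per tick, the list of active (block, gpu) pairs.
    """
    total_ticks = num_blocks + num_gpus - 1  # warm-up + steady state + drain
    schedule = []
    for t in range(total_ticks):
        # A block is active on GPU (t - b) if that stage index is valid.
        active = [(b, t - b) for b in range(num_blocks) if 0 <= t - b < num_gpus]
        schedule.append(active)
    return schedule


if __name__ == "__main__":
    # 4 frame blocks over 3 pipeline stages: the serial cost would be
    # 4 * 3 = 12 stage-steps, but the pipeline finishes in 6 ticks.
    for t, active in enumerate(pipeline_schedule(num_blocks=4, num_gpus=3)):
        print(f"tick {t}: {active}")
```

At tick 2 the pipeline is full: blocks 0, 1, and 2 occupy GPUs 2, 1, and 0 at the same time, which is the asynchronous overlap the abstract describes.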