AI Summary
Diffusion models suffer from high inference latency due to their inherently sequential denoising process, while multi-device parallelization incurs substantial communication overhead and deploys poorly on commercial hardware. This paper proposes ParaStep, a novel step-level parallelization framework that exploits the similarity of latent states across adjacent denoising steps to enable lightweight, reuse-then-predict parallel denoising. ParaStep eliminates conventional layer- or stage-level synchronization and instead employs a minimalist step-level communication protocol. Furthermore, it introduces a heterogeneous parallel scheduling scheme tailored to SVD, CogVideoX-2b, and AudioLDM2-large. Evaluated on these three representative cross-modal generative models, ParaStep achieves end-to-end speedups of 3.88×, 2.43×, and 6.56×, respectively, with significantly reduced communication overhead and no degradation in generation fidelity. To the best of our knowledge, this is the first work to achieve efficient, fidelity-preserving, cross-device, step-level parallelism for both video and audio generation.
Abstract
Diffusion models have emerged as a powerful class of generative models across various modalities, including image, video, and audio synthesis. However, their deployment is often limited by significant inference latency, primarily due to the inherently sequential nature of the denoising process. While existing parallelization strategies attempt to accelerate inference by distributing computation across multiple devices, they typically incur high communication overhead, hindering deployment on commercial hardware. To address this challenge, we propose **ParaStep**, a novel parallelization method based on a reuse-then-predict mechanism that parallelizes diffusion inference by exploiting similarity between adjacent denoising steps. Unlike prior approaches that rely on layer-wise or stage-wise communication, ParaStep employs lightweight, step-wise communication, substantially reducing overhead. ParaStep achieves end-to-end speedups of up to **3.88**× on SVD, **2.43**× on CogVideoX-2b, and **6.56**× on AudioLDM2-large, while maintaining generation quality. These results highlight ParaStep as a scalable and communication-efficient solution for accelerating diffusion inference, particularly in bandwidth-constrained environments.
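To make the reuse-then-predict idea concrete, here is a minimal single-process sketch of the intuition: because the noise predictor's output changes little between adjacent denoising steps, a step can reuse the previous step's prediction instead of recomputing it, which is what would let a second device run that step in parallel. Everything here is a toy assumption, not ParaStep's actual algorithm: `predict_noise` is a stand-in for the diffusion model, the update rule is a simplified Euler-style step, and reusing on every odd step is an arbitrary schedule for illustration.

```python
import numpy as np

def predict_noise(latent, t):
    # Toy stand-in (hypothetical) for the diffusion model's noise predictor;
    # it varies slowly with t, mimicking the cross-step similarity that
    # reuse-then-predict exploits.
    return 0.1 * latent + 0.01 * np.sin(0.1 * t)

def sequential_denoise(latent, num_steps):
    # Baseline: every step runs the full noise predictor in order.
    for t in range(num_steps):
        latent = latent - 0.5 * predict_noise(latent, t)
    return latent

def reuse_then_predict_denoise(latent, num_steps):
    # Sketch: odd steps reuse the previous step's noise prediction.
    # In a multi-device setting, those reused steps are the ones another
    # device could compute concurrently, with only the small step-level
    # latent exchanged between devices.
    prev_noise = None
    for t in range(num_steps):
        if prev_noise is not None and t % 2 == 1:
            noise = prev_noise  # reuse instead of recomputing
        else:
            noise = predict_noise(latent, t)
        prev_noise = noise
        latent = latent - 0.5 * noise
    return latent

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
ref = sequential_denoise(x0.copy(), 20)
fast = reuse_then_predict_denoise(x0.copy(), 20)
rel_err = np.linalg.norm(fast - ref) / np.linalg.norm(ref)
```

In this toy setting the relative deviation from the fully sequential result stays small, illustrating why skipping recomputation on alternate steps can halve the sequential depth with little loss; the real method's fidelity guarantees come from the paper's evaluation, not from this sketch.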