🤖 AI Summary
To address load imbalance and low resource utilization in diffusion model inference on heterogeneous multi-GPU systems, this paper proposes STADI, a spatio-temporal adaptive diffusion inference framework. Methodologically, STADI introduces: (1) computation-aware dynamic denoising-step allocation, enabling fine-grained temporal parallelism via least-common-multiple quantization; (2) elastic image tiling and variable-size spatial parallelism to accommodate GPU computational heterogeneity; and (3) a hybrid spatio-temporal scheduler that jointly optimizes execution across both dimensions. Evaluated on diverse heterogeneous GPU clusters, STADI reduces end-to-end inference latency by up to 45% compared to existing tiling-based parallel baselines. It effectively mitigates performance bottlenecks arising from hardware disparities and background system interference, thereby improving overall throughput and GPU utilization.
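The paper does not publish its allocator, but the idea behind component (1) can be sketched: assign each GPU a share of the denoising steps proportional to its speed, then snap those counts to integers whose least common multiple is small, so that synchronization points across GPUs align cheaply. The function below is an illustrative approximation under that assumption; the name `allocate_steps`, the slack-window search, and the imbalance tie-break are ours, not STADI's.

```python
from math import lcm
from itertools import product

def allocate_steps(total_steps, speeds, slack=2):
    """Split `total_steps` denoising steps across GPUs with given relative
    `speeds`, preferring allocations with a small LCM (aligned sync points)."""
    total_speed = sum(speeds)
    # ideal (fractional) proportional allocation per GPU
    ideal = [total_steps * s / total_speed for s in speeds]
    # search integer counts within +/- slack of the ideal values
    candidates = [range(max(1, round(x) - slack), round(x) + slack + 1)
                  for x in ideal]
    best = None
    for alloc in product(*candidates):
        if sum(alloc) != total_steps:
            continue
        # primary objective: small LCM; tie-break: low per-GPU time imbalance
        times = [a / s for a, s in zip(alloc, speeds)]
        key = (lcm(*alloc), max(times) - min(times))
        if best is None or key < best[0]:
            best = (key, alloc)
    return list(best[1])
```

For example, with 50 steps and one GPU twice as fast as two others, the proportional ideal (25, 12.5, 12.5) is quantized to (26, 12, 12), whose LCM of 156 is far smaller than that of naive roundings such as (25, 13, 12).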
📝 Abstract
The escalating adoption of diffusion models for applications such as image generation demands efficient parallel inference techniques to manage their substantial computational cost. However, existing parallel diffusion inference schemes often underutilize resources in heterogeneous multi-GPU environments, where varying hardware capabilities or background tasks cause workload imbalance. This paper introduces Spatio-Temporal Adaptive Diffusion Inference (STADI), a novel framework to accelerate diffusion model inference in such settings. At its core is a hybrid scheduler that orchestrates fine-grained parallelism across both temporal and spatial dimensions. Temporally, STADI introduces a novel computation-aware step allocator applied after warmup phases, using a least-common-multiple-minimizing quantization technique to reduce the denoising steps assigned to slower GPUs and to lower synchronization overhead. To further minimize GPU idle periods, STADI employs an elastic patch parallelism mechanism that allocates variably sized image patches to GPUs according to their computational capability, serving as a complementary spatial mechanism for balanced workload distribution. Extensive experiments on both load-imbalanced and heterogeneous multi-GPU clusters validate STADI's efficacy, demonstrating improved load balancing and mitigation of performance bottlenecks. Compared to patch parallelism, a state-of-the-art diffusion inference framework, our method reduces end-to-end inference latency by up to 45% and significantly improves resource utilization on heterogeneous GPUs.
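The elastic patch mechanism described above can likewise be approximated with a short sketch: divide the image rows among GPUs in proportion to their speeds, snapped to the model's spatial granularity (e.g., the latent downsampling factor of a diffusion U-Net's VAE, commonly 8). The helper name `split_rows` and the largest-remainder rounding are our illustrative choices, not the paper's implementation.

```python
def split_rows(height, speeds, granularity=8):
    """Partition `height` image rows across GPUs proportionally to `speeds`,
    with each share a multiple of `granularity` (the model's spatial unit)."""
    units = height // granularity          # number of indivisible row blocks
    total = sum(speeds)
    shares = [units * s / total for s in speeds]
    alloc = [int(x) for x in shares]       # floor of each proportional share
    # hand leftover blocks to GPUs with the largest fractional remainders
    leftover = units - sum(alloc)
    order = sorted(range(len(speeds)),
                   key=lambda i: shares[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return [a * granularity for a in alloc]
```

For a 1024-row image on two GPUs with a 3:1 speed ratio, this yields patches of 768 and 256 rows, so both devices finish their patch in roughly the same wall-clock time.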