🤖 AI Summary
This work addresses the challenges of deploying diffusion models at scale, including GPU memory constraints, imbalanced compute and memory loads, and the overhead and scheduling fragility introduced by staged execution across heterogeneous GPUs. To overcome these limitations, the authors propose decoupling the encoder, DiT, and decoder into independent services and organizing them into an asynchronous pipeline-parallel architecture. They further design an elastic scheduling strategy that integrates performance prediction with runtime feedback to enable efficient coordination across heterogeneous hardware. Compared to monolithic deployment, the proposed system achieves 3.4–20.5× higher throughput and up to 18.5× lower end-to-end latency, substantially improving deployment flexibility and resource efficiency.
📝 Abstract
Diffusion-based generation increasingly powers production content pipelines; however, deploying these models at scale remains a significant challenge. Model weights frequently exceed the memory capacity of commodity GPUs, while the encoder, diffusion transformer (DiT), and decoder stages exhibit highly imbalanced computational and memory footprints. A natural remedy is disaggregated serving, in which the stages run as separate services on heterogeneous GPUs. Yet this introduces new bottlenecks, including stage handoff overheads and fast-changing workloads that make cross-stage provisioning and scheduling brittle.
This paper presents DisagFusion, a system that enables asynchronous pipeline parallelism and elastic scheduling for disaggregated diffusion serving. First, DisagFusion introduces asynchronous pipeline parallelism that overlaps computation with stage-to-stage communication to reduce pipeline bubbles and mitigate network jitter. Second, DisagFusion employs a hybrid instance scheduling strategy that combines lightweight performance prediction with runtime feedback to continuously rebalance the instance ratio across stages as the workload shifts. We implement DisagFusion and evaluate it with modern diffusion models. Compared to a monolithic baseline, DisagFusion improves throughput by 3.4× to 20.5× and reduces end-to-end latency by 18.5×, while enabling flexible, cost-efficient deployment across heterogeneous GPUs.
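The abstract does not spell out the scheduling algorithm, so the following is only a minimal Python sketch of what "prediction plus runtime feedback" could look like for balancing instances across the encoder, DiT, and decoder stages. Every name here (`predict_allocation`, `rebalance`, the `min_gap` threshold, the per-stage latencies and queue depths) is a hypothetical illustration, not DisagFusion's actual interface or policy.

```python
# Illustrative sketch only: a toy hybrid scheduler, NOT the DisagFusion algorithm.
# All function names, parameters, and thresholds are assumptions.


def predict_allocation(predicted_ms: dict[str, float], total_instances: int) -> dict[str, int]:
    """Prediction step: size each stage from its predicted per-request time.

    Every stage starts with one instance; each spare instance then goes to the
    stage whose predicted time per instance is currently the largest, i.e. the
    predicted bottleneck.
    """
    if total_instances < len(predicted_ms):
        raise ValueError("need at least one instance per stage")
    alloc = {stage: 1 for stage in predicted_ms}
    for _ in range(total_instances - len(alloc)):
        bottleneck = max(alloc, key=lambda s: predicted_ms[s] / alloc[s])
        alloc[bottleneck] += 1
    return alloc


def rebalance(alloc: dict[str, int], queue_depth: dict[str, int], min_gap: float = 2.0) -> dict[str, int]:
    """Feedback step: shift one instance toward the most backlogged stage.

    Backlog is measured as queued requests per instance. An instance moves
    from the least loaded donor stage (one that can spare an instance) to the
    most loaded stage, but only if the gap is at least `min_gap`, which keeps
    the allocation from oscillating on small fluctuations.
    """
    load = {stage: queue_depth[stage] / alloc[stage] for stage in alloc}
    hot = max(load, key=load.get)
    donors = [s for s in alloc if s != hot and alloc[s] > 1]
    if not donors:
        return dict(alloc)
    cold = min(donors, key=load.get)
    if load[hot] - load[cold] < min_gap:
        return dict(alloc)
    updated = dict(alloc)
    updated[hot] += 1
    updated[cold] -= 1
    return updated
```

With hypothetical predicted stage times of 50 ms (encoder), 900 ms (DiT), and 280 ms (decoder) and eight instances, `predict_allocation` assigns most instances to the DiT stage. If the decoder queue later grows well beyond the others, `rebalance` moves one instance from the DiT stage to the decoder. A production scheduler would also need to account for GPU heterogeneity and the cost of moving an instance, which this sketch ignores.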