GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads

📅 2026-04-05
🤖 AI Summary
This work addresses the challenge of co-serving heterogeneous diffusion models, such as text-to-image (T2I) and text-to-video (T2V), on shared GPU clusters, where large differences in computational demands, parallelization characteristics, and latency requirements often lead to severe violations of service-level objectives (SLOs). Exploiting the step-level predictability and preemptibility inherent in diffusion inference, the authors propose a co-optimization framework that combines intelligent video preemption, elastic sequence parallelism, dynamic batching, and an SLO-aware scheduler to allocate resources across concurrent requests. The system improves the SLO attainment rate by up to 44% over the strongest baseline across diverse configurations, while also improving resource utilization and service quality.
📝 Abstract
Diffusion models have emerged as the prevailing approach for text-to-image (T2I) and text-to-video (T2V) generation, yet production platforms must increasingly serve both modalities on shared GPU clusters while meeting stringent latency SLOs. Co-serving such heterogeneous workloads is challenging: T2I and T2V requests exhibit vastly different compute demands, parallelism characteristics, and latency requirements, leading to significant SLO violations in existing serving systems. We present GENSERVE, a co-serving system that leverages the inherent predictability of the diffusion process to optimize serving efficiency. A central insight is that diffusion inference proceeds in discrete, predictable steps and is naturally preemptible at step boundaries, opening a new design space for heterogeneity-aware resource management. GENSERVE introduces step-level resource adaptation through three coordinated mechanisms: intelligent video preemption, elastic sequence parallelism with dynamic batching, and an SLO-aware scheduler that jointly optimizes resource allocation across all concurrent requests. Experimental results show that GENSERVE improves the SLO attainment rate by up to 44% over the strongest baseline across diverse configurations.
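The abstract's central observation, that diffusion inference runs in discrete steps and can be paused at step boundaries, can be illustrated with a toy single-GPU simulation. This is only a sketch under stated assumptions and not GENSERVE's scheduler: the `Job` fields, the constant per-step latency, and the least-slack-first policy are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    arrival: float      # time the request arrives (seconds)
    total_steps: int    # denoising steps the request needs
    step_time: float    # assumed constant, known latency per step
    deadline: float     # hypothetical absolute SLO deadline
    done_steps: int = 0

    def slack(self, now: float) -> float:
        # Time to spare if the job ran uninterrupted from now on.
        remaining = (self.total_steps - self.done_steps) * self.step_time
        return self.deadline - now - remaining


def run_single_gpu(jobs):
    """Execute one step at a time; re-pick the job at every step boundary."""
    now, trace, finished = 0.0, [], {}
    pending = list(jobs)
    while pending:
        ready = [j for j in pending if j.arrival <= now]
        if not ready:
            # GPU is idle: jump to the next arrival.
            now = min(j.arrival for j in pending)
            continue
        # Least-slack-first: the most urgent ready job gets the next step.
        job = min(ready, key=lambda j: j.slack(now))
        job.done_steps += 1          # stand-in for one real denoising step
        now += job.step_time
        trace.append(job.name)
        if job.done_steps == job.total_steps:
            finished[job.name] = now
            pending.remove(job)
    return trace, finished


video = Job("video", arrival=0.0, total_steps=4, step_time=1.0, deadline=10.0)
image = Job("image", arrival=1.5, total_steps=2, step_time=0.5, deadline=3.5)
trace, finished = run_single_gpu([video, image])
print(trace)
print(finished)
```

In this example the image request arrives while the video is mid-generation. The video is paused after its second step, the image runs to completion within its deadline, and the video then resumes from the step where it stopped rather than restarting.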
Problem

Research questions and friction points this paper is trying to address.

diffusion models
heterogeneous workloads
co-serving
latency SLOs
GPU clusters
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion models
co-serving
step-level preemption
elastic sequence parallelism
SLO-aware scheduling
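As a rough illustration of how elastic sequence parallelism and SLO-aware scheduling could interact, the sketch below picks the smallest GPU count whose predicted finish time still meets a request's deadline. The function name `pick_sp_degree`, the candidate degrees, and the linear-with-efficiency speedup model are assumptions made for this example; the paper's actual cost model and optimizer are not described in this summary.

```python
def pick_sp_degree(remaining_steps, step_time_1gpu, time_left, free_gpus,
                   efficiency=0.8, candidates=(1, 2, 4, 8)):
    """Return the smallest sequence-parallel degree that meets the deadline.

    Returns None when no feasible degree fits in the free GPUs.
    """
    for degree in candidates:
        if degree > free_gpus:
            break
        # Hypothetical scaling model: each extra GPU adds `efficiency`
        # of a full GPU's throughput (communication overhead eats the rest).
        speedup = 1.0 + efficiency * (degree - 1)
        predicted = remaining_steps * step_time_1gpu / speedup
        if predicted <= time_left:
            return degree
    return None


# A video with 30 steps left at 2.0 s/step on one GPU and 25 s of budget.
print(pick_sp_degree(30, 2.0, time_left=25.0, free_gpus=8))
```

Because per-step latency is predictable, a check like this could be repeated at each step boundary, widening a request that is falling behind or narrowing one that has slack to spare.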
Authors

Fanjiang Ye, Rice University
Zhangke Li, Rice University
Xinrui Zhong, Rice University
Ethan Ma, Rice University
Russell Chen, Rice University
Kaijian Wang, Rice University
Jingwei Zuo, Rice University
Desen Sun, University of Waterloo
Ye Cao, Independent Researcher
Triston Cao, NVIDIA
Myungjin Lee, Cisco Systems (Networking, Systems)
Arvind Krishnamurthy, Short-Dooley Professor, Univ. of Washington (Distributed Systems, Networking, Systems, Performance evaluation)
Yuke Wang, Assistant Professor, Rice University (System for Machine Learning, High-performance Computing)