🤖 AI Summary
To address inefficiencies in the static deployment of diffusion models—stemming from coarse-grained resource allocation and the neglect of stage- and request-level heterogeneity—this paper proposes the first dynamic, stage-aware serving paradigm. The approach jointly optimizes stage-level resource scheduling and request dispatching to enable fine-grained, load-aware, on-demand resource allocation, overcoming the limitations of conventional uniform, pipeline-wide configuration. Key innovations include: (i) dynamic placement planning, (ii) stage-level resource modeling, and (iii) a load-driven real-time scheduling mechanism. Extensive experiments demonstrate that, compared to state-of-the-art systems, the method significantly improves SLO compliance, reduces average latency by up to 2.5×, and lowers P95 latency by up to 3.6–4.1×.
📝 Abstract
Diffusion pipelines, renowned for their powerful visual generation capabilities, have seen widespread adoption in generative vision tasks (e.g., text-to-image/video). These pipelines typically follow a three-stage encode–diffuse–decode architecture. Current serving systems deploy diffusion pipelines within a static, manual, pipeline-level paradigm, allocating the same resources to every request and stage. However, through an in-depth analysis, we find that such a paradigm is inefficient due to the discrepancy in resource needs across the three stages of each request, as well as across different requests. Following this analysis, we propose a dynamic, stage-level serving paradigm and develop TridentServe, a new diffusion serving system. TridentServe automatically and dynamically derives the placement plan (i.e., how each stage is placed on resources) for pipeline deployment and the dispatch plan (i.e., how requests are routed) for request processing, co-optimizing resource allocation for both the model and the requests. Extensive experiments show that TridentServe consistently improves SLO attainment and reduces average and P95 latencies by up to 2.5× and 3.6×–4.1×, respectively, over existing works across a variety of workloads.
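To make the contrast with uniform pipeline-wide allocation concrete, here is a minimal, hypothetical sketch of stage-level, load-aware GPU allocation. The function name, the load numbers, and the proportional largest-remainder heuristic are all illustrative assumptions for exposition; they are not TridentServe's actual placement algorithm.

```python
# Toy model of stage-level (vs. uniform pipeline-level) GPU allocation.
# All names and numbers here are illustrative assumptions, not TridentServe's API.

def stage_allocation(stage_loads, total_gpus):
    """Allocate GPUs to stages proportionally to their measured load,
    giving every stage at least one GPU (largest-remainder rounding)."""
    total_load = sum(stage_loads.values())
    # Reserve one GPU per stage, then split the remainder by load share.
    alloc = {s: 1 for s in stage_loads}
    spare = total_gpus - len(stage_loads)
    shares = {s: spare * load / total_load for s, load in stage_loads.items()}
    for s in stage_loads:
        alloc[s] += int(shares[s])
    # Hand out leftover GPUs to the stages with the largest fractional share.
    leftovers = sorted(stage_loads,
                       key=lambda s: shares[s] - int(shares[s]),
                       reverse=True)
    for s in leftovers[: total_gpus - sum(alloc.values())]:
        alloc[s] += 1
    return alloc

# In a typical text-to-image pipeline the diffuse stage dominates compute,
# so a load-aware plan skews GPUs toward it instead of splitting 8 GPUs
# uniformly across the three stages.
loads = {"encode": 1.0, "diffuse": 10.0, "decode": 2.0}
print(stage_allocation(loads, 8))  # → {'encode': 1, 'diffuse': 5, 'decode': 2}
```

A dynamic system would re-run a decision like this as the observed per-stage loads shift with the request mix, rather than fixing one configuration at deployment time.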