DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing

📅 2025-11-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
In LLM inference services, severe GPU resource interference arises when prefill and decode co-execute on shared GPUs, degrading time-between-tokens (TBT) latency and preventing high throughput and low-latency SLOs from being met simultaneously. This paper proposes an adaptive Streaming Multiprocessor (SM)-level GPU multiplexing framework. It introduces an attention-aware roofline model for precise per-iteration latency prediction, a fine-grained, demand-driven SM spatial partitioning mechanism, and a CPU-GPU synchronization-free, interruption-transparent execution engine to dynamically decouple prefill and decode within a single GPU. Evaluated against state-of-the-art systems under stringent latency SLOs, the approach achieves up to 1.3x higher throughput while maintaining stable generation latency, significantly reducing redundant KV cache transfers and scheduling overhead.

📝 Abstract
Modern LLM serving systems must sustain high throughput while meeting strict latency SLOs across two distinct inference phases: the compute-intensive prefill phase and the memory-bound decode phase. Existing approaches either (1) aggregate both phases on shared GPUs, leading to interference between prefill and decode that degrades time-between-tokens (TBT); or (2) disaggregate the two phases across GPUs, improving latency but wasting resources through duplicated models and KV cache transfers. We present DuetServe, a unified LLM serving framework that achieves disaggregation-level isolation within a single GPU. DuetServe operates in aggregated mode by default and dynamically activates SM-level GPU spatial multiplexing when TBT degradation is predicted. Its key idea is to decouple prefill and decode execution through fine-grained, adaptive SM partitioning that provides phase isolation only when contention threatens latency service-level objectives (SLOs). DuetServe integrates (1) an attention-aware roofline model to forecast iteration latency, (2) a partitioning optimizer that selects the SM split maximizing throughput under TBT constraints, and (3) an interruption-free execution engine that eliminates CPU-GPU synchronization overhead. Evaluations show that DuetServe improves total throughput by up to 1.3x while maintaining low generation latency compared to state-of-the-art frameworks.
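The abstract's roofline-based latency forecast can be illustrated with a small sketch. The cost model below is a generic roofline (each kernel is bounded by the slower of its compute and memory time, with attention treated as memory-bound at decode because it streams the KV cache); the function names, FLOP/byte counts, and hardware constants are illustrative assumptions, not the paper's actual model.

```python
# Roofline-style per-iteration decode latency estimate (illustrative only).
# A kernel's time is lower-bounded by max(compute time, memory time).

def roofline_time(flops, bytes_moved, peak_flops, peak_bw):
    """Time (s) for a kernel bounded by compute or memory bandwidth."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

def decode_iter_latency(batch, d_model, ctx_len, n_layers,
                        peak_flops=300e12,   # hypothetical peak FLOP/s
                        peak_bw=2.0e12):     # hypothetical HBM bytes/s
    # GEMM part: roughly 2 FLOPs per weight per token; fp16 weights are
    # streamed once per iteration regardless of batch size.
    gemm_flops = 2 * 12 * d_model * d_model * n_layers * batch
    gemm_bytes = 12 * d_model * d_model * n_layers * 2
    # Attention part: each decode step reads the whole fp16 KV cache,
    # so its traffic (and latency) grows with context length.
    attn_bytes = 2 * 2 * d_model * ctx_len * n_layers * batch
    attn_flops = 2 * 2 * d_model * ctx_len * n_layers * batch
    return (roofline_time(gemm_flops, gemm_bytes, peak_flops, peak_bw)
            + roofline_time(attn_flops, attn_bytes, peak_flops, peak_bw))
```

Under this toy model, decode latency is flat in batch size until the GEMMs turn compute-bound, but always grows with context length, which is why a predictor of this shape can anticipate TBT degradation as sequences lengthen.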
Problem

Research questions and friction points this paper is trying to address.

Balancing prefill and decode phases in LLM serving
Eliminating GPU interference between inference phases
Optimizing throughput while maintaining latency SLOs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive GPU multiplexing for prefill and decode phases
SM-level spatial partitioning to isolate conflicting phases
Attention-aware roofline model for per-iteration latency prediction
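The partitioning optimizer described above can be sketched as a one-dimensional search: give decode the fewest SMs that still keep its predicted TBT under the SLO, and hand the rest to prefill. The predictor callables and the exhaustive-scan formulation are illustrative assumptions; the paper's optimizer may use a different search strategy.

```python
# Illustrative SM-split search: maximize prefill throughput subject to
# the decode TBT SLO. `decode_tbt` and `prefill_tput` stand in for the
# latency/throughput predictors (e.g. roofline-based); both hypothetical.

def choose_sm_split(total_sms, decode_tbt, prefill_tput, tbt_slo):
    """Return (decode_sms, prefill_throughput) or None if infeasible."""
    best = None
    for decode_sms in range(1, total_sms):
        prefill_sms = total_sms - decode_sms
        # Feasibility: predicted decode TBT must meet the SLO.
        if decode_tbt(decode_sms) <= tbt_slo:
            tput = prefill_tput(prefill_sms)
            if best is None or tput > best[1]:
                best = (decode_sms, tput)
    return best
```

For example, with 108 SMs, a toy decode model `100/s` ms and prefill throughput proportional to its SM count, a 10 ms TBT SLO yields a 10/98 decode/prefill split: decode gets exactly enough SMs to hit the SLO, and prefill keeps the remainder.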
🔎 Similar Papers
2024-06-03 · International Conference on Architectural Support for Programming Languages and Operating Systems · Citations: 4
Lei Gao — University of Southern California, Los Angeles, California, USA
Chaoyi Jiang — University of Southern California, Los Angeles, California, USA
Hossein Entezari Zarch — University of Southern California, Los Angeles, California, USA
Daniel Wong — Associate Professor, University of California, Riverside (Computer Architecture, Energy Efficiency, High Performance Computing)
Murali Annavaram — USC (Computer Systems)