Efficient Multi-round LLM Inference over Disaggregated Serving

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of existing large language model (LLM) serving systems in handling interleaved prefill and decode workloads during multi-turn reasoning, which leads to suboptimal resource scheduling and degraded SLO compliance. To tackle this challenge, the authors propose AMPD, a novel framework that introduces an adaptive prefill scheduling mechanism and a joint resource allocation algorithm tailored for architectures that decouple prefill and decoding stages. AMPD dynamically senses workload patterns and co-optimizes the execution placement, parallelization strategy, and resource provisioning across both phases. Experimental results demonstrate that AMPD substantially improves SLO attainment compared to state-of-the-art approaches.

📝 Abstract
With the rapid evolution of Large Language Models (LLMs), multi-round workflows, such as autonomous agents and iterative retrieval, have become increasingly prevalent. However, these workflows pose challenges for serving LLMs under prefill-decode (PD) disaggregation, a widely adopted paradigm that places the compute-bound prefill phase and the memory-bound decode phase on separate resources. Existing systems overlook the interleaved prefill-decode workload pattern in multi-round inference, leading to suboptimal handling of incremental prefill workloads and suboptimal model deployment for the two phases. In this work, we present AMPD, a new disaggregated serving framework for multi-round LLM inference. The core of AMPD is to coordinate prefill workloads based on real-time conditions, adaptively determining where they execute and how they are scheduled in order to maximize service level objective (SLO) attainment. In addition, we tailor a planning algorithm to this scenario that derives optimal resource allocations and parallelization strategies for the two phases. Empirical results demonstrate that AMPD substantially improves SLO attainment compared to state-of-the-art baselines.
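To make the adaptive-placement idea concrete, the following is a minimal sketch of how a scheduler might decide where an incremental (multi-round) prefill should run. This is purely illustrative: the `Workload` fields, the thresholds, and the `place_incremental_prefill` heuristic are all hypothetical assumptions, not AMPD's actual algorithm, which the abstract says senses workloads in real time and co-optimizes placement with resource planning.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    prefill_queue_tokens: int   # tokens waiting in the dedicated prefill pool
    decode_batch_size: int      # active decode requests on the decode instance
    incremental_tokens: int     # new-round prompt tokens for this request

# Hypothetical thresholds; a real system would derive these from
# measured SLO slack (TTFT for prefill, inter-token latency for decode).
PREFILL_BACKLOG_LIMIT = 8192   # tokens
DECODE_SLACK_LIMIT = 16        # largest decode batch that tolerates extra work

def place_incremental_prefill(w: Workload) -> str:
    """Decide where to run an incremental prefill in a PD-disaggregated setup.

    Intuition: a short incremental prefill can piggyback on the decode
    instance (reusing the KV cache already resident there) as long as
    decode latency has slack; otherwise it goes to the prefill pool,
    unless that pool is itself congested.
    """
    if w.incremental_tokens <= 512 and w.decode_batch_size <= DECODE_SLACK_LIMIT:
        return "decode_instance"   # piggyback: avoid KV-cache transfer
    if w.prefill_queue_tokens > PREFILL_BACKLOG_LIMIT:
        return "decode_instance"   # prefill pool congested, shed load
    return "prefill_instance"      # default: dedicated prefill resources
```

The sketch captures the trade-off the abstract describes: routing incremental prefills to decode resources saves KV-cache movement but risks decode SLOs, while routing them to the prefill pool preserves decode latency but can queue behind long first-round prefills.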
Problem

Research questions and friction points this paper is trying to address.

multi-round inference
LLM serving
prefill-decode disaggregation
workload interleaving
SLO attainment
Innovation

Methods, ideas, or system contributions that make the work stand out.

disaggregated serving
multi-round inference
prefill-decode scheduling
adaptive workload coordination
SLO optimization