AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
In augmented large language model (LLM) inference serving, first-come-first-served (FCFS) scheduling causes severe head-of-line blocking, while static batching fails to adapt to dynamic workloads and heterogeneous hardware. To address these challenges, this paper proposes AugServe, an adaptive scheduling framework. Its core innovations are: (1) a two-stage dynamic scheduler that jointly considers request-level inference characteristics (e.g., input/output length, decoding steps) and real-time system state (e.g., GPU memory utilization, queue occupancy); and (2) elastic token-based batching, which dynamically adjusts the batch token budget per iteration to maximize hardware utilization without violating latency SLOs. Extensive experiments show that AugServe achieves 4.7–33.1× and 3.3–13.2× higher effective throughput than vLLM and InferCept, respectively, and reduces time-to-first-token (TTFT) latency by up to 96.3% and 95.0%. It also significantly improves SLO compliance and end-to-end service quality under variable load.
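The "elastic token-based batching" idea can be illustrated with a small control loop: the per-iteration token budget shrinks when GPU memory or iteration latency approaches the SLO, and grows when there is clear headroom. This is a hypothetical sketch; the function name, thresholds, and multiplicative/additive adjustment rule are illustrative assumptions, not AugServe's actual algorithm.

```python
def elastic_token_budget(current_budget: int,
                         gpu_mem_util: float,
                         recent_iter_latency_ms: float,
                         slo_iter_latency_ms: float,
                         min_budget: int = 256,
                         max_budget: int = 8192) -> int:
    """Return the token budget for the next batching iteration.

    Illustrative policy: back off multiplicatively under memory or
    latency pressure, grow additively when the GPU has headroom.
    """
    if gpu_mem_util > 0.9 or recent_iter_latency_ms > slo_iter_latency_ms:
        # Under pressure: shrink the batch to protect the latency SLO.
        new_budget = int(current_budget * 0.8)
    elif gpu_mem_util < 0.7 and recent_iter_latency_ms < 0.8 * slo_iter_latency_ms:
        # Clear headroom: grow to raise hardware utilization.
        new_budget = current_budget + 256
    else:
        # Near the operating point: hold steady.
        new_budget = current_budget
    return max(min_budget, min(max_budget, new_budget))
```

A serving loop would call this once per iteration with fresh hardware telemetry, so the batch size tracks load instead of being fixed at startup.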

📝 Abstract
As augmented large language models (LLMs) with external tools become increasingly popular in web applications, improving augmented LLM inference serving efficiency and optimizing service-level objectives (SLOs) are critical for enhancing user experience. To achieve this, inference systems must maximize request handling within latency constraints, referred to as increasing effective throughput. However, existing systems face two major challenges: (i) reliance on first-come-first-served (FCFS) scheduling causes severe head-of-line blocking, leading to queuing delays exceeding the SLOs for many requests; and (ii) a static batch token limit that fails to adapt to fluctuating loads and hardware conditions. Both factors degrade effective throughput and service quality. This paper presents AugServe, an efficient inference framework designed to reduce queueing latency and enhance effective throughput for augmented LLM inference services. The core idea of AugServe is a two-stage adaptive request scheduling strategy. Specifically, AugServe combines the inference features of augmented LLM requests to optimize the order of scheduling decisions (stage I). These decisions are continuously refined with runtime information (stage II), adapting to both request characteristics and system capabilities. In addition, AugServe dynamically adjusts the token batching mechanism based on hardware status and real-time load, further enhancing throughput performance. Experimental results show that AugServe achieves 4.7–33.1× and 3.3–13.2× higher effective throughput than vLLM and InferCept, while reducing time-to-first-token (TTFT) by up to 96.3% and 95.0%, respectively.
Problem

Research questions and friction points this paper is trying to address.

Optimizes augmented LLM inference serving efficiency and SLOs
Addresses head-of-line blocking and static batch token limits
Enhances effective throughput and reduces queuing latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage adaptive request scheduling strategy
Dynamic token batching based on hardware status
Optimizes scheduling order using inference features
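The two-stage idea in the bullets above can be sketched as a priority queue: stage I scores each request by predicted cost from its inference features (cheaper jobs first, to curb head-of-line blocking), and stage II re-scores the queue with runtime signals such as waiting time and queue occupancy so long requests cannot starve. All scoring formulas, weights, and names here are illustrative assumptions, not the paper's actual policy.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    score: float                                  # only field used for ordering
    rid: int = field(compare=False)
    input_len: int = field(compare=False)
    est_decode_steps: int = field(compare=False)
    wait_ms: float = field(compare=False, default=0.0)

def stage1_score(input_len: int, est_decode_steps: int) -> float:
    # Stage I: predicted cost from request features; lower runs first.
    return input_len + 2.0 * est_decode_steps

def stage2_rescore(req: Request, queue_occupancy: float) -> float:
    # Stage II: runtime refinement. Waiting time earns a discount
    # (anti-starvation), amplified when the queue is crowded.
    aging = req.wait_ms * (1.0 + queue_occupancy)
    return req.score - 0.1 * aging

def schedule(requests, queue_occupancy: float):
    """Return request ids in dispatch order under the two-stage scores."""
    heap = [Request(stage2_rescore(r, queue_occupancy), r.rid,
                    r.input_len, r.est_decode_steps, r.wait_ms)
            for r in requests]
    heapq.heapify(heap)
    return [heapq.heappop(heap).rid for _ in range(len(heap))]
```

For example, a long request that has waited a while jumps ahead of a short newcomer once the queue fills up, because crowding amplifies its aging discount.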
Ying Wang
Zhejiang University, China
Zhen Jin
Zhejiang University, China
Jiexiong Xu
Zhejiang University, China
Wenhai Lin
Alibaba Group, China
Yiquan Chen
Alibaba Group, China
Wenzhi Chen
Chang Gung University