Adaptive Rescheduling in Prefill-Decode Disaggregated LLM Inference

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Dynamic output length in LLM inference causes severe load imbalance during decoding, leading to SLO violations and out-of-memory (OOM) errors. To address this, we propose an adaptive rescheduling system based on output-length prediction. Our approach introduces the first lightweight, continuous predictor that leverages native LLM hidden states—without requiring auxiliary tokens or fine-tuning—to model remaining generation length with fine-grained accuracy and minimal overhead. Based on real-time predictions, the system dynamically reallocates prefill and decode resources to enable load-aware scheduling. Experiments demonstrate a 49.42% reduction in mean absolute error (MAE) for length prediction, a 93.28% decrease in predictor parameter count, a 74.77% reduction in P99 time-per-output-token (TPOT), and up to a 2.24× improvement in goodput—achieving substantial gains in throughput and latency while maintaining system stability.
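The summary describes a lightweight predictor that maps the LLM's native hidden states to a remaining-generation-length estimate. As a rough illustration (not the paper's actual architecture), a minimal version of this idea is a small regression head fit on decode-step hidden states; all sizes, names, and the synthetic data below are assumptions for the sketch.

```python
import numpy as np

# Hypothetical sketch: a linear head that regresses remaining output length
# from an LLM's last-layer hidden state. HIDDEN and the synthetic data are
# illustrative; the paper's real predictor and training setup are not shown here.
rng = np.random.default_rng(0)
HIDDEN = 64   # illustrative hidden-state width
N = 512       # number of observed decode steps

true_w = rng.normal(size=HIDDEN)
H = rng.normal(size=(N, HIDDEN))        # stand-in decode-step hidden states
y = H @ true_w * 8.0 + 120.0            # stand-in "remaining tokens" targets

# Fit a linear head with a bias term in closed form. In a serving system this
# could instead be updated online (e.g., SGD) so the predictor stays current.
X = np.hstack([H, np.ones((N, 1))])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

pred = X @ w
mae = float(np.mean(np.abs(pred - y)))
print(f"synthetic MAE: {mae:.4f} tokens")
```

Because the head reuses hidden states the model already computes, the added per-step cost is a single small matrix-vector product, which is consistent with the "minimal overhead" claim above.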

📝 Abstract
Large Language Model (LLM) inference has emerged as a fundamental paradigm. In real-world scenarios, variations in output length cause severe workload imbalance in the decode phase, particularly for long-output reasoning tasks. Existing systems, such as PD (prefill-decode) disaggregation architectures, rely on static prefill-to-decode scheduling, which often results in SLO violations and OOM failures under evolving decode workloads. In this paper, we propose ARES, an adaptive decode rescheduling system powered by length prediction to anticipate future workloads. Our core contributions include: (1) a lightweight, continuous LLM-native prediction method that leverages the LLM's hidden states to model remaining generation length with high precision (reducing MAE by 49.42%) and low overhead (cutting predictor parameters by 93.28%); (2) a rescheduling solution for the decode phase: a dynamic balancing mechanism that integrates current and predicted workloads, reducing P99 TPOT by 74.77% and achieving up to 2.24x higher goodput.
Problem

Research questions and friction points this paper is trying to address.

Addressing workload imbalance in LLM decode phase
Improving static scheduling in disaggregated inference systems
Reducing SLO violations and OOM failures during generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight LLM-native prediction method for length forecasting
Dynamic workload balancing using current and predicted loads
Rescheduling solution reducing latency and increasing goodput
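The dynamic balancing idea above can be illustrated with a toy placement policy: route each request to the decode instance with the smallest total predicted remaining work, instead of static or round-robin assignment. This is a hedged sketch of the general technique; the function name, numbers, and greedy policy are assumptions, not ARES's actual algorithm.

```python
import heapq

def place_requests(predicted_lengths, n_instances):
    """Greedily assign requests (given as predicted remaining token counts)
    to decode instances, always picking the least-loaded instance so far."""
    heap = [(0, i) for i in range(n_instances)]   # (predicted load, instance id)
    heapq.heapify(heap)
    assignment = {}
    for req_id, length in enumerate(predicted_lengths):
        load, inst = heapq.heappop(heap)          # current least-loaded instance
        assignment[req_id] = inst
        heapq.heappush(heap, (load + length, inst))
    # Summarize per-instance predicted load for inspection.
    loads = [0] * n_instances
    for req_id, inst in assignment.items():
        loads[inst] += predicted_lengths[req_id]
    return assignment, loads

# Illustrative predicted remaining lengths for six in-flight requests.
assignment, loads = place_requests([900, 120, 450, 300, 880, 60], n_instances=2)
print(loads)  # per-instance predicted token load
```

A real system would also fold in each instance's current (measured) load and memory headroom, which is exactly the "current plus predicted workloads" combination the bullet describes.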
👥 Authors
Zhibin Wang (Zhejiang University)
Zetao Hong (State Key Laboratory for Novel Software Technology, Nanjing University)
Xue Li (Alibaba Group)
Zibo Wang (State Key Laboratory for Novel Software Technology, Nanjing University)
Shipeng Li (State Key Laboratory for Novel Software Technology, Nanjing University)
Qingkai Meng (Assistant Professor, Nanjing University; Tsinghua University; UW-Madison)
Qing Wang (State Key Laboratory for Novel Software Technology, Nanjing University)
Chengying Huan (State Key Laboratory for Novel Software Technology, Nanjing University)
Rong Gu (Mälardalen University)
Sheng Zhong (Nanjing University)
Chen Tian (Professor, Nanjing University)