Adaptively Robust LLM Inference Optimization under Prediction Uncertainty

📅 2025-08-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high latency, resource waste, and poor energy efficiency in large language model (LLM) inference scheduling that stem from unknown output lengths, this paper proposes an adaptive online scheduling method based on prediction intervals. The core contribution is the algorithm $\mathcal{A}_{\min}$, which relies solely on lower-bound predictions of output length, avoiding the excessive conservatism of conventional upper-bound-driven strategies; the authors theoretically establish a logarithmic competitive ratio, significantly enhancing robustness. The method integrates lightweight machine-learning-based prediction, dynamic length estimation, and runtime correction, initializing scheduling with the minimal predicted length and continuously refining it. Experiments demonstrate that the approach achieves total latency and energy efficiency close to hindsight-optimal across diverse workloads and prediction-error regimes, substantially outperforming baseline schedulers and, in particular, maintaining stable, high performance even under degraded prediction accuracy.


📝 Abstract
We study the problem of optimizing Large Language Model (LLM) inference scheduling to minimize total latency. LLM inference is an online, multi-task service process in which a pre-trained LLM processes input requests and generates output tokens sequentially, and it is also heavily energy consuming. It is therefore vital to improve scheduling efficiency and reduce power consumption when a large volume of prompt requests is arriving. A key challenge in LLM inference scheduling is that while the prompt length is known upon arrival, the output length, which critically impacts memory usage and processing time, is unknown. To address this uncertainty, we propose algorithms that leverage machine learning to predict output lengths, assuming the prediction provides an interval classification (min-max range) for each request. We first design a conservative algorithm, $\mathcal{A}_{max}$, which schedules requests based on the upper bound of predicted output lengths to prevent memory overflow. However, this approach is overly conservative: as prediction accuracy decreases, performance degrades significantly due to potential overestimation. To overcome this limitation, we propose $\mathcal{A}_{min}$, an adaptive algorithm that initially treats the predicted lower bound as the output length and dynamically refines this estimate during inference. We prove that $\mathcal{A}_{min}$ achieves a log-scale competitive ratio. Through numerical simulations, we demonstrate that $\mathcal{A}_{min}$ often performs nearly as well as the hindsight scheduler, highlighting both its efficiency and robustness in practical scenarios. Moreover, $\mathcal{A}_{min}$ relies solely on the lower bound of the prediction interval--an advantageous design choice, since upper bounds on output length are typically more challenging to predict accurately.
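The adaptive strategy described for $\mathcal{A}_{min}$ — start from the predicted lower bound and refine the estimate whenever generation outruns it — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the shortest-estimate-first queue, the doubling refinement rule, and all names (`Request`, `adaptive_schedule`) are assumptions made for the sketch.

```python
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class Request:
    est_len: int   # working estimate of output length, initialized to the predicted lower bound
    rid: int = field(compare=False)       # request identifier
    true_len: int = field(compare=False)  # actual output length; unknown to the scheduler,
                                          # used here only to simulate generation
    done: int = field(compare=False, default=0)

def adaptive_schedule(requests):
    """Sketch of a lower-bound-driven adaptive scheduler.

    Each request starts with its predicted lower bound as its length
    estimate. If generation finishes the estimated budget without the
    request completing, the estimate is doubled and the request is
    re-queued (an illustrative refinement rule, not the paper's exact
    update). Returns request ids in completion order.
    """
    heap = list(requests)
    heapq.heapify(heap)  # shortest current estimate first
    completed = []
    while heap:
        req = heapq.heappop(heap)
        # Generate tokens up to the current estimate.
        req.done = min(req.true_len, req.est_len)
        if req.done < req.true_len:
            req.est_len *= 2          # estimate was too low: refine and re-queue
            heapq.heappush(heap, req)
        else:
            completed.append(req.rid)
    return completed
```

With two requests whose predicted lower bounds are 2 and 5 but whose true lengths are 7 and 5, the short-looking request is tried first, found incomplete, re-queued with a doubled estimate, and ultimately finishes last — the behavior that lets the scheduler avoid committing to pessimistic upper bounds up front.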
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLM inference scheduling to minimize latency
Addressing output length uncertainty in LLM request processing
Reducing energy consumption while handling prompt requests
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive algorithm dynamically refines output length estimates
Leverages machine learning for output length interval predictions
Uses lower bound prediction to enhance efficiency and robustness
Zixi Chen
Department of Mathematics, Peking University
Yinyu Ye
Professor Emeritus, Stanford University; Visiting Professor at SJTU, CUHKSZ, and HKUST
Optimization - Operations Research - Mathematical Programming - Computational Science
Zijie Zhou
Department of Industrial Engineering and Decision Analytics, HKUST