SCOOT: SLO-Oriented Performance Tuning for LLM Inference Engines

📅 2024-08-08
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Optimizing service-level objectives (SLOs) for large language model (LLM) inference services remains challenging, which hurts user satisfaction and cloud providers' competitiveness. Method: the paper proposes a joint tuning framework integrating single- and multi-objective Bayesian optimization, dynamic implicit-constraint learning, and a parallel suggestion mechanism. The framework supports per-service personalized SLO optimization and combines random forest-based constraint modeling, search-space pruning, and adapter interfaces for mainstream inference engines (vLLM and TensorRT-LLM). Contribution/Results: evaluated in Ant Group's production environment, the framework achieves significantly higher SLO compliance than baseline methods, speeds up tuning several-fold, and generalizes across major LLM inference engines.

📝 Abstract
As large language models (LLMs) are gaining increasing popularity across a wide range of web applications, it is of great importance to optimize service-level objectives (SLOs) for LLM inference services to enhance user satisfaction and improve the competitiveness of cloud vendors. In this paper, we observe that adjusting the parameters of LLM inference engines can improve service performance, and the optimal parameter configurations of different services are different. Therefore, we propose SCOOT, an automatic performance tuning system to optimize SLOs for each LLM inference service by tuning the parameters of the inference engine. SCOOT jointly exploits single-objective and multiple-objective Bayesian optimization (BO) techniques to handle various optimization objectives via exploration and exploitation. Moreover, SCOOT prunes the search space with known constraints and adopts a random forest to learn hidden constraints during the tuning process to mitigate invalid exploration. To improve the tuning efficiency, SCOOT utilizes parallel suggestions to accelerate the tuning process. Extensive experiments demonstrate that SCOOT considerably outperforms existing tuning techniques in SLO optimization while greatly improving the tuning efficiency. Moreover, SCOOT is universally applicable to various LLM inference engines including vLLM and TensorRT-LLM. Currently, SCOOT has already been implemented in the production environment at Ant Group.
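The abstract's tuning loop can be sketched in miniature: propose several candidate configurations per round (parallel suggestion), prune those violating known constraints, skip configurations a learned model of hidden constraints has rejected, and keep the best observed result. This is a simplified random-search stand-in, not SCOOT's actual Bayesian optimization or random-forest model; the parameter names (`max_num_seqs`, `max_batched_tokens`, `gpu_mem_fraction`), constraint rules, and scoring function are all illustrative assumptions.

```python
import random

# Hypothetical search space for an LLM inference engine.
# The knobs and values here are illustrative, not SCOOT's real space.
SPACE = {
    "max_num_seqs": [64, 128, 256, 512],
    "max_batched_tokens": [2048, 4096, 8192],
    "gpu_mem_fraction": [0.7, 0.8, 0.9],
}

def known_constraint(cfg):
    # Search-space pruning: an example *known* constraint that can be
    # checked before launching a trial.
    return cfg["max_batched_tokens"] >= cfg["max_num_seqs"] * 8

def run_trial(cfg):
    # Stand-in for a real benchmark run; returns (succeeded, score).
    # Encodes a *hidden* constraint only observable by running:
    # a large batch at high memory fraction fails (e.g., OOM).
    if cfg["gpu_mem_fraction"] > 0.85 and cfg["max_batched_tokens"] >= 8192:
        return False, 0.0
    score = cfg["max_num_seqs"] * 0.5 + cfg["max_batched_tokens"] * 0.01
    return True, score

def tune(rounds=6, parallel=4, seed=0):
    rng = random.Random(seed)
    failures = []          # observed hidden-constraint violations
    best = (None, 0.0)
    for _ in range(rounds):
        # Parallel suggestion: propose several configs per round.
        batch = []
        while len(batch) < parallel:
            cfg = {k: rng.choice(v) for k, v in SPACE.items()}
            if not known_constraint(cfg):
                continue   # pruned by a known constraint
            if cfg in failures:
                continue   # rejected by the learned hidden-constraint model
            batch.append(cfg)
        for cfg in batch:
            ok, score = run_trial(cfg)
            if not ok:
                failures.append(cfg)   # "learn" the hidden constraint
            elif score > best[1]:
                best = (cfg, score)
    return best

best_cfg, best_score = tune()
```

SCOOT replaces the random proposals with Bayesian optimization acquisition and the failure list with a random-forest classifier that generalizes to unseen configurations, but the control flow is the same.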
Problem

Research questions and friction points this paper is trying to address.

Different LLM inference services have different optimal engine parameter configurations, so no single default works well
Manually tuning inference-engine parameters per service is slow and costly
Naive search wastes trials on invalid configurations that violate hidden constraints (e.g., out-of-memory failures)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly exploits single- and multi-objective Bayesian optimization to handle diverse SLO objectives
Prunes the search space with known constraints and learns hidden constraints via a random forest
Accelerates tuning with parallel suggestions
Engine-agnostic adapters for vLLM and TensorRT-LLM, deployed in production at Ant Group