🤖 AI Summary
In self-hosted environments, static batching configurations for CodeLLM services struggle to handle request-rate fluctuations and heterogeneous workloads, leading to frequent SLA violations and unstable performance. To address this, we propose a dynamic request scheduling framework. Its core innovation is a lightweight SLA feasibility predictor that enables per-request latency estimation and real-time, adaptive batching decisions—eliminating the need for manual hyperparameter tuning. The framework jointly optimizes throughput, latency stability, and resource utilization under constrained hardware conditions. Experimental results demonstrate that, compared to the optimal static configuration, our approach achieves up to 26% higher effective throughput and reduces latency standard deviation by up to 45%. This significantly improves service robustness and SLA compliance, particularly under dynamic and resource-constrained deployment scenarios.
📝 Abstract
Code Large Language Models (CodeLLMs) are increasingly integrated into modern software development workflows, yet serving them efficiently in resource-constrained, self-hosted environments remains a significant challenge. Existing LLM serving systems employ continuous batching to improve throughput, but they rely on static batch-size configurations that cannot adapt to fluctuating request rates or heterogeneous workloads, leading to frequent SLA (Service Level Agreement) violations and unstable performance. In this study, we propose SABER, a dynamic batching strategy that predicts per-request SLA feasibility and adjusts batching decisions in real time. SABER improves goodput by up to 26% over the best static configurations and reduces latency variability by up to 45%, all without manual tuning or service restarts. Our results demonstrate that SLA-aware, adaptive scheduling is key to robust, high-performance CodeLLM serving.
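To make the core idea concrete, here is a minimal sketch of SLA-aware admission control for a continuous-batching scheduler. All names, the linear latency model, and its coefficients are hypothetical illustrations, not SABER's actual predictor (which the paper describes only as a lightweight learned SLA feasibility predictor): a request joins the running batch only if the estimated latency at the enlarged batch size stays within its own deadline and the deadlines of every request already batched.

```python
from dataclasses import dataclass

@dataclass
class Request:
    request_id: str
    prompt_tokens: int
    max_new_tokens: int
    sla_deadline_ms: float  # end-to-end latency budget for this request

def estimate_latency_ms(batch_size: int, req: Request,
                        per_token_ms: float = 9.0,
                        per_batch_overhead_ms: float = 4.0) -> float:
    """Toy linear latency model: per-token decode cost grows with batch size.
    A real predictor would be fit to profiling data for the target hardware."""
    prefill_ms = 0.05 * req.prompt_tokens
    decode_ms = req.max_new_tokens * (per_token_ms + per_batch_overhead_ms * batch_size)
    return prefill_ms + decode_ms

def admit(batch: list, req: Request) -> bool:
    """Admit req only if all SLAs remain feasible at the enlarged batch size."""
    new_size = len(batch) + 1
    if estimate_latency_ms(new_size, req) > req.sla_deadline_ms:
        return False  # the new request itself would miss its deadline
    # Growing the batch slows every in-flight request; re-check their SLAs too.
    return all(estimate_latency_ms(new_size, r) <= r.sla_deadline_ms for r in batch)

batch = []
r1 = Request("r1", prompt_tokens=200, max_new_tokens=100, sla_deadline_ms=3000)
r2 = Request("r2", prompt_tokens=200, max_new_tokens=100, sla_deadline_ms=1200)
if admit(batch, r1):
    batch.append(r1)
print(len(batch), admit(batch, r2))  # r2's tight deadline is infeasible at batch size 2
```

Because the decision is re-evaluated per request against the current batch state, the effective batch size rises and falls with load automatically, which is how a deadline-aware scheduler can avoid both the SLA misses of an oversized static batch and the wasted throughput of an undersized one.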