SLA-Awareness for AI-assisted coding

📅 2025-03-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
AI programming assistants deployed on shared clusters must simultaneously satisfy heterogeneous latency constraints (e.g., Time-To-First-Token, End-to-End) across diverse coding tasks (code completion, translation, summarization) while maximizing GPU resource utilization. This paper introduces CATO, an end-to-end SLA-aware runtime scheduling framework for such workloads. CATO integrates latency-sensitive task classification, adaptive batching, priority-aware scheduling, and GPU memory reuse to provide fine-grained, per-task latency guarantees and dynamic resource sharing. Evaluated against state-of-the-art systems while serving all task types concurrently, CATO improves Goodput for TTFT-critical tasks by up to 10%, raises resource utilization by up to 41.1%, reduces P95 E2E latency for code summarization by 18%, and lowers P95 TTFT for code generation by 14%. These results alleviate the latency unpredictability inherent in Model-as-a-Service deployments under concurrent inference workloads.
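The summary above describes classifying requests by their dominant latency constraint (TTFT vs. E2E) and scheduling them accordingly. As a minimal sketch of that idea — not CATO's actual implementation — the following assumes illustrative per-task SLA budgets and uses an earliest-deadline-first queue; all task names and budget values are hypothetical:

```python
import heapq
from dataclasses import dataclass, field
from enum import Enum

class SLAClass(Enum):
    TTFT_CRITICAL = 0  # e.g. code completion: first token must arrive quickly
    E2E_CRITICAL = 1   # e.g. code translation: total completion time matters

# Hypothetical per-task SLA budgets in seconds (not from the paper).
SLA_BUDGET = {
    "completion":    (SLAClass.TTFT_CRITICAL, 0.2),
    "generation":    (SLAClass.TTFT_CRITICAL, 0.3),
    "summarization": (SLAClass.E2E_CRITICAL,  2.0),
    "translation":   (SLAClass.E2E_CRITICAL,  3.0),
}

@dataclass(order=True)
class Request:
    deadline: float                    # absolute deadline; orders the heap
    task: str = field(compare=False)
    arrival: float = field(compare=False)

class SLAScheduler:
    """Toy earliest-deadline-first scheduler over SLA-classified requests."""

    def __init__(self):
        self._queue = []

    def submit(self, task: str, now: float) -> None:
        # Classify the task and derive its absolute deadline from its budget.
        _sla_class, budget = SLA_BUDGET[task]
        heapq.heappush(self._queue, Request(now + budget, task, now))

    def next_batch(self, max_batch: int) -> list:
        # Pop up to max_batch requests, most urgent deadlines first.
        batch = []
        while self._queue and len(batch) < max_batch:
            batch.append(heapq.heappop(self._queue))
        return batch
```

For example, a code-completion request submitted at the same instant as a translation request is dequeued first, because its tighter TTFT budget yields an earlier deadline.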

📝 Abstract
The integration of AI-assisted coding tools within development environments drastically reduces development time and allows developers to focus more on creative and critical aspects of software engineering through the use of Code Large Language Models (CodeLLMs). These coding assistants automate repetitive and time-consuming coding tasks such as code generation, code completion, code summarization, and code translation. Responsiveness is a crucial requirement of these coding assistants to maintain real-time interactivity, such that their use does not impede the developers' workflows. Different coding tasks have unique characteristics and latency requirements: Time-To-First-Token (TTFT) latency is essential for code completion tasks, while End-To-End (E2E) latency is crucial for code translation tasks. Managing these varying requirements simultaneously while optimizing resource usage poses significant challenges. Existing work adopts the Model-as-a-Service paradigm for serving individual CodeLLMs, but cannot effectively manage the latency requirements of concurrent coding tasks and sequences of CodeLLM inference calls, due to a lack of end-to-end latency awareness. Another challenge is keeping resource utilization high when the serving system is deployed in a shared cluster environment. To address these challenges, we propose Coding Assistant Task Orchestrator (CATO), a runtime system designed to serve a diverse assortment of coding tasks while meeting latency requirements and maximizing resource utilization. Our experiments demonstrate that when all types of coding tasks were served simultaneously, CATO improved the overall Goodput rate and resource utilization for TTFT-critical tasks by up to 10% and 41.1%, respectively. P95 E2E latency was also reduced by 18% for code summarization tasks, and P95 TTFT for code generation tasks was reduced by 14%, compared against state-of-the-art systems.
Problem

Research questions and friction points this paper is trying to address.

Managing varying latency requirements for AI-assisted coding tasks
Optimizing resource utilization in shared cluster environments
Ensuring real-time interactivity without impeding developer workflows
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses CodeLLMs for AI-assisted coding automation
Implements SLA-aware task orchestrator (CATO)
Optimizes latency and resource utilization simultaneously
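One of the listed contributions, adaptive batching, can be illustrated with a short sketch. The policy below is a plausible form of the idea, not CATO's published algorithm; the memory-per-request constant and batch cap are hypothetical:

```python
def adaptive_batch_size(queue_depth: int,
                        free_gpu_mem_mb: int,
                        mem_per_request_mb: int = 512,
                        max_batch: int = 32) -> int:
    """Choose a batch size bounded by queue depth, free GPU memory,
    and a hard cap.

    All constants are illustrative; the paper does not publish CATO's
    actual batching parameters.
    """
    # How many requests the remaining GPU memory can hold at once.
    mem_limit = free_gpu_mem_mb // mem_per_request_mb
    # Never batch more than is queued, fits in memory, or the cap allows;
    # always serve at least one request to keep making progress.
    return max(1, min(queue_depth, mem_limit, max_batch))
```

For instance, with 10 queued requests and 2048 MB free, the memory bound (2048 // 512 = 4) dominates and the batch size is 4; with abundant memory, the cap of 32 applies instead.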