Chameleon: Adaptive Caching and Scheduling for Many-Adapter LLM Inference Environments

📅 2024-11-24
🏛️ Proceedings of the 2025 58th IEEE/ACM International Symposium on Microarchitecture
📈 Citations: 5
Influential: 1
🤖 AI Summary
Existing LLM inference systems that serve many LoRA adapters concurrently suffer from three key limitations: (1) the scheduler does not capture workload heterogeneity, (2) frequent adapter loading creates a CPU–GPU memory bandwidth bottleneck, and (3) head-of-line blocking degrades latency and throughput. This paper proposes an adapter caching mechanism that places adapters in otherwise-idle GPU memory, incurring no extra GPU memory overhead, and combines heat-aware caching policies, on-demand LoRA weight loading, and GPU memory reuse. It further designs a multi-priority, non-preemptive, adapter-aware scheduler that eliminates both starvation and head-of-line blocking. Experiments show that the system reduces P99 and P50 time-to-first-token latency by 80.7% and 48.1%, respectively, and improves throughput by 1.5× over state-of-the-art baselines.
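The heat-aware caching idea described above can be sketched as a small policy class. This is a minimal illustration, not the paper's implementation: the decayed hit counter ("heat"), the capacity unit (whole adapters rather than bytes), and the `shrink` hook that models inference reclaiming idle GPU memory are all assumptions made for clarity.

```python
from collections import defaultdict

class AdapterCache:
    """Sketch of a heat-aware adapter cache: keeps the hottest adapters
    resident in (otherwise idle) GPU memory and evicts the coldest when
    space runs out. Decay scheme and names are illustrative, not from
    the paper."""

    def __init__(self, capacity, decay=0.9):
        self.capacity = capacity        # adapters that fit in idle GPU memory
        self.decay = decay              # ages old hits so heat tracks recency
        self.heat = defaultdict(float)  # per-adapter popularity score
        self.resident = set()           # adapters currently in GPU memory

    def request(self, adapter_id):
        # Age all counters, then credit the requested adapter.
        for a in self.heat:
            self.heat[a] *= self.decay
        self.heat[adapter_id] += 1.0
        if adapter_id in self.resident:
            return "hit"                 # no CPU-GPU transfer needed
        if len(self.resident) >= self.capacity:
            coldest = min(self.resident, key=lambda a: self.heat[a])
            self.resident.discard(coldest)  # evict coldest adapter
        self.resident.add(adapter_id)       # load weights over CPU-GPU link
        return "miss"

    def shrink(self, new_capacity):
        # Inference needs its memory back: evict coldest adapters first,
        # so caching never imposes extra GPU memory cost.
        self.capacity = new_capacity
        while len(self.resident) > self.capacity:
            coldest = min(self.resident, key=lambda a: self.heat[a])
            self.resident.discard(coldest)
```

Because the cache only borrows idle memory and yields it back via `shrink`, popular adapters stay resident (hits) while rare ones pay the load cost (misses), which is the mechanism behind the reduced adapter-loading traffic.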

📝 Abstract
The effectiveness of LLMs has triggered an exponential rise in their deployment, imposing substantial demands on inference clusters. Such clusters often handle numerous concurrent queries for different LLM downstream tasks. To handle multi-task settings with vast LLM parameter counts, Low-Rank Adaptation (LoRA) enables task-specific fine-tuning while sharing most of the base LLM model across tasks. Hence, it supports concurrent task serving with reduced memory requirements. However, existing designs face inefficiencies: they overlook workload heterogeneity, impose high CPU–GPU link bandwidth demands from frequent adapter loading, and suffer from head-of-line blocking in their schedulers. To address these challenges, we present Chameleon, a novel LLM serving system optimized for many-adapter environments. Chameleon introduces two new ideas: adapter caching and adapter-aware scheduling. First, Chameleon caches popular adapters in GPU memory, minimizing adapter loading times. For caching, it uses otherwise idle GPU memory, avoiding extra memory costs. Second, Chameleon uses a non-preemptive multi-queue scheduler to efficiently account for workload heterogeneity. In this way, Chameleon simultaneously prevents head-of-line blocking and starvation. Under high loads, Chameleon reduces the P99 and P50 TTFT latencies by 80.7% and 48.1%, respectively, over a state-of-the-art baseline, while improving the throughput by 1.5×.
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLM inference clusters for concurrent multi-task query serving
Addressing inefficiencies from workload heterogeneity and frequent adapter loading
Solving head-of-line blocking and starvation in LLM scheduling systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Caches popular adapters in GPU memory
Uses non-preemptive multi-queue scheduling
Minimizes adapter loading times without extra GPU memory cost