🤖 AI Summary
In multi-tenant large language model (LLM) serving, frequent adapter loading incurs high GPU resource overhead and degrades throughput. To address this, we propose a workload-aware dynamic adapter caching optimization method. Our approach features two key innovations: (1) the first reproducible, high-fidelity digital twin model for LLM-adapter serving, enabling online performance modeling and quantification of loading overhead; and (2) an AI-driven analytical pipeline that combines real-time load forecasting with global cache-allocation decisions, optimizing cache placement both within a single node and across multiple service replicas. Experimental evaluation demonstrates that our digital twin predicts throughput with no more than 5.5% SMAPE error, that cache-policy generation is both fast and accurate, and that the overall system achieves significant improvements in GPU utilization and end-to-end throughput.
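Since the digital twin's accuracy is reported in SMAPE (symmetric mean absolute percentage error), a minimal sketch of how that metric is computed may help. This uses a common SMAPE definition (mean of absolute error over the average of absolute actual and predicted values); the paper's exact variant and the throughput numbers below are assumptions for illustration.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent.

    Common definition: mean of |a - p| / ((|a| + |p|) / 2).
    The summary's exact SMAPE variant is not specified, so this
    is one standard formulation.
    """
    assert len(actual) == len(predicted) and actual
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2
        total += abs(a - p) / denom if denom else 0.0
    return 100.0 * total / len(actual)

# Hypothetical measured vs. digital-twin throughput (tokens/s).
real = [1200.0, 980.0, 1105.0]
twin = [1150.0, 1010.0, 1080.0]
print(f"SMAPE: {smape(real, twin):.2f}%")  # → SMAPE: 3.19%
```

A SMAPE at or below 5.5% thus means the twin's throughput predictions track the real system closely across the evaluated workloads.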
📝 Abstract
Serving LLM adapters has gained significant attention as an effective approach to adapting general-purpose language models to diverse, task-specific use cases. However, serving a wide range of adapters introduces substantial overheads, leading to performance degradation and making optimal placement challenging. To address these challenges, we present an analytical, AI-driven pipeline that accurately determines the optimal allocation of adapters in single-node setups. This allocation maximizes performance and makes effective use of GPU resources while preventing request starvation. Crucially, the proposed allocation is derived from current workload patterns. These single-node insights can be leveraged in multi-replica deployments for overall placement, load balancing, and server configuration, ultimately enhancing overall performance and improving resource efficiency. Our approach builds on an in-depth analysis of LLM adapter serving, accounting for overheads and performance variability, and includes the development of the first Digital Twin capable of replicating online LLM-adapter serving systems while matching key performance metrics. The experimental results demonstrate that the Digital Twin achieves a throughput SMAPE of no more than 5.5% relative to real measurements, and that the proposed pipeline accurately predicts the optimal placement with minimal latency.
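To make the idea of workload-aware, budget-constrained adapter placement concrete, here is a minimal illustrative sketch: a greedy policy that caches the adapters with the highest forecast demand per MB of GPU memory until the budget is exhausted. This is an assumed simplification for illustration only, not the paper's actual allocation algorithm; the adapter names, sizes, and request-rate forecasts are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Adapter:
    name: str
    size_mb: int          # GPU memory footprint when cached
    forecast_rps: float   # predicted request rate (hypothetical forecaster output)

def allocate_cache(adapters, budget_mb):
    """Greedy workload-aware allocation: keep the adapters with the
    highest predicted demand per MB resident in GPU memory, subject
    to a total cache budget. Illustrative sketch only."""
    ranked = sorted(adapters, key=lambda a: a.forecast_rps / a.size_mb, reverse=True)
    cached, used = [], 0
    for a in ranked:
        if used + a.size_mb <= budget_mb:
            cached.append(a.name)
            used += a.size_mb
    return cached

pool = [Adapter("lora-chat", 400, 120.0),
        Adapter("lora-code", 600, 90.0),
        Adapter("lora-sql", 300, 20.0)]
print(allocate_cache(pool, budget_mb=1000))  # → ['lora-chat', 'lora-code']
```

A real pipeline of the kind the abstract describes would replace the static `forecast_rps` values with online load forecasts and would also account for loading overheads and starvation avoidance, which this greedy sketch omits.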