🤖 AI Summary
LLM serving confronts a fundamental tension between limited resources and unpredictable traffic: static deployment yields low resource utilization, while instance-level dynamic scaling incurs high overhead and lacks fine-grained control. To address this, we propose CoCoServe—the first LLM serving system supporting module-level (e.g., decoder layers or projections) elastic scaling. Its core contributions are: (1) lightweight scaling via module-level replication and runtime migration; (2) an automated, performance–cost–aware scheduling mechanism; and (3) an end-to-end resource optimization framework. Experiments show that CoCoServe reduces latency by 14%–75%, achieves 1.16×–4× the throughput, and cuts deployment cost by up to 46% compared to Hugging Face Transformers and vLLM. These gains significantly enhance serving efficiency, elasticity, and cost-effectiveness.
📝 Abstract
The rise of large language models (LLMs) has created new opportunities across various fields but has also introduced significant challenges in resource management. Current LLM serving systems face a fundamental tension: balancing serving demands against limited resources while adapting to unpredictable traffic patterns. Static deployments lead to suboptimal resource utilization and performance degradation under dynamic workloads. Moreover, the high overhead of adjusting whole instances hinders dynamic scaling, limiting the efficiency that LLM serving could otherwise achieve.
To address this, we propose CoCoServe, an elastic system that enables dynamic and fine-grained scaling. Its key innovation lies in module-level operations that replicate and migrate individual LLM modules, such as decoder layers and projections. Through a comprehensive analysis of the trade-offs associated with these operations, we develop an auto-scaling mechanism that dynamically regulates module-level resource allocation and performance optimization, enabling more cost-effective deployment of LLMs. Our evaluation demonstrates that the scaling operations employed by CoCoServe exhibit excellent scalability and can reduce costs by 46% while maintaining availability. Compared to state-of-the-art LLM serving systems (e.g., Hugging Face Transformers and vLLM), our approach reduces latency by 14%–75% and achieves 1.16×–4× the throughput on average across different model sizes and workloads.
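To make the idea of module-level auto-scaling concrete, the following is a minimal illustrative sketch of the kind of decision loop such a mechanism might run: it inspects per-module latency and replica counts, replicates modules that exceed a latency target, and reclaims replicas from underutilized ones. All names, thresholds, and the policy itself are assumptions for illustration; they are not CoCoServe's actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class ModuleStats:
    """Hypothetical per-module telemetry (not CoCoServe's real schema)."""
    name: str          # module identifier, e.g. a decoder layer or projection
    latency_ms: float  # observed per-module latency
    replicas: int      # current replica count for this module

def plan_scaling(stats, slo_ms, idle_ms):
    """Toy policy: replicate modules violating the latency target (slo_ms),
    and migrate replicas away from modules idling below idle_ms.
    Returns a list of (module_name, action) decisions."""
    actions = []
    for m in stats:
        if m.latency_ms > slo_ms:
            actions.append((m.name, "replicate"))    # scale this module up
        elif m.latency_ms < idle_ms and m.replicas > 1:
            actions.append((m.name, "migrate_out"))  # free one replica
    return actions

# Example: one overloaded decoder layer, one underutilized projection.
decisions = plan_scaling(
    [ModuleStats("decoder_12", 30.0, 1), ModuleStats("proj_out", 2.0, 2)],
    slo_ms=20.0, idle_ms=5.0,
)
# → [("decoder_12", "replicate"), ("proj_out", "migrate_out")]
```

The point of the sketch is the granularity: decisions are made per module rather than per serving instance, which is what allows scaling a single bottleneck layer without duplicating the whole model.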