🤖 AI Summary
LLM serving confronts a fundamental tension between limited resources and unpredictable traffic: static deployment yields low resource utilization, while instance-level dynamic scaling incurs high overhead and lacks fine-grained control. To address this, we propose CoCoServe—the first LLM serving system supporting module-level (e.g., decoder layers or projections) elastic scaling. Its core contributions are: (1) lightweight scaling via module-level replication and runtime migration; (2) an automated, performance–cost–aware scheduling mechanism; and (3) an end-to-end resource optimization framework. Experiments show that CoCoServe reduces latency by 14%–75%, achieves 1.16×–4× the throughput, and cuts deployment cost by up to 46% compared to Hugging Face Transformers and vLLM. These gains significantly enhance serving efficiency, elasticity, and cost-effectiveness.
📝 Abstract
The rise of large language models (LLMs) has created new opportunities across various fields but has also introduced significant challenges in resource management. Current LLM serving systems face a fundamental tension: balancing serving demands against limited resources while adapting to unpredictable traffic patterns. Static deployments lead to suboptimal resource utilization and performance degradation under dynamic workloads. Moreover, the high overhead of adjusting whole instances hinders dynamic scaling, limiting the efficiency that LLM serving could otherwise achieve.
To address this, we propose CoCoServe, an elastic system that enables dynamic and fine-grained scaling. Its key innovation lies in module-level operations that replicate and migrate individual LLM modules, such as decoder layers and projections. Through a comprehensive analysis of the trade-offs associated with these operations, we develop an auto-scaling mechanism that dynamically regulates module-level resource allocation and performance optimization, enabling more cost-effective deployment of LLMs. Our evaluation demonstrates that the scaling operations employed by CoCoServe exhibit excellent scalability and can reduce costs by 46% while maintaining availability. Compared to state-of-the-art LLM serving systems (e.g., Hugging Face Transformers and vLLM), our approach reduces latency by 14%–75% and achieves 1.16×–4× the throughput on average across different model sizes and workloads.
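To make the idea of module-level auto-scaling concrete, the following is a minimal illustrative sketch of the kind of decision loop such a mechanism might run: it inspects per-module latency and replica counts, replicates modules that exceed a latency target, and reclaims replicas from underutilized ones. All names, thresholds, and the policy itself are assumptions for illustration; they are not CoCoServe's actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class ModuleStats:
    """Hypothetical per-module telemetry (not CoCoServe's real schema)."""
    name: str          # module identifier, e.g. a decoder layer or projection
    latency_ms: float  # observed per-module latency
    replicas: int      # current replica count for this module

def plan_scaling(stats, slo_ms, idle_ms):
    """Toy policy: replicate modules violating the latency target (slo_ms),
    and migrate replicas away from modules idling below idle_ms.
    Returns a list of (module_name, action) decisions."""
    actions = []
    for m in stats:
        if m.latency_ms > slo_ms:
            actions.append((m.name, "replicate"))    # scale this module up
        elif m.latency_ms < idle_ms and m.replicas > 1:
            actions.append((m.name, "migrate_out"))  # free one replica
    return actions

# Example: one overloaded decoder layer, one underutilized projection.
decisions = plan_scaling(
    [ModuleStats("decoder_12", 30.0, 1), ModuleStats("proj_out", 2.0, 2)],
    slo_ms=20.0, idle_ms=5.0,
)
# → [("decoder_12", "replicate"), ("proj_out", "migrate_out")]
```

The point of the sketch is the granularity: decisions are made per module rather than per serving instance, which is what allows scaling a single bottleneck layer without duplicating the whole model.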