Taming the Titans: A Survey of Efficient LLM Inference Serving

📅 2025-04-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high latency, low throughput, and resource inefficiency in large language model (LLM) generative inference services, which stem from substantial memory overhead and intensive computation, this paper proposes the first multi-granularity unified taxonomy covering instance-level, cluster-level, and scenario-level optimizations, systematically surveying over 100 efficient inference techniques. Methodologically, it integrates decoupled execution paradigms (prefill/decode disaggregation), decoding length prediction, multi-instance load balancing, KV cache optimization, storage offloading, distributed deployment, and cloud-native orchestration, characterizing the applicability boundaries and performance trade-offs of each technique. Its contributions include filling a critical gap in systematic surveys of LLM inference systems, distilling design principles that jointly optimize scalability, elasticity, and energy efficiency, and providing both theoretical foundations and practical guidelines for building industrial-grade inference systems with low latency, high throughput, and high resource utilization.
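The decoupled execution paradigm mentioned above splits serving into a compute-bound prefill stage (process the whole prompt once, building the KV cache) and a memory-bound decode stage (generate tokens one at a time, reusing and extending that cache). A minimal single-process sketch of the handoff, with hashed tokens standing in for real attention states (all names and the placeholder generation logic are illustrative, not from the paper):

```python
from queue import Queue

def prefill_stage(requests, handoff: Queue):
    """Compute-bound stage: run the full prompt once to build a KV cache."""
    for req in requests:
        kv_cache = [hash(tok) for tok in req.split()]  # stand-in for per-token attention states
        handoff.put((req, kv_cache))  # in real disaggregated serving, this crosses instances

def decode_stage(handoff: Queue, max_new_tokens=2):
    """Memory-bound stage: generate tokens autoregressively from the cache."""
    results = {}
    while not handoff.empty():
        req, kv_cache = handoff.get()
        out = []
        for _ in range(max_new_tokens):
            out.append(f"<tok{len(kv_cache)}>")  # placeholder for a sampled token
            kv_cache.append(None)                # cache grows by one entry per new token
        results[req] = "".join(out)
    return results
```

In production systems the two stages run on separate GPU instances and the KV cache is transferred or shared between them, which is what makes independent scaling of prefill and decode capacity possible.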

📝 Abstract
Large Language Models (LLMs) for Generative AI have achieved remarkable progress, evolving into sophisticated and versatile tools widely adopted across various domains and applications. However, the substantial memory overhead caused by their vast number of parameters, combined with the high computational demands of the attention mechanism, poses significant challenges in achieving low latency and high throughput for LLM inference services. Recent advancements, driven by groundbreaking research, have significantly accelerated progress in this field. This paper provides a comprehensive survey of these methods, covering fundamental instance-level approaches, in-depth cluster-level strategies, emerging scenario directions, and other miscellaneous but important areas. At the instance level, we review model placement, request scheduling, decoding length prediction, storage management, and the disaggregation paradigm. At the cluster level, we explore GPU cluster deployment, multi-instance load balancing, and cloud service solutions. For emerging scenarios, we organize the discussion around specific tasks, modules, and auxiliary methods. To ensure a holistic overview, we also highlight several niche yet critical areas. Finally, we outline potential research directions to further advance the field of LLM inference serving.
Problem

Research questions and friction points this paper is trying to address.

Reducing memory overhead from LLM parameters
Optimizing computational demands of attention mechanism
Improving latency and throughput in LLM inference
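To see why memory overhead dominates these problems, note that beyond the weights, the KV cache grows linearly with batch size and sequence length. A back-of-envelope sketch (the model dimensions below are hypothetical, chosen to resemble a 7B-class model, and are not taken from the paper):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Rough KV cache footprint: one key and one value vector
    per layer, per head, per token (fp16 -> 2 bytes/element)."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Hypothetical model: 32 layers, 32 KV heads, head_dim 128, fp16,
# serving a batch of 8 requests at 4096 tokens each.
gib = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8) / 1024**3
print(f"{gib:.0f} GiB of KV cache")  # 16 GiB, on top of the model weights
```

This is why techniques like KV cache compression, paging, and storage offloading feature so prominently in the survey's taxonomy.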
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multi-granularity taxonomy spanning instance-, cluster-, and scenario-level techniques
Systematic review of model placement, request scheduling, and the disaggregation paradigm
Coverage of GPU cluster deployment, multi-instance load balancing, and cloud service solutions
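One recurring instance-level idea the survey covers is decoding length prediction: if the scheduler can estimate how many tokens each request will generate, it can order requests to cut average waiting time. A minimal predicted-shortest-job-first sketch (the length predictor here is a stand-in callable, not any specific method from the paper):

```python
import heapq

def schedule(requests, predict_len):
    """Order requests by ascending predicted output length (shortest-job-first)."""
    # Tie-break on arrival index so equal predictions keep FIFO order.
    heap = [(predict_len(req), i, req) for i, req in enumerate(requests)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, req = heapq.heappop(heap)
        order.append(req)
    return order

# Stand-in predictor: pretend longer prompts yield longer answers.
order = schedule(["explain transformers", "hi", "summarize this book"],
                 predict_len=len)
```

Real systems pair a learned length predictor with preemption or batching on top of this ordering, since mispredictions would otherwise let a long request block the queue.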