🤖 AI Summary
To address the high latency, low throughput, and resource inefficiency of large language model (LLM) generative inference services, which stem from substantial memory overhead and intensive computation, this paper proposes the first multi-granularity unified taxonomy covering instance-level, cluster-level, and scenario-level optimizations, systematically surveying over 100 efficient inference techniques. Methodologically, it integrates disaggregated execution paradigms, decoding length prediction, multi-instance load balancing, KV cache optimization, storage offloading, distributed deployment, and cloud-native orchestration, rigorously characterizing the applicability boundaries and performance trade-offs of each technique. The contributions include filling a critical gap in systematic surveys of LLM inference systems, uncovering design principles that jointly optimize scalability, elasticity, and energy efficiency, and providing both theoretical foundations and practical guidelines for building industrial-grade inference systems with low latency, high throughput, and high resource utilization.
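The "substantial memory overhead" that motivates KV cache optimization and storage offloading can be made concrete with a back-of-the-envelope calculation. The sketch below is not from the paper; the model configuration (32 layers, 32 KV heads, head dimension 128, fp16) is an illustrative assumption roughly matching a 7B-class transformer.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Estimate KV cache size for ONE request.

    Per generated/prompt token, each layer stores one key and one value
    vector per KV head, i.e. 2 * num_kv_heads * head_dim elements.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical 7B-class config, 4096-token context, fp16 (2 bytes/element):
gb = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                    seq_len=4096) / 2**30
print(f"{gb:.1f} GiB per request")  # 2.0 GiB
```

At 2 GiB per 4K-token request, a few dozen concurrent requests exhaust an 80 GiB GPU on cache alone, which is why batching capacity, and hence throughput, is often memory-bound rather than compute-bound during decoding.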
📝 Abstract
Large Language Models (LLMs) for Generative AI have achieved remarkable progress, evolving into sophisticated and versatile tools widely adopted across domains and applications. However, the substantial memory overhead caused by their vast parameter counts, combined with the high computational demands of the attention mechanism, makes it challenging to achieve low latency and high throughput in LLM inference services. Recent research breakthroughs have rapidly advanced this field. This paper provides a comprehensive survey of these methods, covering fundamental instance-level approaches, in-depth cluster-level strategies, emerging scenario directions, and other miscellaneous but important areas. At the instance level, we review model placement, request scheduling, decoding length prediction, storage management, and the disaggregation paradigm. At the cluster level, we explore GPU cluster deployment, multi-instance load balancing, and cloud service solutions. For emerging scenarios, we organize the discussion around specific tasks, modules, and auxiliary methods. To ensure a holistic overview, we also highlight several niche yet critical areas. Finally, we outline potential research directions to further advance LLM inference serving.