🤖 AI Summary
To address the high latency, low throughput, and resource inefficiency of large language model (LLM) generative inference services, which stem from substantial memory overhead and intensive computation, this paper proposes the first multi-granularity unified taxonomy covering instance-level, cluster-level, and scenario-level optimizations, systematically surveying over 100 efficient inference techniques. Methodologically, it integrates disaggregated execution paradigms, decoding length prediction, multi-instance load balancing, KV cache optimization, storage offloading, distributed deployment, and cloud-native orchestration, rigorously characterizing the applicability boundaries and performance trade-offs of each technique. The contributions include filling a critical gap in systematic surveys of LLM inference systems, uncovering design principles that jointly optimize scalability, elasticity, and energy efficiency, and providing both theoretical foundations and practical guidelines for building industrial-grade inference systems with low latency, high throughput, and high resource utilization.
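The "substantial memory overhead" that motivates KV cache optimization and storage offloading can be made concrete with a back-of-the-envelope calculation. The sketch below is not from the paper; the model configuration (32 layers, 32 KV heads, head dimension 128, fp16) is an illustrative assumption roughly matching a 7B-class transformer.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Estimate KV cache size for ONE request.

    Per generated/prompt token, each layer stores one key and one value
    vector per KV head, i.e. 2 * num_kv_heads * head_dim elements.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical 7B-class config, 4096-token context, fp16 (2 bytes/element):
gb = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                    seq_len=4096) / 2**30
print(f"{gb:.1f} GiB per request")  # 2.0 GiB
```

At 2 GiB per 4K-token request, a few dozen concurrent requests exhaust an 80 GiB GPU on cache alone, which is why batching capacity, and hence throughput, is often memory-bound rather than compute-bound during decoding.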
📝 Abstract
Large Language Models (LLMs) for Generative AI have achieved remarkable progress, evolving into sophisticated and versatile tools widely adopted across domains and applications. However, the substantial memory overhead caused by their vast parameter counts, combined with the high computational demands of the attention mechanism, makes it challenging to achieve low latency and high throughput in LLM inference services. Recent research breakthroughs have rapidly advanced this field. This paper provides a comprehensive survey of these methods, covering fundamental instance-level approaches, in-depth cluster-level strategies, emerging scenario directions, and other miscellaneous but important areas. At the instance level, we review model placement, request scheduling, decoding length prediction, storage management, and the disaggregation paradigm. At the cluster level, we explore GPU cluster deployment, multi-instance load balancing, and cloud service solutions. For emerging scenarios, we organize the discussion around specific tasks, modules, and auxiliary methods. To ensure a holistic overview, we also highlight several niche yet critical areas. Finally, we outline potential research directions to further advance LLM inference serving.