🤖 AI Summary
Large language model (LLM) inference faces fundamental challenges including high computational cost, GPU memory bottlenecks, and the difficulty of simultaneously achieving low latency and high throughput.
Method: From an MLSys perspective, this survey builds a unified cross-stack view of LLM serving that spans the algorithmic and systems layers. It jointly covers algorithmic techniques—such as KV cache compression and speculative decoding—and systems mechanisms—including PagedAttention, vLLM-style scheduling, heterogeneous offloading, and quantization and pruning—across both the prefill and decoding phases.
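To make the PagedAttention idea concrete, here is a minimal toy sketch of block-based KV cache management (an illustration of the concept only, not vLLM's actual implementation; the class name, block size, and methods are invented for this example). The KV cache is split into fixed-size physical blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks, so a sequence's cache need not be contiguous in memory:

```python
# Toy sketch of PagedAttention-style KV cache paging (hypothetical API,
# not vLLM's). Fixed-size blocks + per-sequence block tables avoid the
# fragmentation of contiguous per-sequence KV allocations.

BLOCK_SIZE = 4  # tokens per KV block (illustrative; real systems differ)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))   # pool of physical blocks
        self.block_tables = {}                       # seq_id -> [physical ids]
        self.seq_lens = {}                           # seq_id -> tokens cached

    def append_token(self, seq_id):
        """Reserve KV cache space for one new token of `seq_id`."""
        n = self.seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:                      # current block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; must preempt/swap")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(6):                # sequence 0 generates 6 tokens -> 2 blocks
    cache.append_token(0)
print(cache.block_tables[0])      # physical blocks need not be contiguous
```

Because allocation happens one block at a time, memory waste is bounded by less than one block per sequence, which is what lets a scheduler pack many concurrent sequences into fixed GPU memory.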
Contribution/Results: The survey introduces a taxonomy for LLM serving that bridges the algorithmic and systems stacks, distills a reusable full-stack optimization roadmap, and synthesizes published evidence that these techniques reduce end-to-end latency, increase throughput, and relieve GPU memory pressure. It thereby offers both a conceptual foundation and practical design patterns for production-grade LLM inference systems.
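Among the algorithmic techniques surveyed, speculative decoding illustrates the latency/throughput trade-off well. The sketch below shows only the control flow, with deterministic stub "models" invented for this example (real systems use a small draft LLM plus a probabilistic accept/reject rule, and verify all draft positions in a single batched forward pass of the target model rather than one call per token):

```python
# Toy speculative decoding loop (control-flow illustration only; the
# draft_model/target_model stubs are hypothetical, not a real LLM API).
# A cheap draft model proposes k tokens; the target model verifies them
# and generation keeps the longest accepted prefix plus one target token.

def draft_model(prefix, k):
    # hypothetical cheap drafter: guesses the next k token ids
    return [(prefix[-1] + 1 + i) % 100 for i in range(k)]

def target_model(prefix):
    # hypothetical expensive model: the "ground-truth" next token id
    return (prefix[-1] + 1) % 100

def speculative_decode(prompt, num_tokens, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < num_tokens:
        proposal = draft_model(out, k)
        accepted = []
        for tok in proposal:          # real systems verify all k positions
            if target_model(out + accepted) == tok:   # in ONE target pass
                accepted.append(tok)
            else:
                break
        # each verification round always yields at least one target token
        accepted.append(target_model(out + accepted))
        out.extend(accepted)
    return out[:len(prompt) + num_tokens]
```

When the draft agrees with the target often, each expensive target pass emits several tokens instead of one, cutting decoding latency without changing the output distribution in the full probabilistic formulation.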
📝 Abstract
In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However, the computational intensity and memory consumption of deploying these models pose substantial challenges for serving efficiency, particularly in scenarios demanding low latency and high throughput. This survey addresses the imperative need for efficient LLM serving methodologies from a machine learning systems (MLSys) research perspective, which sits at the crux of advanced AI innovations and practical system optimizations. We provide an in-depth analysis covering a spectrum of solutions, from cutting-edge algorithmic modifications to groundbreaking changes in system design. The survey aims to provide a comprehensive understanding of the current state and future directions of efficient LLM serving, offering researchers and practitioners valuable insights for overcoming the barriers to effective LLM deployment, thereby reshaping the future of AI.