🤖 AI Summary
Large language model (LLM) inference faces fundamental challenges including high computational cost, GPU memory bottlenecks, and the difficulty of simultaneously achieving low latency and high throughput.
Method: From an MLSys perspective, this survey builds a unified cross-stack view of LLM serving that spans the algorithmic and systems layers. It jointly covers algorithmic techniques—such as KV cache compression and speculative decoding—and systems mechanisms—including PagedAttention, vLLM-style scheduling, heterogeneous offloading, and quantization and pruning—across both the prefill and decoding phases.
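To make the PagedAttention idea concrete, here is a minimal toy sketch of block-based KV cache management (an illustration of the concept only, not vLLM's actual implementation; the class name, block size, and methods are invented for this example). The KV cache is split into fixed-size physical blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks, so a sequence's cache need not be contiguous in memory:

```python
# Toy sketch of PagedAttention-style KV cache paging (hypothetical API,
# not vLLM's). Fixed-size blocks + per-sequence block tables avoid the
# fragmentation of contiguous per-sequence KV allocations.

BLOCK_SIZE = 4  # tokens per KV block (illustrative; real systems differ)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))   # pool of physical blocks
        self.block_tables = {}                       # seq_id -> [physical ids]
        self.seq_lens = {}                           # seq_id -> tokens cached

    def append_token(self, seq_id):
        """Reserve KV cache space for one new token of `seq_id`."""
        n = self.seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:                      # current block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; must preempt/swap")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(6):                # sequence 0 generates 6 tokens -> 2 blocks
    cache.append_token(0)
print(cache.block_tables[0])      # physical blocks need not be contiguous
```

Because allocation happens one block at a time, memory waste is bounded by less than one block per sequence, which is what lets a scheduler pack many concurrent sequences into fixed GPU memory.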
Contribution/Results: The survey introduces a taxonomy for LLM serving that bridges the algorithmic and systems stacks, distills a reusable full-stack optimization roadmap, and synthesizes published evidence that these techniques reduce end-to-end latency, increase throughput, and relieve GPU memory pressure. It thereby offers both a conceptual foundation and practical design patterns for production-grade LLM inference systems.
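Among the algorithmic techniques surveyed, speculative decoding illustrates the latency/throughput trade-off well. The sketch below shows only the control flow, with deterministic stub "models" invented for this example (real systems use a small draft LLM plus a probabilistic accept/reject rule, and verify all draft positions in a single batched forward pass of the target model rather than one call per token):

```python
# Toy speculative decoding loop (control-flow illustration only; the
# draft_model/target_model stubs are hypothetical, not a real LLM API).
# A cheap draft model proposes k tokens; the target model verifies them
# and generation keeps the longest accepted prefix plus one target token.

def draft_model(prefix, k):
    # hypothetical cheap drafter: guesses the next k token ids
    return [(prefix[-1] + 1 + i) % 100 for i in range(k)]

def target_model(prefix):
    # hypothetical expensive model: the "ground-truth" next token id
    return (prefix[-1] + 1) % 100

def speculative_decode(prompt, num_tokens, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < num_tokens:
        proposal = draft_model(out, k)
        accepted = []
        for tok in proposal:          # real systems verify all k positions
            if target_model(out + accepted) == tok:   # in ONE target pass
                accepted.append(tok)
            else:
                break
        # each verification round always yields at least one target token
        accepted.append(target_model(out + accepted))
        out.extend(accepted)
    return out[:len(prompt) + num_tokens]
```

When the draft agrees with the target often, each expensive target pass emits several tokens instead of one, cutting decoding latency without changing the output distribution in the full probabilistic formulation.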
📝 Abstract
In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However, the computational intensity and memory consumption of deploying these models pose substantial challenges for serving efficiency, particularly in scenarios demanding low latency and high throughput. This survey addresses the imperative need for efficient LLM serving methodologies from a machine learning systems (MLSys) research perspective, which sits at the crux of advanced AI innovations and practical system optimizations. We provide an in-depth analysis covering a spectrum of solutions, from cutting-edge algorithmic modifications to groundbreaking changes in system design. The survey aims to provide a comprehensive understanding of the current state and future directions of efficient LLM serving, offering researchers and practitioners valuable insights for overcoming the barriers to effective LLM deployment, thereby reshaping the future of AI.