Towards Pareto Optimal Throughput in Small Language Model Serving

📅 2024-04-04
🏛️ EuroMLSys@EuroSys
📈 Citations: 5
Influential: 0
🤖 AI Summary
This work studies throughput optimization for small language model (SLM) inference in resource-constrained settings. It establishes that SLMs, owing to their small memory footprint, can approach theoretical-peak throughput on a single accelerator, and it demonstrates that model replication is a key strategy for improving hardware utilization and energy efficiency. Combining system-level performance–energy co-benchmarking, memory-bandwidth modeling, and throughput–latency trade-off analysis, the paper develops an interpretable optimization framework. Empirical evaluation shows that the proposed approach achieves Pareto-optimal throughput for SLM serving on a single GPU, improving hardware utilization by up to 3.2× and energy efficiency by up to 2.8×, offering a practical pathway for efficient SLM deployment.
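The memory-bandwidth modeling mentioned in the summary can be illustrated with a roofline-style back-of-envelope bound: if autoregressive decoding is memory-bandwidth bound, each generated token must stream the full set of model weights, so tokens/s per sequence is capped by bandwidth divided by model size. This is a minimal sketch of that reasoning, not the paper's model; all hardware and model figures below are hypothetical.

```python
# Hedged sketch: a roofline-style upper bound on autoregressive decode
# throughput, assuming decoding is memory-bandwidth bound (each token
# generated streams all model weights from memory once).

def decode_throughput_bound(bandwidth_gb_s: float,
                            n_params_b: float,
                            bytes_per_param: int = 2) -> float:
    """Upper bound on single-stream tokens/s for one model replica."""
    model_bytes = n_params_b * 1e9 * bytes_per_param  # fp16 by default
    return bandwidth_gb_s * 1e9 / model_bytes

# Example: a 1.1B-parameter SLM in fp16 on an accelerator with
# ~2 TB/s of HBM bandwidth (both figures are assumptions).
bound = decode_throughput_bound(2000, 1.1)
print(f"~{bound:.0f} tokens/s per sequence (bandwidth-bound ceiling)")
```

The small weight footprint of an SLM is exactly why this ceiling is high relative to LLMs, and why batching or replication is needed to keep the compute units busy.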

📝 Abstract
Large language models (LLMs) have revolutionized the state of the art in many natural language processing tasks. Although serving LLMs is computationally and memory demanding, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who are now able to serve small models with cutting-edge performance. In this paper, we present a set of experiments designed to benchmark SLM inference in terms of performance and energy. Our analysis provides a new perspective on serving, highlighting that the small memory footprint of SLMs allows reaching the Pareto-optimal throughput within the resource capacity of a single accelerator. In this regard, we present an initial set of findings demonstrating how model replication can effectively improve resource utilization for serving SLMs.
Problem

Research questions and friction points this paper is trying to address.

How do SLMs behave, in both throughput and energy terms, when served on a single accelerator?
Can SLM serving reach the Pareto-optimal throughput within a single accelerator's resource capacity?
How can the spare capacity left by an SLM's small memory footprint be used to improve resource utilization?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint performance and energy benchmarking of SLM inference
Evidence that Pareto-optimal throughput is reachable on a single accelerator
Model replication as a strategy for improving resource utilization
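The replication contribution above reduces to a capacity calculation: how many independent replicas of an SLM (weights plus a per-replica KV-cache budget) fit into one accelerator's memory, with aggregate throughput scaling roughly linearly while serving stays bandwidth-bound. This is a hedged sketch of that idea; all memory and throughput figures are assumed, not taken from the paper.

```python
# Hedged sketch of the model-replication idea: pack as many independent
# SLM replicas as fit into one accelerator's memory; aggregate throughput
# scales with replica count until compute (or bandwidth) saturates.

def max_replicas(gpu_mem_gb: float, weights_gb: float,
                 kv_budget_gb: float) -> int:
    """Number of replicas (weights + a KV-cache budget each) that fit."""
    return int(gpu_mem_gb // (weights_gb + kv_budget_gb))

def aggregate_throughput(per_replica_tps: float, replicas: int) -> float:
    """Idealized linear scaling; an upper bound, not a measured result."""
    return per_replica_tps * replicas

# Example: an 80 GB GPU, a 1.1B fp16 model (~2.2 GB of weights), and a
# 6 GB KV-cache budget per replica (all hypothetical values).
n = max_replicas(80, 2.2, 6.0)
print(n, aggregate_throughput(100.0, n))  # prints: 9 900.0
```

The KV-cache budget is the key tunable here: a larger budget per replica supports bigger batches per replica but fits fewer replicas, which is one face of the throughput–latency trade-off the paper analyzes.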