🤖 AI Summary
This work addresses throughput optimization for small language model (SLM) inference in resource-constrained settings. Methodologically, it establishes, for the first time, that SLMs—owing to their small memory footprint—can approach theoretical peak throughput on a single accelerator, and it systematically demonstrates that model replication is a key strategy for improving hardware utilization and energy efficiency. Combining system-level performance–energy co-benchmarking, memory-bandwidth modeling, and throughput–latency trade-off analysis, the paper develops an interpretable optimization framework. Empirical evaluation shows that the proposed approach reaches Pareto-optimal throughput for SLM serving on a single GPU, improving hardware utilization by up to 3.2× and energy efficiency by up to 2.8×, offering a practical pathway for the efficient deployment of small language models.
📝 Abstract
Large language models (LLMs) have revolutionized the state of the art across many natural language processing tasks. Although serving LLMs is computationally and memory demanding, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who are now able to serve small models with cutting-edge performance. In this paper, we present a set of experiments designed to benchmark SLM inference in terms of both performance and energy. Our analysis provides a new perspective on serving, highlighting that the small memory footprint of SLMs makes Pareto-optimal throughput reachable within the resource capacity of a single accelerator. In this regard, we present an initial set of findings demonstrating how model replication can effectively improve resource utilization when serving SLMs.
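The abstract's core observation—that an SLM's small memory footprint leaves room for multiple replicas on a single accelerator—can be illustrated with a toy capacity model. All numbers (GPU memory, model size, KV-cache budget, per-replica throughput) and the contention-discount factor below are illustrative assumptions for the sketch, not measurements or methodology from the paper:

```python
# Toy single-GPU replication model: how many SLM replicas fit in memory,
# and what aggregate throughput they might yield. Illustrative only.
import math


def max_replicas(gpu_mem_gb: float, model_mem_gb: float, kv_cache_gb: float) -> int:
    """Replicas that fit in GPU memory, each with its own weights and KV-cache budget."""
    per_replica_gb = model_mem_gb + kv_cache_gb
    return math.floor(gpu_mem_gb / per_replica_gb)


def aggregate_throughput(replicas: int, per_replica_tps: float,
                         scaling_efficiency: float = 0.9) -> float:
    """Aggregate tokens/s; each extra replica is discounted by an assumed
    contention factor to model shared memory-bandwidth pressure."""
    return replicas * per_replica_tps * (scaling_efficiency ** max(replicas - 1, 0))


# Hypothetical example: a ~1.1B-parameter SLM in FP16 (~2.2 GB of weights)
# plus a 1.8 GB KV-cache budget per replica, on a 24 GB GPU.
n = max_replicas(gpu_mem_gb=24, model_mem_gb=2.2, kv_cache_gb=1.8)
print(n, aggregate_throughput(n, per_replica_tps=500.0))  # 6 replicas fit
```

The discounted scaling term is the interesting knob: with perfect scaling (`scaling_efficiency=1.0`) throughput grows linearly in the replica count, while values below 1 capture the memory-bandwidth contention that, per the paper's framing, determines where the throughput–latency trade-off actually peaks.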