🤖 AI Summary
This work addresses throughput optimization for small language model (SLM) inference in resource-constrained settings. Methodologically, it establishes, for the first time, that SLMs—owing to their small memory footprint—can approach theoretical peak throughput on a single accelerator, and it systematically demonstrates that model replication is a key strategy for improving hardware utilization and energy efficiency. Combining system-level performance–energy co-benchmarking, memory-bandwidth modeling, and throughput–latency trade-off analysis, the paper develops an interpretable optimization framework. Empirical evaluation shows that the proposed approach reaches Pareto-optimal throughput for SLM serving on a single GPU, improving hardware utilization by up to 3.2× and energy efficiency by up to 2.8×, offering a practical pathway for the efficient deployment of small language models.
📝 Abstract
Large language models (LLMs) have revolutionized the state of the art across many natural language processing tasks. Although serving LLMs is computationally and memory demanding, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who are now able to serve small models with cutting-edge performance. In this paper, we present a set of experiments designed to benchmark SLM inference in terms of both performance and energy. Our analysis provides a new perspective on serving, highlighting that the small memory footprint of SLMs makes Pareto-optimal throughput reachable within the resource capacity of a single accelerator. In this regard, we present an initial set of findings demonstrating how model replication can effectively improve resource utilization when serving SLMs.
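The abstract's core observation—that an SLM's small memory footprint leaves room for multiple replicas on a single accelerator—can be illustrated with a toy capacity model. All numbers (GPU memory, model size, KV-cache budget, per-replica throughput) and the contention-discount factor below are illustrative assumptions for the sketch, not measurements or methodology from the paper:

```python
# Toy single-GPU replication model: how many SLM replicas fit in memory,
# and what aggregate throughput they might yield. Illustrative only.
import math


def max_replicas(gpu_mem_gb: float, model_mem_gb: float, kv_cache_gb: float) -> int:
    """Replicas that fit in GPU memory, each with its own weights and KV-cache budget."""
    per_replica_gb = model_mem_gb + kv_cache_gb
    return math.floor(gpu_mem_gb / per_replica_gb)


def aggregate_throughput(replicas: int, per_replica_tps: float,
                         scaling_efficiency: float = 0.9) -> float:
    """Aggregate tokens/s; each extra replica is discounted by an assumed
    contention factor to model shared memory-bandwidth pressure."""
    return replicas * per_replica_tps * (scaling_efficiency ** max(replicas - 1, 0))


# Hypothetical example: a ~1.1B-parameter SLM in FP16 (~2.2 GB of weights)
# plus a 1.8 GB KV-cache budget per replica, on a 24 GB GPU.
n = max_replicas(gpu_mem_gb=24, model_mem_gb=2.2, kv_cache_gb=1.8)
print(n, aggregate_throughput(n, per_replica_tps=500.0))  # 6 replicas fit
```

The discounted scaling term is the interesting knob: with perfect scaling (`scaling_efficiency=1.0`) throughput grows linearly in the replica count, while values below 1 capture the memory-bandwidth contention that, per the paper's framing, determines where the throughput–latency trade-off actually peaks.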