🤖 AI Summary
Energy efficiency remains a critical bottleneck in large language model (LLM) inference, particularly under dynamic voltage and frequency scaling (DVFS) policies that introduce complex trade-offs among accuracy, latency, and power consumption.
Method: We design an end-to-end benchmark framework deployed on real hardware, covering diverse tasks—including text generation, question answering, and summarization—and systematically evaluate performance across multiple LLMs (Falcon-7B to T5-3B), input sequence lengths, and DVFS configurations. Leveraging fine-grained power monitoring, statistical modeling, and multivariate regression, we conduct the first unified energy-efficiency–performance attribution analysis spanning models, tasks, and hardware settings.
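The fine-grained power monitoring described above typically reduces to integrating sampled device power over an inference window and normalizing by output tokens. A minimal sketch of that computation is below; the function names and the sample power trace are illustrative assumptions, not the paper's actual tooling.

```python
# Hypothetical sketch: attribute energy to an inference run by integrating
# sampled GPU power (trapezoidal rule) and normalizing by generated tokens.
# The power trace below is made up for illustration.

def energy_joules(samples):
    """Integrate (timestamp_s, power_w) samples via the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total

def energy_per_token(samples, tokens_generated):
    """Joules per generated token for one inference window."""
    return energy_joules(samples) / tokens_generated

# Illustrative trace: power sampled every 0.5 s during a 1.5 s generation.
power_trace = [(0.0, 180.0), (0.5, 210.0), (1.0, 205.0), (1.5, 190.0)]
total_j = energy_joules(power_trace)          # 300.0 J
j_per_tok = energy_per_token(power_trace, 120)  # 2.5 J/token
```

In practice the samples would come from a hardware counter interface (e.g. NVML's power readings on NVIDIA GPUs), and the per-token figures would feed the regression analysis across models, tasks, and DVFS settings.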
Contribution/Results: Our analysis uncovers nonlinear, hardware-level trade-off patterns induced by DVFS and identifies the parameters to which energy efficiency is most sensitive. Experiments demonstrate that an optimal DVFS configuration reduces energy consumption by 32% while preserving >99.2% of the original accuracy and ≥95% of the original throughput, providing reproducible, quantitative guidance for green LLM deployment.
📝 Abstract
Large language models (LLMs) have shown significant improvements in many natural language processing (NLP) tasks, accelerating their rapid adoption across many industries. These models are resource-intensive, requiring extensive computational resources during both training and inference, which leads to increased energy consumption and negative environmental impact. As their adoption accelerates, the sustainability of LLMs has become a critical issue, necessitating strategies to optimize their runtime efficiency without compromising performance. It is therefore imperative to identify the parameters that significantly influence the performance and energy efficiency of LLMs. To that end, in this work we investigate the effect of important parameters on the performance and energy efficiency of LLMs during inference and examine their trade-offs. First, we analyze how models with varying parameter counts and architectures perform on tasks such as text generation, question answering, and summarization by benchmarking LLMs including Falcon-7B, Mistral-7B-v0.1, T5-3B, GPT-2, GPT-J-6B, and GPT-Neo-2.7B. Second, we study how input and output sequence characteristics, such as sequence length, affect energy consumption, performance, and throughput. Finally, we explore the impact of hardware-based power-saving techniques, i.e., Dynamic Voltage and Frequency Scaling (DVFS), on the models' latency and energy efficiency. Our extensive benchmarking and statistical analysis reveal many interesting findings, uncovering how specific optimizations can reduce energy consumption while maintaining throughput and accuracy. This study provides actionable insights for researchers and practitioners designing energy-efficient LLM inference systems.
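The DVFS trade-off the abstract describes amounts to a constrained selection problem: among benchmarked clock settings, keep only those meeting a throughput floor and pick the one with the lowest energy. The sketch below illustrates that selection; the candidate clock/energy/throughput numbers are invented for illustration and would come from benchmarking each frequency cap on real hardware.

```python
# Minimal sketch of DVFS configuration selection under a throughput
# constraint (e.g. >=95% of baseline). All numbers are hypothetical.

def pick_dvfs_config(configs, baseline_throughput, min_throughput_frac=0.95):
    """configs: list of (graphics_clock_mhz, energy_j_per_req, throughput_tok_s).
    Returns the feasible config with the lowest energy, or None."""
    feasible = [c for c in configs
                if c[2] >= min_throughput_frac * baseline_throughput]
    return min(feasible, key=lambda c: c[1]) if feasible else None

candidates = [
    (1980, 310.0, 100.0),  # default clocks: defines baseline throughput
    (1600, 240.0, 97.0),
    (1350, 210.0, 95.5),   # lowest energy that still meets the floor
    (1100, 190.0, 88.0),   # cheapest overall, but violates the floor
]
best = pick_dvfs_config(candidates, baseline_throughput=100.0)
# best -> (1350, 210.0, 95.5)
```

On NVIDIA GPUs the chosen clock cap would then be applied through the driver (e.g. via NVML or `nvidia-smi`); the selection logic itself is hardware-agnostic.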