🤖 AI Summary
This study addresses the lack of systematic evaluation of energy consumption in speculative decoding strategies for large language models. It presents the first comprehensive quantification of fine-grained energy usage across diverse speculative decoding methods, examining their behavior under varying model scales and architectures, decoding strategies, and datasets. The work uncovers the interplay of model design, algorithmic choices, and data properties in shaping energy efficiency, identifying the key factors that determine when speculative decoding saves energy. These findings provide empirical grounding and actionable insights for optimizing large-model inference toward lower energy consumption.
📝 Abstract
Speculative decoding has emerged as an effective method for reducing the latency and inference cost of LLMs. However, inadequate attention has been paid to the energy requirements of these techniques. To address this gap, this paper presents a comprehensive survey of the energy requirements of speculative decoding strategies, with a detailed analysis of how various factors -- model size and family, speculative decoding strategy, and dataset characteristics -- influence energy efficiency.