🤖 AI Summary
This study addresses the energy-efficiency bottleneck in deploying small language models (SLMs) on edge devices. We systematically evaluate the inference energy consumption and performance of Llama 3.2, Phi-3 Mini, TinyLlama, and Gemma 2 on the Raspberry Pi 5, Jetson Nano, and Jetson Orin Nano. Using standardized energy-efficiency benchmarks that span hardware configurations (CPU and GPU) and models, we quantify, for the first time, how GPU acceleration, memory bandwidth, and model architecture affect power draw. Results show that the Jetson Orin Nano achieves the best energy efficiency when GPU acceleration is enabled; Llama 3.2 delivers the best trade-off between accuracy and power consumption; and TinyLlama is the most suitable for ultra-low-power scenarios. We propose a hardware–model co-optimization pathway and establish a reproducible energy-efficiency evaluation framework, with empirically grounded design guidelines for resource-constrained edge AI deployments.
📝 Abstract
Cloud-based large language models (LLMs) and their variants have significantly influenced real-world applications. Deploying smaller models, i.e., small language models (SLMs), on edge devices offers additional advantages, such as reduced latency and independence from network connectivity. However, the limited computing resources and constrained energy budgets of edge devices make efficient deployment challenging. This study evaluates the power efficiency of representative SLMs (Llama 3.2, Phi-3 Mini, TinyLlama, and Gemma 2) on Raspberry Pi 5, Jetson Nano, and Jetson Orin Nano in CPU and GPU configurations. Results show that Jetson Orin Nano with GPU acceleration delivers the most performance per unit of energy, significantly outperforming CPU-based setups. Llama 3.2 provides the best balance of accuracy and power efficiency, while TinyLlama is well suited to low-power environments at the cost of reduced accuracy. In contrast, Phi-3 Mini consumes the most energy despite its high accuracy. In addition, GPU acceleration, memory bandwidth, and model architecture are key factors in optimizing inference energy efficiency. Our empirical analysis offers practical insights for AI, smart systems, and mobile ad-hoc platforms that must balance trade-offs among accuracy, inference latency, and power efficiency in energy-constrained environments.
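To make the notion of energy efficiency concrete, the following is a minimal sketch of one common way to compare configurations: integrate sampled power over an inference run and divide by the number of generated tokens. The abstract does not state the exact metric or measurement procedure used in the study, so the joules-per-token definition, the `InferenceRun` type, and all numbers below are assumptions for illustration only, not measured results.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InferenceRun:
    """One measured inference run (hypothetical structure, not from the study)."""
    label: str
    power_samples_w: List[float]  # power readings in watts, taken at a fixed interval
    sample_interval_s: float      # seconds between consecutive readings
    tokens_generated: int         # tokens produced during the run


def energy_joules(run: InferenceRun) -> float:
    """Approximate energy by treating each power sample as constant for one interval."""
    return sum(run.power_samples_w) * run.sample_interval_s


def joules_per_token(run: InferenceRun) -> float:
    """Energy cost of one generated token; lower is better."""
    if run.tokens_generated <= 0:
        raise ValueError("tokens_generated must be positive")
    return energy_joules(run) / run.tokens_generated


def tokens_per_joule(run: InferenceRun) -> float:
    """Performance per unit of energy; higher is better."""
    return 1.0 / joules_per_token(run)


# Placeholder numbers chosen for easy arithmetic; they are NOT measurements.
run_a = InferenceRun("config_a", [5.0, 5.0, 5.0, 5.0], 0.5, 20)
run_b = InferenceRun("config_b", [10.0, 10.0], 0.5, 40)

if __name__ == "__main__":
    for run in (run_a, run_b):
        print(f"{run.label}: {energy_joules(run):.1f} J total, "
              f"{joules_per_token(run):.2f} J/token, "
              f"{tokens_per_joule(run):.1f} tokens/J")
```

In this toy comparison, `config_b` draws twice the power of `config_a` but finishes sooner and produces more tokens, so its energy per token is lower. This is the kind of effect that can let a GPU-accelerated setup be more energy-efficient than a CPU-only one despite a higher instantaneous power draw.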