🤖 AI Summary
This study systematically investigates the impact of deep learning runtime configurations—specifically engine–execution provider pairings (e.g., PyTorch vs. ONNX Runtime, CUDA vs. CPU)—on energy efficiency, latency, and resource utilization during inference for small language models (SLMs) specialized for code. Leveraging a standardized benchmarking framework, we evaluate 12 open-source code SLMs using RAPL-based power measurement, system-level performance counters, and a unified inference API. To our knowledge, this is the first empirical, cross-engine, cross-provider energy-efficiency comparison for code SLMs. Results show that PyTorch with CUDA delivers the best overall trade-off: it reduces energy consumption by 37.99%–89.16% versus all other configurations while achieving lower latency and higher GPU utilization. For CPU-only deployment, ONNX Runtime with CPU execution improves energy efficiency by 8.98%–72.04% over comparable CPU-based alternatives. The study provides reproducible, evidence-based guidelines for energy-aware SLM deployment in production environments.
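The summary mentions RAPL-based power measurement. The paper's own measurement code is not shown here, but as a hedged illustration, the core of such a measurement is typically a delta between two samples of a wrapping microjoule counter; the helper below is an assumed sketch (the function name and wraparound policy are ours, not the paper's framework):

```python
def rapl_energy_delta_j(start_uj: int, end_uj: int, max_range_uj: int) -> float:
    """Energy consumed between two RAPL counter samples, in joules.

    RAPL exposes a monotonically increasing microjoule counter (e.g.
    /sys/class/powercap/intel-rapl:0/energy_uj) that wraps around at
    max_energy_range_uj, so the delta must allow for at most one
    wraparound between the two reads.
    """
    if end_uj >= start_uj:
        delta_uj = end_uj - start_uj
    else:  # counter wrapped around once between the two reads
        delta_uj = (max_range_uj - start_uj) + end_uj
    return delta_uj / 1_000_000  # microjoules -> joules
```

Sampling the counter immediately before and after an inference call and taking this delta yields per-request energy, provided the gap between reads stays well under one counter wrap period.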
📝 Abstract
Background. The rapid growth of Language Models (LMs), particularly in code generation, requires substantial computational resources, raising concerns about energy consumption and environmental impact. Optimizing LM inference for energy efficiency is crucial, and Small Language Models (SLMs) offer a promising way to reduce resource demands. Aim. Our goal is to analyze the impact of deep learning runtime engines and execution providers on energy consumption, execution time, and computing-resource utilization from the point of view of software engineers conducting inference in the context of code SLMs. Method. We conducted a technology-oriented, multi-stage experimental pipeline using twelve code generation SLMs to investigate energy consumption, execution time, and computing-resource utilization across runtime engine and execution provider configurations. Results. Significant differences emerged across configurations. CUDA execution provider configurations outperformed CPU execution provider configurations in both energy consumption and execution time. Among all configurations, TORCH paired with CUDA demonstrated the greatest energy efficiency, achieving energy savings from 37.99% up to 89.16% compared to the other serving configurations. Similarly, optimized runtime engines such as ONNX paired with the CPU execution provider achieved from 8.98% up to 72.04% energy savings within CPU-based configurations. TORCH paired with CUDA also exhibited efficient computing-resource utilization. Conclusions. The choice of serving configuration significantly impacts energy efficiency. While further research is needed, we recommend the above configurations as best suited to software engineers' requirements for improving serving efficiency in terms of energy and performance.
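The abstract contrasts runtime engine and execution provider pairings (e.g., ONNX Runtime with CUDA vs. CPU). As a hedged sketch of what selecting an execution provider looks like in practice: the provider identifiers below are the standard ONNX Runtime names, but this helper and its fallback policy are illustrative only, not code from the paper's benchmarking framework.

```python
def select_providers(prefer_gpu: bool) -> list[str]:
    """Return an ordered execution-provider preference list for ONNX Runtime.

    ONNX Runtime tries providers left to right, so listing
    CPUExecutionProvider last provides a safe fallback when
    CUDA is unavailable on the host.
    """
    if prefer_gpu:
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]

# Typical use (requires the onnxruntime package and an exported model):
# import onnxruntime as ort
# session = ort.InferenceSession(
#     "model.onnx", providers=select_providers(prefer_gpu=True))
```

Swapping only this provider list, while holding the model and inputs fixed, is the kind of controlled configuration change whose energy and latency effects the study quantifies.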