🤖 AI Summary
This study systematically investigates the impact of deep learning runtime configurations—specifically engine–execution provider pairings (e.g., PyTorch vs. ONNX Runtime, CUDA vs. CPU)—on energy efficiency, latency, and resource utilization during inference for small language models (SLMs) specialized for code. Leveraging a standardized benchmarking framework, we evaluate 12 open-source code SLMs using RAPL-based power measurement, system-level performance counters, and a unified inference API. To our knowledge, this is the first empirical, cross-engine, cross-provider energy-efficiency comparison for code SLMs. Results show that PyTorch with CUDA delivers the best overall trade-off: it reduces energy consumption by 37.99%–89.16% versus all other configurations while achieving lower latency and higher GPU utilization. For CPU-only deployment, ONNX Runtime with CPU execution improves energy efficiency by 8.98%–72.04% over comparable CPU-based alternatives. The study provides reproducible, evidence-based guidelines for energy-aware SLM deployment in production environments.
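The summary mentions RAPL-based power measurement. The paper's own measurement code is not shown here, but as a hedged illustration, the core of such a measurement is typically a delta between two samples of a wrapping microjoule counter; the helper below is an assumed sketch (the function name and wraparound policy are ours, not the paper's framework):

```python
def rapl_energy_delta_j(start_uj: int, end_uj: int, max_range_uj: int) -> float:
    """Energy consumed between two RAPL counter samples, in joules.

    RAPL exposes a monotonically increasing microjoule counter (e.g.
    /sys/class/powercap/intel-rapl:0/energy_uj) that wraps around at
    max_energy_range_uj, so the delta must allow for at most one
    wraparound between the two reads.
    """
    if end_uj >= start_uj:
        delta_uj = end_uj - start_uj
    else:  # counter wrapped around once between the two reads
        delta_uj = (max_range_uj - start_uj) + end_uj
    return delta_uj / 1_000_000  # microjoules -> joules
```

Sampling the counter immediately before and after an inference call and taking this delta yields per-request energy, provided the gap between reads stays well under one counter wrap period.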
📝 Abstract
Background. The rapid growth of Language Models (LMs), particularly in code generation, requires substantial computational resources, raising concerns about energy consumption and environmental impact. Optimizing LM inference for energy efficiency is crucial, and Small Language Models (SLMs) offer a promising way to reduce resource demands. Aim. Our goal is to analyze the impact of deep learning runtime engines and execution providers on energy consumption, execution time, and computing-resource utilization from the point of view of software engineers conducting inference in the context of code SLMs. Method. We conducted a technology-oriented, multi-stage experimental pipeline using twelve code generation SLMs to investigate energy consumption, execution time, and computing-resource utilization across runtime engine and execution provider configurations. Results. Significant differences emerged across configurations. CUDA execution provider configurations outperformed CPU execution provider configurations in both energy consumption and execution time. Among all configurations, TORCH paired with CUDA demonstrated the greatest energy efficiency, achieving energy savings from 37.99% up to 89.16% compared to the other serving configurations. Similarly, optimized runtime engines such as ONNX paired with the CPU execution provider achieved from 8.98% up to 72.04% energy savings within CPU-based configurations. TORCH paired with CUDA also exhibited efficient computing-resource utilization. Conclusions. The choice of serving configuration significantly impacts energy efficiency. While further research is needed, we recommend the above configurations as best suited to software engineers' requirements for improving serving efficiency in terms of energy and performance.
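The abstract contrasts runtime engine and execution provider pairings (e.g., ONNX Runtime with CUDA vs. CPU). As a hedged sketch of what selecting an execution provider looks like in practice: the provider identifiers below are the standard ONNX Runtime names, but this helper and its fallback policy are illustrative only, not code from the paper's benchmarking framework.

```python
def select_providers(prefer_gpu: bool) -> list[str]:
    """Return an ordered execution-provider preference list for ONNX Runtime.

    ONNX Runtime tries providers left to right, so listing
    CPUExecutionProvider last provides a safe fallback when
    CUDA is unavailable on the host.
    """
    if prefer_gpu:
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]

# Typical use (requires the onnxruntime package and an exported model):
# import onnxruntime as ort
# session = ort.InferenceSession(
#     "model.onnx", providers=select_providers(prefer_gpu=True))
```

Swapping only this provider list, while holding the model and inputs fixed, is the kind of controlled configuration change whose energy and latency effects the study quantifies.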