Reducing Compute Waste in LLMs through Kernel-Level DVFS

📅 2026-01-13
🤖 AI Summary
This work addresses the significant energy inefficiency in large language model (LLM) training and inference, where substantial power is wasted without proportional performance gains. To improve energy efficiency while preserving model performance, the authors propose a fine-grained, kernel-level dynamic voltage and frequency scaling (DVFS) approach. Unlike prior methods that apply DVFS at the layer or iteration level, this technique tailors voltage and frequency settings dynamically to individual computational kernels, leveraging GPU performance modeling and parallelism analysis to identify optimal configurations. Evaluated on GPT-3 training tasks, the method achieves up to 14.6% energy savings with only a 0.6% performance overhead. Furthermore, it demonstrates strong generalization across diverse data and tensor parallelism setups, markedly enhancing overall energy efficiency.

📝 Abstract
The rapid growth of AI has fueled the expansion of accelerator- or GPU-based data centers. However, the rising operational energy consumption has emerged as a critical bottleneck and a major sustainability concern. Dynamic Voltage and Frequency Scaling (DVFS) is a well-known technique for reducing energy consumption, and thus improving energy efficiency, since it requires little effort and works with existing hardware. Reducing the energy consumption of training and inference of Large Language Models (LLMs) through DVFS or power capping is feasible: related work has shown that energy savings can be significant, but at the cost of significant slowdowns. In this work, we focus on reducing waste in LLM operations, i.e., reducing energy consumption without losing performance. We propose a fine-grained, kernel-level DVFS approach that explores new frequency configurations, and show that these save more energy than previous, pass- or iteration-level solutions. For example, for a GPT-3 training run, a pass-level approach could reduce energy consumption by 2% (without losing performance), while our kernel-level approach saves as much as 14.6% (with a 0.6% slowdown). We further investigate the effect of data and tensor parallelism, and show our discovered clock frequencies translate well to both. We conclude that kernel-level DVFS is a suitable technique to reduce waste in LLM operations, providing significant energy savings with negligible slowdown.
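To make the kernel-level idea concrete, here is a minimal, hypothetical sketch of how a per-kernel frequency could be chosen. It is not the paper's actual method: it assumes a simple roofline-style timing model (compute-bound time scales with clock, memory-bound time does not) and a cubic dynamic-power model, then picks the lowest-energy clock within a small slowdown budget. All function names, models, and parameters here are illustrative assumptions.

```python
# Illustrative kernel-level DVFS frequency selection (not the paper's model).
# Assumptions: runtime = max(compute part scaled by clock, memory part),
# dynamic power ~ f * V^2 with V ~ f, hence roughly cubic in frequency.

def kernel_runtime(f, f_max, t_compute, t_memory):
    """Runtime of one kernel at clock f (compute part slows down, memory part doesn't)."""
    return max(t_compute * f_max / f, t_memory)

def kernel_energy(f, f_max, t_compute, t_memory, p_max):
    """Energy of one kernel at clock f under a cubic dynamic-power model."""
    power = p_max * (f / f_max) ** 3
    return power * kernel_runtime(f, f_max, t_compute, t_memory)

def best_frequency(f_levels, f_max, t_compute, t_memory, p_max, max_slowdown=1.01):
    """Pick the clock level minimizing energy within a slowdown budget vs. f_max."""
    t_base = kernel_runtime(f_max, f_max, t_compute, t_memory)
    candidates = [
        f for f in f_levels
        if kernel_runtime(f, f_max, t_compute, t_memory) <= max_slowdown * t_base
    ]
    return min(candidates, key=lambda f: kernel_energy(f, f_max, t_compute, t_memory, p_max))
```

Under this toy model, a memory-bound kernel (small `t_compute`) tolerates a much lower clock at essentially no runtime cost, while a compute-bound kernel must stay near `f_max`; this is the intuition behind why per-kernel settings can beat a single pass- or iteration-level frequency.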
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Energy Efficiency
Compute Waste
Dynamic Voltage and Frequency Scaling
LLM Operations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Kernel-Level DVFS
Energy Efficiency
Large Language Models
Compute Waste Reduction
Fine-Grained Frequency Scaling