🤖 AI Summary
The high computational and energy costs of attention mechanisms are an increasingly severe bottleneck in large language model training, yet existing efficient variants lack systematic empirical evaluation of their resource efficiency. Method: This paper presents the first end-to-end, energy-focused benchmarking study of eight self-attention variants (including Flash Attention, LSH Attention, and MLA) during GPT-2 training, measuring GPU memory footprint, training time, FLOPS, CPU utilization, and power consumption. Contribution/Results: Kernel-level optimizations (e.g., Flash Attention) yield substantial energy-efficiency gains; total energy consumption is jointly determined by training time and power draw, so neither a shorter training run nor a lower power draw alone guarantees lower energy use. Based on these findings, we propose "energy-aware attention design principles," offering reproducible, empirically grounded guidance for selecting attention mechanisms during model architecture design.
📝 Abstract
As large language models (LLMs) and vision-language models (VLMs) grow in scale and application, attention mechanisms have become a central computational bottleneck due to their high memory and time complexity. While many efficient attention variants have been proposed, there remains a lack of rigorous evaluation of their actual energy usage and hardware resource demands during training. In this work, we benchmark eight attention mechanisms while training the GPT-2 architecture, measuring key metrics including training time, GPU memory usage, FLOPS, CPU usage, and power consumption. Our results reveal that attention mechanisms with optimized kernel implementations, including Flash Attention, Locality-Sensitive Hashing (LSH) Attention, and Multi-Head Latent Attention (MLA), achieve the best energy efficiency. We further show that lower GPU power alone does not guarantee reduced energy use, as training time plays an equally important role. Our study highlights the importance of energy-aware benchmarking in attention design and provides practical insights for selecting resource-efficient mechanisms. All our code is available on GitHub.
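The claim that lower GPU power alone does not guarantee reduced energy use follows directly from total energy being power integrated over time. A minimal sketch of this trade-off, using hypothetical numbers rather than the paper's measurements:

```python
# Minimal sketch: total training energy is average power draw times training
# time, so comparing attention variants requires both measurements.
# All numbers below are hypothetical, for illustration only.

def total_energy_wh(avg_power_w: float, train_time_h: float) -> float:
    """Energy (watt-hours) = average power (W) * training time (h)."""
    return avg_power_w * train_time_h

# Hypothetical variant A: lower power draw, but slower training.
energy_a = total_energy_wh(avg_power_w=250.0, train_time_h=10.0)

# Hypothetical variant B: higher power draw, but much faster training.
energy_b = total_energy_wh(avg_power_w=300.0, train_time_h=6.0)

# Despite drawing less power, variant A consumes more total energy
# (2500 Wh vs. 1800 Wh), so power alone is a misleading metric.
assert energy_a > energy_b
```

This is why the benchmark reports training time and power consumption jointly: either quantity in isolation can rank the same pair of attention variants in the opposite order.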