Attentions Under the Microscope: A Comparative Study of Resource Utilization for Variants of Self-Attention

📅 2025-07-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
The high computational and energy cost of attention mechanisms is increasingly a bottleneck in large language model training, yet existing efficient variants lack a systematic, empirical evaluation of their resource efficiency. Method: This paper presents the first end-to-end, energy-focused benchmarking study of eight self-attention variants (including Flash Attention, LSH Attention, and MLA) during GPT-2 training, measuring GPU memory footprint, training time, FLOPS, CPU utilization, and power consumption. Contribution/Results: Kernel-level optimizations (e.g., Flash Attention) yield substantial energy-efficiency gains, and total energy consumption is determined jointly by training time and power draw, so a shorter training run alone does not guarantee lower energy use. Based on these findings, the paper proposes "energy-aware attention design principles," offering reproducible, empirically grounded guidance for selecting attention mechanisms during architecture design.
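
A key takeaway of the summary is that total energy is driven jointly by training time and average power draw. The sketch below is a minimal illustration of that relationship, not the paper's benchmarking code: it assumes an NVIDIA GPU, the pynvml bindings, and a hypothetical train_step() standing in for one GPT-2 training step with the attention variant under test.

```python
# Minimal sketch (not the authors' code): estimate training energy by sampling
# GPU power draw with NVML and integrating over wall-clock time, illustrating
# that energy ~= average power x training time.
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def train_step():
    # Hypothetical placeholder for one GPT-2 training step with the
    # attention variant under test; substitute a real training loop.
    time.sleep(0.1)

power_samples_w = []  # instantaneous GPU power readings, in watts
start = time.time()
for _ in range(100):  # e.g., 100 training steps
    train_step()
    power_samples_w.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # mW -> W
elapsed_s = time.time() - start

avg_power_w = sum(power_samples_w) / len(power_samples_w)
energy_wh = avg_power_w * elapsed_s / 3600.0  # energy = power x time, in watt-hours
print(f"time: {elapsed_s:.1f} s  avg power: {avg_power_w:.1f} W  energy: {energy_wh:.4f} Wh")

pynvml.nvmlShutdown()
```

Under this reading, a variant that shortens training but raises sustained power draw (or vice versa) can still end up with higher total energy, which matches the summary's caution that reduced training duration alone does not guarantee lower energy use.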

📝 Abstract
As large language models (LLMs) and visual language models (VLMs) grow in scale and application, attention mechanisms have become a central computational bottleneck due to their high memory and time complexity. While many efficient attention variants have been proposed, there remains a lack of rigorous evaluation of their actual energy usage and hardware resource demands during training. In this work, we benchmark eight attention mechanisms while training a GPT-2 architecture, measuring key metrics including training time, GPU memory usage, FLOPS, CPU usage, and power consumption. Our results reveal that attention mechanisms with optimized kernel implementations, including Flash Attention, Locality-Sensitive Hashing (LSH) Attention, and Multi-Head Latent Attention (MLA), achieve the best energy efficiency. We further show that lower GPU power alone does not guarantee reduced energy use, as training time plays an equally important role. Our study highlights the importance of energy-aware benchmarking in attention design and provides practical insight for selecting resource-efficient mechanisms. All our code is available on GitHub.
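
The abstract attributes the best energy efficiency to variants with optimized kernel implementations such as Flash Attention. As an illustrative sketch (not the paper's benchmark harness, and assuming PyTorch 2.0 or later), the snippet below contrasts a naive attention that materializes the full sequence-length-squared score matrix with PyTorch's fused scaled_dot_product_attention, which can dispatch to a FlashAttention-style kernel on supported hardware.

```python
# Minimal sketch (assumes PyTorch >= 2.0): naive attention versus the fused
# scaled_dot_product_attention. The fused path avoids materializing the full
# (seq_len x seq_len) attention matrix, which is the main source of the
# memory and energy savings attributed to kernel-level optimizations.
import math
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Explicitly builds the attention matrix: O(seq_len^2) memory traffic.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

def fused_attention(q, k, v):
    # Single fused call; PyTorch selects an efficient backend when available.
    return F.scaled_dot_product_attention(q, k, v)

q = k = v = torch.randn(1, 12, 1024, 64)  # (batch, heads, seq_len, head_dim)
diff = (naive_attention(q, k, v) - fused_attention(q, k, v)).abs().max()
print(f"max |naive - fused| = {diff.item():.2e}")  # should be near numerical precision
```
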
Problem

Research questions and friction points this paper is trying to address.

Evaluating energy usage of attention mechanisms in LLMs/VLMs
Comparing resource demands of eight attention variants
Identifying energy-efficient attention designs for training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking eight attention mechanisms for efficiency
Optimized kernel implementations enhance energy efficiency
Energy-aware benchmarking guides attention mechanism selection
Zhengyu Tian
Department of Computer Science, Boston University Metropolitan College
Anantha Padmanaban Krishna Kumar
Department of Computer Science, Boston University Metropolitan College
Hemant Krishnakumar
Department of Computer Science, Boston University Metropolitan College
Reza Rawassizadeh
Associate Professor, Boston University
Digital Health · On-device AI · AI Democratization · Ubiquitous Computing