LAPA: Log-Domain Prediction-Driven Dynamic Sparsity Accelerator for Transformer Model

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Transformers achieve state-of-the-art performance in NLP and CV, yet dynamic computational bottlenecks arise from variable input sequence lengths, challenging existing single-stage sparsity methods to simultaneously ensure energy efficiency and cross-stage adaptability. To address this, the paper proposes a holistic algorithm–architecture co-design for cross-stage dynamic sparsity acceleration. The method introduces a logarithmic-domain attention prediction mechanism to drastically reduce prediction overhead; it further integrates asymmetric leading-one computing (ALOC) to eliminate multiplications, mixed-precision multi-round shifting accumulation (MRSA), and a data-feature dependent filter (DDF) to enable low-power approximate computation. Built upon this methodology, the authors implement a customized sparse accelerator that preserves model accuracy while delivering 3.52×, 3.24×, and 2.79× higher energy efficiency than the state-of-the-art designs SpAtten, Sanger, and FACT, respectively.

📝 Abstract
Attention-based Transformers have revolutionized natural language processing (NLP) and shown strong performance in computer vision (CV) tasks. However, as the input sequence varies, the computational bottlenecks in Transformer models exhibit dynamic behavior across stages, which calls for a cross-stage sparse acceleration strategy. Unfortunately, most existing sparse Transformer approaches are single-stage based, and their sparsity prediction mechanisms lead to significant power overhead when applied across multiple stages. To this end, this paper proposes a log-domain attention prediction algorithm-architecture co-design, named LAPA. First, an asymmetric leading one computing (ALOC) scheme is designed to eliminate expensive multiplications. Next, a mixed-precision multi-round shifting accumulation (MRSA) mechanism is further proposed to mitigate the accumulation overhead. A data-feature dependent filter (DDF) strategy is designed to work in concert with the MRSA process. Finally, an elaborate accelerator is designed to translate the theoretical enhancement into practical hardware improvement. Experimental results show that LAPA achieves 3.52x, 3.24x and 2.79x higher energy efficiency than the state-of-the-art (SOTA) works SpAtten, Sanger and FACT, respectively.
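The abstract does not spell out how ALOC replaces multiplications, but the general log-domain principle it builds on is well established: detect the leading one of each operand to get an approximate base-2 logarithm, add the logarithms, and convert back, so a multiplier becomes an adder plus shifts (Mitchell's approximation). The sketch below illustrates that generic principle only; `approx_log2` and `approx_mul` are hypothetical names, not the paper's ALOC implementation.

```python
def approx_log2(x: int) -> float:
    """Mitchell-style log2: leading-one position plus a linear fraction."""
    assert x > 0
    k = x.bit_length() - 1            # leading-one detection (LOD)
    return k + (x - (1 << k)) / (1 << k)

def approx_mul(a: int, b: int) -> float:
    """Approximate a*b in the log domain: one addition replaces the multiply."""
    s = approx_log2(a) + approx_log2(b)
    k = int(s)                        # integer part -> shift amount
    return (1 << k) * (1 + (s - k))   # inverse of the linear log2 approximation

# Mitchell's scheme always underestimates, with bounded relative error
# (worst case about 11.1%), which is why it suits coarse attention
# prediction rather than exact score computation.
for a, b in [(7, 9), (100, 100), (33, 5)]:
    exact, approx = a * b, approx_mul(a, b)
    assert 0.88 * exact <= approx <= exact
```

Because the log-domain product only has to rank attention scores for sparsity prediction, a bounded one-sided error of this kind is tolerable where an exact multiply would be wasted work.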
Problem

Research questions and friction points this paper is trying to address.

Addresses dynamic computational bottlenecks in Transformer models across stages
Reduces power overhead of sparsity prediction in multi-stage Transformers
Co-designs algorithm and architecture for efficient sparse attention acceleration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Log-domain attention prediction algorithm-architecture co-design
Asymmetric leading one computing eliminates expensive multiplications
Mixed-precision multi-round shifting accumulation mitigates overhead
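The abstract does not detail MRSA's internals, but the underlying idea of replacing a multiplier with multiple rounds of shifting and accumulation is a classic one: each set bit of one operand contributes a shifted copy of the other operand to an accumulator, one round per bit. The sketch below shows that generic shift-and-accumulate scheme; `shift_accumulate` is a hypothetical name and the loop omits the mixed-precision scheduling and DDF filtering that the paper layers on top.

```python
def shift_accumulate(a: int, b: int) -> int:
    """Exact multiply via per-bit rounds of shift-and-add (no multiplier)."""
    acc = 0
    round_no = 0
    while b:
        if b & 1:
            acc += a << round_no   # a shift stands in for a partial-product multiply
        b >>= 1
        round_no += 1              # one accumulation round per bit of b
    return acc

assert shift_accumulate(13, 11) == 143
```

A data-dependent filter in this setting would skip rounds whose contribution is predictably negligible, which is consistent with how the abstract describes DDF working "in concert with the MRSA process".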
Huizheng Wang (Tsinghua University): Sparse Attention, LLM accelerator, AI Infra, Distributed Parallelism, VLSI
Hongbin Wang (School of Integrated Circuits, Tsinghua University, Beijing, China)
Shaojun Wei (Professor, Tsinghua University)
Yang Hu (School of Integrated Circuits, Tsinghua University, Beijing, China)
Shouyi Yin (Tsinghua University)