BitStopper: An Efficient Transformer Attention Accelerator via Stage-fusion and Early Termination

📅 2025-12-06
🤖 AI Summary
To address the high computational and memory overhead of Transformer self-attention—stemming from its quadratic complexity—this work proposes an algorithm–architecture co-optimization method that eliminates the need for a dedicated sparsity predictor. The approach introduces: (1) bit-level enabled stage fusion and early termination, removing the auxiliary prediction phase required by conventional dynamic sparse attention; (2) lightweight adaptive token selection coupled with bit-level asynchronous processing, enabling fine-grained hardware support for dynamic sparsity; and (3) synergistic integration of bit-serial computation, dynamic token termination, fine-grained memory scheduling, and hardware-level sparse speculation. Evaluated at equivalent accuracy, the design achieves 2.03× and 1.89× speedup over Sanger and SOFA accelerators, respectively, while improving energy efficiency by 2.4× and 2.1×.

📝 Abstract
Attention-based large language models (LLMs) have transformed modern AI applications, but the quadratic cost of self-attention imposes significant compute and memory overhead. Dynamic sparsity (DS) attention mitigates this, yet its hardware efficiency is limited by the added prediction stage and the heavy memory traffic it entails. To address these limitations, this paper proposes BitStopper, a fine-grained algorithm-architecture co-design that operates without a sparsity predictor. First, a bit-serial enabled stage fusion (BESF) mechanism is proposed to reuse data and minimize memory access by progressively terminating trivial tokens and merging the prediction stage into the execution stage. Second, a lightweight and adaptive token selection (LATS) strategy is developed to work in concert with bit-level sparsity speculation. Third, a bit-level asynchronous processing (BAP) strategy is employed to improve compute utilization during on-demand bit-grained memory fetching. Finally, an elaborate architecture is designed to translate the theoretical complexity reduction into practical performance improvement. Extensive evaluations demonstrate that, compared to state-of-the-art (SOTA) Transformer accelerators, BitStopper achieves 2.03x and 1.89x speedups over Sanger and SOFA, respectively, while delivering 2.4x and 2.1x improvements in energy efficiency.
Problem

Research questions and friction points this paper is trying to address.

Reduces compute and memory overhead of self-attention in LLMs
Eliminates the need for a sparsity predictor in dynamic attention
Improves hardware efficiency via algorithm-architecture co-design
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bit-serial enabled stage fusion merges prediction and execution stages
Lightweight adaptive token selection works with bit-level sparsity speculation
Bit-level asynchronous processing improves compute utilization during memory fetching
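The core idea behind bit-serial early termination can be sketched in a toy form. The following is an illustrative reconstruction, not the paper's actual algorithm: attention scores are accumulated bit-plane by bit-plane (MSB first), and after each plane, tokens whose maximum achievable score can no longer reach the current top-k lower bound are terminated, so their remaining bit-planes never need to be fetched or computed. The function name and bounds are assumptions for illustration; non-negative integer operands are assumed so the bounds hold.

```python
import numpy as np

def bitserial_topk_scores(q, K, bits=8, k=2):
    """Toy bit-serial score computation with early termination.

    q: non-negative int query vector (elements < 2**bits)
    K: non-negative int key matrix, one row per token
    Processes key bits MSB-first; after each bit-plane, tokens whose
    maximum achievable score cannot reach the current k-th best lower
    bound are terminated (skipped for all remaining bit-planes).
    """
    n = K.shape[0]
    partial = np.zeros(n, dtype=np.int64)   # score from bit-planes seen so far
    alive = np.ones(n, dtype=bool)          # tokens not yet terminated
    qsum = int(q.sum())                     # max contribution of one all-ones key bit-plane
    for b in range(bits - 1, -1, -1):       # MSB first
        plane = (K >> b) & 1                # current bit of every key element
        partial[alive] += (plane[alive] @ q) << b
        # Remaining planes (b-1 .. 0) can add at most qsum * (2**b - 1).
        ub = partial + qsum * ((1 << b) - 1)
        if alive.sum() > k:
            # k-th largest partial sum is a lower bound on the k-th best final score.
            thresh = np.sort(partial[alive])[-k]
            alive &= ub >= thresh           # terminate hopeless tokens
    return partial, alive
```

Tokens that survive all bit-planes end with their exact dot-product score, while terminated tokens stop consuming compute and memory bandwidth early; this is the sense in which fusing speculation into execution removes the need for a separate prediction pass.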
Huizheng Wang
Tsinghua University
Interests: Sparse Attention, LLM Accelerators, AI Infra, Distributed Parallelism, VLSI

Hongbin Wang
School of Integrated Circuits, Tsinghua University, Beijing, China

Shaojun Wei
Professor, Tsinghua University

Yang Hu
School of Integrated Circuits, Tsinghua University, Beijing, China

Shouyi Yin
Tsinghua University