🤖 AI Summary
To address the data movement bottleneck caused by dynamic matrix multiplication in Transformer attention, this work proposes a digital-SRAM-based in-memory computing (IMC) macro. It enables weight residency by restructuring the QK weight matrix, and it decomposes the dynamic matrix multiplication into bit-serial shift-and-add operations, eliminating conventional physical multipliers. Circuit-level optimizations include zero-value skipping, data-driven wordline activation, read-write-separated 6T SRAM cells, and bit-alternating adders, which collectively support high-precision computation. Implemented in a 65 nm CMOS process, the design achieves 34.1 TOPS/W energy efficiency and 120.77 GOPS/mm² area efficiency, outperforming CPUs and GPUs by 25× and 13× in energy efficiency, respectively, and surpassing state-of-the-art IMC accelerators by over 7× in energy efficiency and 2× in area efficiency.
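As a minimal numerical sketch (not the paper's exact hardware mapping), the weight restructuring can be read as precomputing a combined weight W_Q·W_K^T offline, so the attention score becomes a weight-stationary product with only the dynamic activations streaming in. All sizes and variable names below are illustrative assumptions.

```python
import numpy as np

# Sketch: S = (X @ Wq) @ (X @ Wk).T can be regrouped as X @ (Wq @ Wk.T) @ X.T,
# so the combined weight Wqk = Wq @ Wk.T can stay resident in the CIM array
# while only the dynamic activations X stream in.
rng = np.random.default_rng(0)
d_model, d_head, seq_len = 8, 4, 3            # toy sizes, chosen arbitrarily
X  = rng.integers(-8, 8, size=(seq_len, d_model)).astype(np.int32)
Wq = rng.integers(-8, 8, size=(d_model, d_head)).astype(np.int32)
Wk = rng.integers(-8, 8, size=(d_model, d_head)).astype(np.int32)

# Conventional two-step score: both operands (Q and K) are dynamic.
S_dynamic = (X @ Wq) @ (X @ Wk).T

# Weight-stationary form: Wqk is precomputed once and held in memory.
Wqk = Wq @ Wk.T                               # combined QK weight, d_model x d_model
S_stationary = X @ Wqk @ X.T

assert np.array_equal(S_dynamic, S_stationary)
```

The regrouping relies only on the associativity of matrix multiplication, which is why the score result is bit-exact between the two forms in integer arithmetic.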
📝 Abstract
Compute-in-memory (CIM) techniques are widely employed in energy-efficient artificial intelligence (AI) processors. They alleviate the power and latency bottlenecks caused by extensive data movement between compute and storage units. This work proposes a digital CIM macro to compute Transformer attention. To accommodate the dynamic matrix multiplication, which is unsuitable for the common weight-stationary CIM paradigm, we reformulate the attention score computation based on a combined QK-weight matrix, so that inputs can be fed directly to the CIM cells to obtain the score results. Moreover, the involved binomial matrix multiplication is decomposed into four groups of bit-serial shift-and-add operations, eliminating costly physical multipliers in the CIM. We maximize the energy efficiency of the CIM circuit through zero-value bit skipping, data-driven word-line activation, read-write-separated 6T cells, and bit-alternating 14T/28T adders. The proposed CIM macro was implemented in a 65-nm process. It occupied only 0.35 mm² and delivered a 42.27 GOPS peak performance at 1.24 mW power consumption with a 1.0 V supply and a 100 MHz clock frequency, resulting in 34.1 TOPS/W energy efficiency and 120.77 GOPS/mm² area efficiency. Compared to a CPU and a GPU, our CIM macro is 25× and 13× more energy efficient on practical tasks, respectively. Compared with other Transformer CIMs, our design achieves at least 7× higher energy efficiency and at least 2× higher area efficiency when scaled to the same technology node, showcasing its potential for edge-side intelligent applications.
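The bit-serial shift-and-add idea and zero-value bit skipping can be illustrated with a small software analogue. The function below is a hypothetical sketch of the arithmetic, not the macro's circuit; it assumes unsigned activations of a fixed bit width and integer weights, and the names and sizes are made up for illustration.

```python
def bit_serial_mac(activations, weights, n_bits=8):
    """Multiply-accumulate without an explicit multiplier.

    Each activation is processed one bit at a time: a set bit contributes
    the weight shifted by that bit's position, and zero bits are skipped
    entirely (the zero-value bit-skipping idea).
    """
    acc = 0
    for a, w in zip(activations, weights):
        for b in range(n_bits):
            if (a >> b) & 1:          # skip zero bits: no shift, no add
                acc += w << b         # shift-and-add replaces a * w
    return acc

# Usage: the result matches an ordinary dot product.
acts = [3, 0, 12, 7]      # dynamic inputs (e.g. streamed activations)
wts  = [5, -2, 1, 4]      # stationary weights held in the CIM array
assert bit_serial_mac(acts, wts) == sum(a * w for a, w in zip(acts, wts))
```

In this reading, sparser or lower-magnitude activations trigger fewer shift-and-add steps, which is where the data-dependent energy savings come from.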