A digital SRAM-based compute-in-memory macro for weight-stationary dynamic matrix multiplication in Transformer attention score computation

📅 2025-11-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the data movement bottleneck caused by dynamic matrix multiplication in Transformer attention, this work proposes a digital-SRAM-based in-memory computing (IMC) macro. It enables weight residency by restructuring the QK weight matrix and decomposes the dynamic matrix multiplication into bit-serial shift-and-add operations—eliminating conventional physical multipliers. Circuit-level optimizations include zero-value skipping, data-driven wordline activation, read-write separated 6T SRAM cells, and bit-interleaved adders, collectively supporting high-precision computation. Fabricated in 65 nm CMOS, the design achieves 34.1 TOPS/W energy efficiency and 120.77 GOPS/mm² area efficiency—outperforming CPUs and GPUs by 25× and 13× in energy efficiency, respectively, and surpassing state-of-the-art IMC accelerators by over 7× in energy efficiency and 2× in area efficiency.

📝 Abstract
Compute-in-memory (CIM) techniques are widely employed in energy-efficient artificial intelligence (AI) processors. They alleviate the power and latency bottlenecks caused by extensive data movement between compute and storage units. This work proposes a digital CIM macro to compute Transformer attention. To mitigate dynamic matrix multiplication, which is unsuitable for the common weight-stationary CIM paradigm, we reformulate the attention score computation based on a combined QK-weight matrix, so that inputs can be fed directly to CIM cells to obtain the score results. Moreover, the involved binomial matrix multiplication is decomposed into four groups of bit-serial shifts and additions, eliminating costly physical multipliers in the CIM. We maximize the energy efficiency of the CIM circuit through zero-value bit-skipping, data-driven word-line activation, read-write-separated 6T cells, and bit-alternating 14T/28T adders. The proposed CIM macro was implemented in a 65-nm process. It occupies only 0.35 mm² and delivers a 42.27 GOPS peak performance with 1.24 mW power consumption at a 1.0 V supply and a 100 MHz clock frequency, resulting in 34.1 TOPS/W energy efficiency and 120.77 GOPS/mm² area efficiency. Compared to a CPU and a GPU, our CIM macro is 25× and 13× more energy efficient on practical tasks, respectively. Compared with other Transformer-CIMs, our design exhibits at least 7× energy-efficiency and 2× area-efficiency improvements when scaled to the same technology node, showcasing its potential for edge-side intelligent applications.
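The key reformulation in the abstract — folding W_Q and W_K into a single stationary matrix so that attention scores X·W_Q·(X·W_K)ᵀ become X·(W_Q·W_Kᵀ)·Xᵀ — can be sketched in NumPy. The dimensions and random integer data below are illustrative assumptions, not the macro's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed, not the paper's): model dim, head dim, sequence length
d_model, d_head, seq = 8, 4, 3
X = rng.integers(-8, 8, size=(seq, d_model))
W_Q = rng.integers(-8, 8, size=(d_model, d_head))
W_K = rng.integers(-8, 8, size=(d_model, d_head))

# Conventional path: two dynamic matrix multiplications with activation-dependent operands
scores_ref = (X @ W_Q) @ (X @ W_K).T

# Weight-stationary reformulation: precompute the combined QK-weight matrix once;
# it stays resident in the CIM array while inputs stream against it
W_QK = W_Q @ W_K.T
scores = X @ W_QK @ X.T

# Both paths yield identical integer results (matrix multiplication is associative)
assert np.array_equal(scores, scores_ref)
```

With integer operands the two paths agree exactly, which is why the combined matrix can replace the dynamic Q·Kᵀ product without approximation.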
Problem

Research questions and friction points this paper is trying to address.

Extensive data movement between compute and storage units bottlenecks the power and latency of Transformer attention
Dynamic matrix multiplication in attention score computation is unsuitable for the common weight-stationary CIM paradigm
Edge-side AI applications demand far higher energy efficiency than general-purpose CPUs and GPUs provide
Innovation

Methods, ideas, or system contributions that make the work stand out.

Digital CIM macro computes Transformer attention scores
Reformulates attention with combined QK-weight matrix
Uses bit-serial shifting and additions without multipliers
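The multiplier-free operation in the bullets above can be sketched in software as a bit-serial shift-and-add loop with zero-value bit-skipping. The bit width, sign handling, and function name are illustrative assumptions; the macro realizes this across four hardware groups rather than a sequential loop:

```python
def bit_serial_mul(a: int, b: int, bits: int = 8) -> int:
    """Multiply a by b using only shifts and adds, one bit of b per step.

    Steps where the current bit of b is 0 are skipped entirely,
    mirroring the zero-value bit-skipping optimization.
    """
    sign = -1 if b < 0 else 1
    b = abs(b)
    acc = 0
    for i in range(bits):
        if (b >> i) & 1:        # zero bits contribute nothing and are skipped
            acc += a << i       # add a shifted copy of a instead of multiplying
    return sign * acc
```

For example, `bit_serial_mul(13, 11)` accumulates `13 + (13 << 1) + (13 << 3)`, since 11 is binary 1011, and matches `13 * 11`.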
Jianyi Yu
School of Microelectronics and Communication Engineering, Chongqing University, 400030 Chongqing, China
Yuxuan Wang
School of Microelectronics and Communication Engineering, Chongqing University, 400030 Chongqing, China
Xiang Fu
School of Microelectronics and Communication Engineering, Chongqing University, 400030 Chongqing, China
Fei Qiao
Department of Electronic Engineering, Tsinghua University, 100084 Beijing, China
Ying Wang
Institute of Computing Technology, Chinese Academy of Sciences, 100190 Beijing, China
Rui Yuan
Unknown affiliation
Machine learning · Deep learning · Reinforcement learning · Optimization
Liyuan Liu
Microsoft Research
Cong Shi
School of Microelectronics and Communication Engineering, Chongqing University, 400030 Chongqing, China