🤖 AI Summary
To address the data movement bottleneck caused by dynamic matrix multiplication in Transformer attention, this work proposes a digital-SRAM-based in-memory computing (IMC) macro. It enables weight residency by restructuring the QK weight matrix, and it decomposes the dynamic matrix multiplication into bit-serial shift-and-add operations, eliminating conventional physical multipliers. Circuit-level optimizations include zero-value skipping, data-driven wordline activation, read-write-separated 6T SRAM cells, and bit-alternating adders, which collectively support high-precision computation. Implemented in a 65 nm CMOS process, the design achieves 34.1 TOPS/W energy efficiency and 120.77 GOPS/mm² area efficiency, outperforming CPUs and GPUs by 25× and 13× in energy efficiency, respectively, and surpassing state-of-the-art IMC accelerators by over 7× in energy efficiency and 2× in area efficiency.
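As a minimal numerical sketch (not the paper's exact hardware mapping), the weight restructuring can be read as precomputing a combined weight W_Q·W_K^T offline, so the attention score becomes a weight-stationary product with only the dynamic activations streaming in. All sizes and variable names below are illustrative assumptions.

```python
import numpy as np

# Sketch: S = (X @ Wq) @ (X @ Wk).T can be regrouped as X @ (Wq @ Wk.T) @ X.T,
# so the combined weight Wqk = Wq @ Wk.T can stay resident in the CIM array
# while only the dynamic activations X stream in.
rng = np.random.default_rng(0)
d_model, d_head, seq_len = 8, 4, 3            # toy sizes, chosen arbitrarily
X  = rng.integers(-8, 8, size=(seq_len, d_model)).astype(np.int32)
Wq = rng.integers(-8, 8, size=(d_model, d_head)).astype(np.int32)
Wk = rng.integers(-8, 8, size=(d_model, d_head)).astype(np.int32)

# Conventional two-step score: both operands (Q and K) are dynamic.
S_dynamic = (X @ Wq) @ (X @ Wk).T

# Weight-stationary form: Wqk is precomputed once and held in memory.
Wqk = Wq @ Wk.T                               # combined QK weight, d_model x d_model
S_stationary = X @ Wqk @ X.T

assert np.array_equal(S_dynamic, S_stationary)
```

The regrouping relies only on the associativity of matrix multiplication, which is why the score result is bit-exact between the two forms in integer arithmetic.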
📝 Abstract
Compute-in-memory (CIM) techniques are widely employed in energy-efficient artificial intelligence (AI) processors. They alleviate the power and latency bottlenecks caused by extensive data movement between compute and storage units. This work proposes a digital CIM macro to compute Transformer attention. To accommodate the dynamic matrix multiplication, which is unsuitable for the common weight-stationary CIM paradigm, we reformulate the attention score computation based on a combined QK-weight matrix, so that inputs can be fed directly to the CIM cells to obtain the score results. Moreover, the involved binomial matrix multiplication is decomposed into four groups of bit-serial shift-and-add operations, eliminating costly physical multipliers in the CIM. We maximize the energy efficiency of the CIM circuit through zero-value bit skipping, data-driven word-line activation, read-write-separated 6T cells, and bit-alternating 14T/28T adders. The proposed CIM macro was implemented in a 65-nm process. It occupied only 0.35 mm² and delivered a 42.27 GOPS peak performance at 1.24 mW power consumption with a 1.0 V supply and a 100 MHz clock frequency, resulting in 34.1 TOPS/W energy efficiency and 120.77 GOPS/mm² area efficiency. Compared to a CPU and a GPU, our CIM macro is 25× and 13× more energy efficient on practical tasks, respectively. Compared with other Transformer CIMs, our design achieves at least 7× higher energy efficiency and at least 2× higher area efficiency when scaled to the same technology node, showcasing its potential for edge-side intelligent applications.
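The bit-serial shift-and-add idea and zero-value bit skipping can be illustrated with a small software analogue. The function below is a hypothetical sketch of the arithmetic, not the macro's circuit; it assumes unsigned activations of a fixed bit width and integer weights, and the names and sizes are made up for illustration.

```python
def bit_serial_mac(activations, weights, n_bits=8):
    """Multiply-accumulate without an explicit multiplier.

    Each activation is processed one bit at a time: a set bit contributes
    the weight shifted by that bit's position, and zero bits are skipped
    entirely (the zero-value bit-skipping idea).
    """
    acc = 0
    for a, w in zip(activations, weights):
        for b in range(n_bits):
            if (a >> b) & 1:          # skip zero bits: no shift, no add
                acc += w << b         # shift-and-add replaces a * w
    return acc

# Usage: the result matches an ordinary dot product.
acts = [3, 0, 12, 7]      # dynamic inputs (e.g. streamed activations)
wts  = [5, -2, 1, 4]      # stationary weights held in the CIM array
assert bit_serial_mac(acts, wts) == sum(a * w for a, w in zip(acts, wts))
```

In this reading, sparser or lower-magnitude activations trigger fewer shift-and-add steps, which is where the data-dependent energy savings come from.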