3D-TrIM: A Memory-Efficient Spatial Computing Architecture for Convolution Workloads

📅 2025-02-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the high memory-access overhead and von Neumann bottleneck of convolution in CNN accelerators, this paper proposes 3D-TrIM, a three-dimensional extension of the TrIM triangular-dataflow architecture. It keeps TrIM's systolic array structure and its triangular local reuse of input feature map activations, and adds a few shadow registers plus shift-register buffers shared across slices to improve data reuse and hardware resource utilization. Implemented in a commercial 22 nm technology, the 576-PE systolic array achieves an area efficiency of 4.47 TOPS/mm² and an energy efficiency of 4.54 TOPS/W. On VGG-16 and AlexNet, 3D-TrIM delivers up to 3.37× more operations per memory access than TrIM, which for the same operation count corresponds to as little as 29.7% of TrIM's memory accesses.
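The 29.7% figure and the 3.37× operations-per-memory-access figure from the abstract describe the same result, provided both designs execute the same number of operations. The short Python sketch below shows that arithmetic; the function name `access_fraction` is hypothetical and only illustrates the conversion, it is not taken from the paper.

```python
def access_fraction(ops_per_access_gain: float) -> float:
    """Fraction of the baseline's memory accesses needed for the same operation count.

    Assumes both architectures perform an identical number of operations,
    so accesses scale as 1 / (operations per access).
    """
    if ops_per_access_gain <= 0:
        raise ValueError("gain must be positive")
    return 1.0 / ops_per_access_gain


# Peak gain over TrIM reported in the abstract ("up to 3.37x").
gain = 3.37
print(f"{access_fraction(gain):.1%}")  # 29.7%
```

Because 3.37× is an upper bound ("up to"), 29.7% is the best case rather than a figure that holds for every layer.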

📝 Abstract

The Von Neumann bottleneck, which relates to the energy cost of moving data between memory and on-chip cores, is a serious challenge in state-of-the-art AI architectures such as Convolutional Neural Network (CNN) accelerators. Systolic arrays exploit distributed processing elements that exchange data with each other, thus mitigating the memory cost. However, when they are used for convolutions, data redundancy must be carefully managed to avoid significant memory access overhead. To overcome this problem, TrIM has recently been proposed. It features a systolic array based on an innovative dataflow in which input feature map (ifmap) activations are locally reused through a triangular movement. However, ifmaps still suffer from memory access overhead. This work proposes 3D-TrIM, an upgraded version of TrIM that addresses the memory access overhead through a few extra shadow registers. In addition, due to a change in the architectural orientation, the local shift register buffers are now shared between different slices, thus improving area and energy efficiency. An architecture of 576 processing elements is implemented in a commercial 22 nm technology and achieves an area efficiency of 4.47 TOPS/mm$^2$ and an energy efficiency of 4.54 TOPS/W. Finally, 3D-TrIM outperforms TrIM by up to $3.37\times$ in terms of operations per memory access on CNN topologies such as VGG-16 and AlexNet.
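To make the data-redundancy point concrete, the following Python sketch counts ifmap reads for one channel in two idealized cases: a naive sliding window that re-fetches every activation for every output, and perfect local reuse where each activation is fetched once. It is a generic illustration, not a model of the TrIM or 3D-TrIM dataflow; the function name `ifmap_reads` and the 224×224 input with a 3×3 kernel are assumptions chosen for the example, with stride 1 and no padding.

```python
def ifmap_reads(height: int, width: int, kernel: int, stride: int = 1) -> tuple[int, int]:
    """Return (naive_reads, single_fetch_reads) for one 2D ifmap channel, no padding.

    naive_reads: every output position re-reads its full kernel x kernel window.
    single_fetch_reads: every ifmap activation is read from memory exactly once
    (an idealized bound that assumes perfect on-chip reuse).
    """
    if kernel > height or kernel > width:
        raise ValueError("kernel must fit inside the ifmap")
    out_h = (height - kernel) // stride + 1
    out_w = (width - kernel) // stride + 1
    naive = out_h * out_w * kernel * kernel
    single_fetch = height * width
    return naive, single_fetch


naive, single = ifmap_reads(224, 224, 3)
print(naive, single, round(naive / single, 2))  # 443556 50176 8.84
```

In this idealized setting the naive scheme issues almost nine times as many ifmap reads, which is the redundancy that local reuse schemes aim to remove.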

Problem

Research questions and friction points this paper is trying to address.

Reduces memory access overhead in CNN accelerators

Enhances area and energy efficiency in spatial computing

Improves dataflow in systolic arrays for convolutions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes shadow registers for memory efficiency

Shares shift registers between architectural slices

Implements 576 PEs in 22 nm technology
