🤖 AI Summary
To reduce the memory-access overhead that convolutions impose on CNN accelerators, a symptom of the von Neumann bottleneck, this paper proposes 3D-TrIM, a three-dimensional extension of the TrIM triangular dataflow architecture. It keeps TrIM's systolic array structure and its triangular-movement-based local reuse of input feature map activations, and adds a few shadow registers plus shift-register buffers shared across slices to improve data reuse as well as area and energy efficiency. Implemented in a commercial 22 nm technology, a 576-PE systolic array achieves an area efficiency of 4.47 TOPS/mm² and an energy efficiency of 4.54 TOPS/W. On CNN topologies such as VGG-16 and AlexNet, 3D-TrIM delivers up to 3.37× more operations per memory access than TrIM, which for the same number of operations corresponds to as little as about 29.7% of TrIM's memory accesses.
📝 Abstract
The von Neumann bottleneck, which relates to the energy cost of moving data between memory and the on-chip core, is a serious challenge in state-of-the-art AI architectures such as Convolutional Neural Network (CNN) accelerators. Systolic arrays exploit distributed processing elements that exchange data with each other, thus mitigating the memory cost. However, when they are used for convolutions, data redundancy must be carefully managed to avoid significant memory access overhead. To overcome this problem, TrIM has recently been proposed. It features a systolic array based on an innovative dataflow, where input feature map (ifmap) activations are locally reused through a triangular movement. However, ifmaps still suffer from memory access overhead. This work proposes 3D-TrIM, an upgraded version of TrIM that addresses the memory access overhead through a few extra shadow registers. In addition, due to a change in the architectural orientation, the local shift register buffers are now shared between different slices, thus improving area and energy efficiency. An architecture of 576 processing elements is implemented in a commercial 22 nm technology and achieves an area efficiency of 4.47 TOPS/mm$^2$ and an energy efficiency of 4.54 TOPS/W. Finally, 3D-TrIM outperforms TrIM by up to $3.37\times$ in terms of operations per memory access on CNN topologies such as VGG-16 and AlexNet.
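To make the data-redundancy argument concrete, the following minimal Python sketch counts ifmap reads for one convolution when every sliding window is fetched from memory in full, and compares that with reading each activation exactly once. It is an illustration only and does not model the 3D-TrIM or TrIM dataflow: the function names, the 224×224 single-channel ifmap, the 3×3 kernel, and the absence of padding are all assumptions chosen for the example. The last helper shows how a gain in operations per memory access translates into a fraction of memory accesses when the operation count is held fixed.

```python
def naive_ifmap_reads(height: int, width: int, kernel: int, stride: int = 1) -> int:
    """Ifmap reads if each sliding window is fetched from memory in full (no reuse)."""
    out_h = (height - kernel) // stride + 1  # output rows, no padding assumed
    out_w = (width - kernel) // stride + 1   # output columns, no padding assumed
    return out_h * out_w * kernel * kernel


def ideal_ifmap_reads(height: int, width: int) -> int:
    """Ifmap reads if every activation is fetched once and then reused locally."""
    return height * width


def access_fraction(ops_per_access_gain: float) -> float:
    """Fraction of memory accesses needed for the same operation count,
    given an improvement factor in operations per memory access."""
    return 1.0 / ops_per_access_gain


# Hypothetical layer shape, chosen only for illustration.
H, W, K = 224, 224, 3

naive = naive_ifmap_reads(H, W, K)
ideal = ideal_ifmap_reads(H, W)
redundancy = naive / ideal  # how many times more reads the naive scheme issues

# The "up to 3.37x" figure from the abstract, expressed as an access fraction.
fraction_vs_trim = access_fraction(3.37)
```

Under these assumptions the naive scheme reads each activation close to K² = 9 times on average, which is the redundancy that local reuse in a systolic array aims to remove. The reciprocal of 3.37 is about 0.297, so the best-case improvement quoted in the abstract corresponds to roughly 29.7% of TrIM's memory accesses for the same work.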