3D-TrIM: A Memory-Efficient Spatial Computing Architecture for Convolution Workloads

📅 2025-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high memory-access overhead and the von Neumann bottleneck incurred by convolution in CNN accelerators, this paper proposes 3D-TrIM, a three-dimensional extension of the triangular dataflow architecture TrIM. The design introduces a triangular-movement-based local reuse mechanism that combines shadow registers with shift-register buffers shared across architectural slices, while preserving the systolic array structure, to improve data reuse efficiency and hardware resource utilization. Implemented in 22 nm CMOS, a 576-PE systolic array achieves 4.47 TOPS/mm² area efficiency and 4.54 TOPS/W energy efficiency. On VGG-16 and AlexNet, 3D-TrIM improves operations per memory access by up to 3.37× over TrIM (equivalently, about 29.7% of TrIM's memory accesses for the same workloads), substantially easing memory bandwidth pressure and reducing energy consumption.
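The operations-per-memory-access metric above can be made concrete with a small back-of-the-envelope sketch. This is an illustrative model only, not the paper's exact dataflow: it contrasts a naive scheme that refetches every convolution window from external memory with an idealized local-reuse scheme in which each input activation is fetched once (the kind of reuse triangular-dataflow designs like TrIM and 3D-TrIM approximate). The layer shape (224x224 input, 3x3 kernel) is a VGG-16-like assumption.

```python
# Illustrative sketch (assumed model, not the paper's exact dataflow):
# count MACs and external memory accesses for a KxK convolution over an
# HxW single-channel input feature map.

def conv_memory_accesses(H, W, K, reuse):
    out_h, out_w = H - K + 1, W - K + 1
    macs = out_h * out_w * K * K          # multiply-accumulates
    if reuse:
        accesses = H * W                  # each activation fetched once
    else:
        accesses = out_h * out_w * K * K  # every window refetched
    return macs, accesses

# Hypothetical VGG-16-like layer: 3x3 kernel on a 224x224 feature map
macs, naive = conv_memory_accesses(224, 224, 3, reuse=False)
_, reused = conv_memory_accesses(224, 224, 3, reuse=True)
print(macs / naive)   # ops per access without reuse: 1.0
print(macs / reused)  # ops per access with full reuse: roughly 8.8
```

Under this toy model, full local reuse lifts operations per memory access by the window-overlap factor (close to K² for stride 1); the paper's reported 3.37× gain over TrIM sits within this headroom, since TrIM already reuses activations partially.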

📝 Abstract
The Von Neumann bottleneck, which relates to the energy cost of moving data from memory to on-chip core and vice versa, is a serious challenge in state-of-the-art AI architectures, like Convolutional Neural Network (CNN) accelerators. Systolic arrays exploit distributed processing elements that exchange data with each other, thus mitigating the memory cost. However, when involved in convolutions, data redundancy must be carefully managed to avoid significant memory access overhead. To overcome this problem, TrIM has been recently proposed. It features a systolic array based on an innovative dataflow, where input feature map (ifmap) activations are locally reused through a triangular movement. However, ifmaps still suffer from memory access overhead. This work proposes 3D-TrIM, an upgraded version of TrIM that addresses the memory access overhead through a few extra shadow registers. In addition, due to a change in the architectural orientation, the local shift register buffers are now shared between different slices, thus improving area and energy efficiency. An architecture of 576 processing elements is implemented on commercial 22 nm technology and achieves an area efficiency of 4.47 TOPS/mm$^2$ and an energy efficiency of 4.54 TOPS/W. Finally, 3D-TrIM outperforms TrIM by up to $3.37\times$ in terms of operations per memory access considering CNN topologies like VGG-16 and AlexNet.
Problem

Research questions and friction points this paper is trying to address.

Reduces memory access overhead in CNN accelerators
Enhances area and energy efficiency in spatial computing
Improves dataflow in systolic arrays for convolutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes shadow registers for memory efficiency
Shares shift registers between architectural slices
Implements 576 PEs in 22 nm technology
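The shift-register-buffer idea behind the innovations listed above can be sketched with a minimal line-buffer model. This is a hypothetical illustration of the general technique (a K-row buffer that lets each input row fetched from memory feed K rows of output windows), not the paper's shared cross-slice implementation; the function name `sliding_windows` and the 4x4 test image are assumptions for the example.

```python
from collections import deque

# Hypothetical line-buffer sketch: keep the last K rows in a
# shift-register-like structure so each input row is read from external
# memory exactly once, yet contributes to up to K rows of KxK windows.

def sliding_windows(image, K=3):
    H, W = len(image), len(image[0])
    rows = deque(maxlen=K)               # oldest row shifts out automatically
    for r in range(H):
        rows.append(image[r])            # one new row fetched per step
        if len(rows) == K:
            for c in range(W - K + 1):
                yield [row[c:c + K] for row in rows]

# Toy 4x4 input: values 0..15 in row-major order
img = [[r * 4 + c for c in range(4)] for r in range(4)]
wins = list(sliding_windows(img))
print(len(wins))  # (4-3+1) * (4-3+1) = 4 windows
```

Each of the 16 input values is fetched once, yet the 4 windows together consume 36 values; that 36/16 ratio is the local-reuse saving such buffers provide, and sharing the buffers across slices (as 3D-TrIM does) amortizes their area cost as well.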
Cristian Sestito
Centre for Electronics Frontiers, Institute for Integrated Micro and Nano Systems, School of Engineering, The University of Edinburgh, EH9 3BF, Edinburgh, United Kingdom
Ahmed J. Abdelmaksoud
Centre for Electronics Frontiers, Institute for Integrated Micro and Nano Systems, School of Engineering, The University of Edinburgh, EH9 3BF, Edinburgh, United Kingdom
Shady Agwa
Centre for Electronics Frontiers CEF, University of Edinburgh
In-Memory Computing, Emerging Technologies, AI Hardware, Stochastic Computing, Resilient Micro
Themis Prodromakis
Regius Chair of Engineering, Centre for Electronics Frontiers, University of Edinburgh
Nanotechnology, Memristors, Nanoelectronics, Sensors, Point-of-care diagnostics