Axon: A novel systolic array architecture for improved run time and energy efficient GeMM and Conv operation with on-chip im2col

📅 2025-01-10
🤖 AI Summary
To address the linear operand-fill delay that unidirectional propagation imposes on systolic arrays during matrix multiplication and convolution in AI models (e.g., YOLOv3, ResNet50), this work proposes a data orchestration mechanism that feeds operands along the principal diagonal and then propagates them in both directions. The mechanism supports im2col inside the array with simple hardware support, avoiding the more elaborate hardware and control signals that data skew would require under conventional orchestration, and it lowers off-chip memory traffic by exploiting input feature map reuse. 16×16 arrays using the proposed and the conventional orchestration were synthesized and placed and routed with the ASAP 7nm PDK. Results show up to a 2× runtime improvement and, on average across YOLOv3 and ResNet50, a 1.2× throughput improvement and a 2.17× reduction in inference energy, with 0.211% area overhead and 1.6% power overhead.
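The fill-latency argument can be illustrated with a toy hop-count model. This is a sketch under stated assumptions, not the paper's microarchitecture: it assumes a square n×n array in which a processing element (PE) at position (i, j) can start once both its row operand and its column operand have arrived, and it counts one cycle per hop. The function names are hypothetical.

```python
def edge_fed_fill_hops(n):
    """Worst-case hops before the last PE sees its first operand pair when
    rows are fed from the left edge and columns from the top edge."""
    # Assumption: PE (i, j) waits j hops for its row operand and i hops for
    # its column operand; with the usual input skew it starts at cycle i + j.
    return max(i + j for i in range(n) for j in range(n))


def diagonal_fed_fill_hops(n):
    """Worst-case hops when operands are fed at the diagonal PEs and then
    travel in both directions along their row or column."""
    # Assumption: the row operand enters at (i, i) and the column operand at
    # (j, j); both reach PE (i, j) after |i - j| hops.
    return max(abs(i - j) for i in range(n) for j in range(n))


n = 16
print(edge_fed_fill_hops(n), diagonal_fed_fill_hops(n))
```

For n = 16 this model gives 30 hops for edge feeding and 15 for diagonal feeding, which is consistent with the "up to 2×" bound. It covers only the fill phase, so the end-to-end gain for a given workload depends on how much of its runtime the fill phase accounts for.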

📝 Abstract
General matrix multiplication (GeMM) is a core operation in virtually all AI applications. Systolic array (SA) based architectures have shown great promise as GeMM hardware accelerators thanks to their speed and energy efficiency. Unfortunately, SAs incur a linear delay in filling the operands, due to unidirectional propagation via pipeline latches. In this work, we propose a novel in-array data orchestration technique for SAs in which data is fed on the principal diagonal and then propagated bidirectionally. This improves the runtime by up to 2X at minimal hardware overhead. In addition, the proposed data orchestration enables convolution lowering (known as im2col) with simple hardware support, fully exploiting input feature map reuse opportunities and significantly lowering off-chip memory traffic. This results in a 1.2X throughput improvement and a 2.17X inference energy reduction on average across YOLOv3 and ResNet50 workloads. In contrast, conventional data orchestration would require more elaborate hardware and control signals to implement im2col in hardware because of the data skew. We have synthesized and placed and routed 16X16 systolic arrays based on the novel and conventional orchestrations using the ASAP 7nm PDK and found that our proposed approach results in 0.211% area and 1.6% power overheads.
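For readers unfamiliar with convolution lowering, the sketch below shows what im2col computes and why it invites reuse. It is a plain software reference in pure Python, not the on-chip unit described in the abstract; the function names, the nested-list layout (channels, rows, columns), and the absence of padding are all assumptions made for illustration.

```python
def im2col(x, kh, kw, stride=1):
    """Lower a (C, H, W) nested-list feature map to a matrix with
    C*kh*kw rows and out_h*out_w columns (no padding)."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    rows = []
    for ch in range(c):
        for di in range(kh):
            for dj in range(kw):
                # One row per (channel, kernel offset); one column per output pixel.
                rows.append([x[ch][i * stride + di][j * stride + dj]
                             for i in range(out_h) for j in range(out_w)])
    return rows


def matmul(a, b):
    """Dense matrix product of two nested lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def conv_via_gemm(x, weights, stride=1):
    """Convolve x with K filters of shape (C, kh, kw) as a single GeMM.
    Returns K rows, each holding out_h*out_w outputs in row-major order."""
    kh, kw = len(weights[0][0]), len(weights[0][0][0])
    # Flatten each filter in the same (channel, row, column) order as im2col.
    flat_w = [[v for ch in f for r in ch for v in r] for f in weights]
    return matmul(flat_w, im2col(x, kh, kw, stride))


# Usage: one 4x4 channel holding 0..15, one 3x3 all-ones filter.
x = [[[r * 4 + c for c in range(4)] for r in range(4)]]
w = [[[[1, 1, 1], [1, 1, 1], [1, 1, 1]]]]
lowered = im2col(x, 3, 3)
print(conv_via_gemm(x, w))                 # [[45, 54, 81, 90]]
print(sum(len(r) for r in lowered) / 16)   # 2.25
```

In this small case the lowered matrix holds 36 values built from only 16 distinct inputs, a 2.25× expansion. When lowering is done off-chip, those duplicates have to be stored and fetched; the input feature map reuse that the abstract refers to is the chance to avoid that repeated traffic.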
Problem

Research questions and friction points this paper is trying to address.

Systolic Array Design
Matrix Multiplication Efficiency
AI Application Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

im2col optimization
data propagation enhancement
AI task acceleration
Md. Mizanur Rahaman Nayan
Department of Electrical and Computer Engineering, Georgia Institute of Technology, USA
Ritik Raj
PhD Student, Georgia Institute of Technology
Computer Architecture
Gouse Basha Shaik
Department of Electrical and Computer Engineering, Georgia Institute of Technology, USA
Tushar Krishna
Associate Professor, Georgia Tech
Computer Architecture, Interconnection Networks, Network-on-Chip, Deep Learning Accelerators
Azad Naeemi
Professor of Electrical and Computer Engineering, Georgia Institute of Technology
Nanoelectronics