Tempus Core: Area-Power Efficient Temporal-Unary Convolution Core for Low-Precision Edge DLAs

📅 2024-12-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the stringent resource and power constraints of edge AI inference, this paper proposes Tempus Core, a temporal-unary convolution core compatible with the NVDLA dataflow. Methodologically, it introduces the first temporal-unary-binary (TUB) multiplier architecture, enabling seamless hardware integration of unary computing with commercial deep learning accelerators at both the dataflow and interface levels. Evaluated in 45 nm CMOS, Tempus Core employs temporally encoded unary computation, NVDLA-compatible PE-array reconfiguration, and backend optimizations. Results show that a single PE reduces area and power by 59.3% and 15.3%, respectively, versus NVDLA's CMAC unit; a 16×16 array achieves 75% area and 62% power reductions. Iso-area throughput improves by 5× for INT8 and 4× for INT4. A compact 16×4 array operating in INT4 mode occupies only 0.017 mm² and consumes 6.2 mW.

๐Ÿ“ Abstract
The increasing complexity of deep neural networks (DNNs) poses significant challenges for edge inference deployment due to resource and power constraints of edge devices. Recent works on unary-based matrix multiplication hardware aim to leverage data sparsity and low-precision values to enhance hardware efficiency. However, the adoption and integration of such unary hardware into commercial deep learning accelerators (DLAs) remain limited due to processing element (PE) array dataflow differences. This work presents Tempus Core, a convolution core with a highly scalable unary-based PE array comprising tub (temporal-unary-binary) multipliers that seamlessly integrates with the NVDLA (NVIDIA's open-source DLA for accelerating CNNs) while maintaining dataflow compliance and boosting hardware efficiency. Analysis across various datapath granularities shows that for INT8 precision in 45nm CMOS, Tempus Core's PE cell unit (PCU) yields 59.3% and 15.3% reductions in area and power consumption, respectively, over NVDLA's CMAC unit. Considering a 16x16 PE array in Tempus Core, area and power improve by 75% and 62%, respectively, while delivering 5x and 4x iso-area throughput improvements for INT8 and INT4 precisions. Post-place-and-route analysis of Tempus Core's PCU shows that the 16x4 PE array for INT4 precision in 45nm CMOS requires only 0.017 mm^2 die area and consumes only 6.2mW of total power. We demonstrate that area-power-efficient unary-based hardware can be seamlessly integrated into conventional DLAs, paving the way for efficient unary hardware for edge AI inference.
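The abstract's tub (temporal-unary-binary) multiplier pairs a temporally unary-encoded operand with a conventional binary operand. A minimal behavioral sketch of that idea follows; this is a software model only, and the function name, interface, and cycle-level details are illustrative assumptions, not the paper's actual RTL:

```python
def tub_multiply(activation: int, weight: int, bits: int = 4) -> int:
    """Behavioral model of temporal-unary-binary multiplication.

    The unsigned activation is temporally unary-encoded: value N is
    represented as N consecutive '1' cycles within a 2^bits-cycle window.
    The binary weight is accumulated once per '1' cycle, so the final
    accumulator holds activation * weight.
    """
    assert 0 <= activation < (1 << bits), "activation out of encoding range"
    # Temporal-unary encoding: N high cycles followed by low cycles.
    unary_stream = [1] * activation + [0] * ((1 << bits) - activation)
    acc = 0
    for pulse in unary_stream:   # one iteration models one clock cycle
        if pulse:                # while the unary pulse is high,
            acc += weight        # accumulate the binary weight
    return acc

print(tub_multiply(5, 7))   # 35, i.e. 5 * 7 over a 16-cycle INT4 window
```

A value N in temporal-unary form occupies N clock cycles out of a 2^bits window, which is why this encoding pays off chiefly at low precisions such as INT4 and INT8, matching the paper's target regime.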
Problem

Research questions and friction points this paper is trying to address.

Deep Neural Networks
Edge Devices
Efficient Convolution Processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tempus Core
low-precision edge devices
NVDLA integration
Prabhu Vellaisamy
ECE Department, Carnegie Mellon University
Harideep Nair
ECE Department, Carnegie Mellon University
Thomas Kang
ECE Department, Carnegie Mellon University
Yichen Ni
ECE Department, Carnegie Mellon University
Haoyang Fan
ECE Department, Carnegie Mellon University
Bin Qi
ECE Department, Carnegie Mellon University
Jeff Chen
ECE Department, Carnegie Mellon University
Shawn Blanton
ECE Department, Carnegie Mellon University
John Paul Shen
Carnegie Mellon University
Computer Architecture