EDEA: Efficient Dual-Engine Accelerator for Depthwise Separable Convolution with Direct Data Transfer

📅 2024-09-16
🏛️ ACM Symposium on Cloud Computing
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address low processing-element utilization and costly intermediate data movement when accelerating depthwise separable convolution (DSC) on resource-constrained devices, this work proposes a dual-engine accelerator. Dedicated depthwise convolution (DWC) and pointwise convolution (PWC) engines keep their processing elements (PEs) fully utilized in all DSC layers, and DWC results are passed directly to the PWC engine, which reduces intermediate data access. A non-convolutional unit between the two engines merges dequantization, batch normalization, ReLU, and quantization into a single fixed-point multiply-add, so the pipeline can run in streaming fashion. Dataflow, data reuse, and configuration are chosen through a design space exploration on MobileNetV1 with CIFAR10. Implemented in GlobalFoundries 22 nm FDSOI, the design occupies 0.58 mm² and runs at 1 GHz (TT corner), reaching a peak energy efficiency of 13.43 TOPS/W and a throughput of 973.55 GOPS at 8-bit precision. Averaged over all DSC layers of MobileNetV1, the energy efficiency is 11.13 TOPS/W.

📝 Abstract
Depthwise separable convolution (DSC) has emerged as a crucial technique, especially for resource-constrained devices. In this paper, we propose a dual-engine for the DSC hardware accelerator, which enables the full utilization of depthwise convolution (DWC) and pointwise convolution (PWC) processing elements (PEs) in all DSC layers. To determine the optimal dataflow, data reuse, and configuration of the target architecture, we conduct a design space exploration using MobileNetV1 with the CIFAR10 dataset. In the architecture, we introduce an additional non-convolutional unit, which merges the dequantization, batch normalization (BN), ReLU, and quantization between DWC and PWC into a simple fixed-point multiplication and addition operation. This also reduces the intermediate data access between the DWC and PWC, enabling streaming operation and reducing latency. The proposed DSC dual-engine accelerator is implemented using the 22nm FDSOI technology from GlobalFoundries, occupying an area of $0.58~\mathrm{mm}^{2}$. After signoff, it can operate at 1 GHz at TT corner, achieving a peak energy efficiency of 13.43 TOPS/W with a throughput of 973.55 GOPS with 8-bit precision. The average energy efficiency of all DSC layers on MobileNetV1 is 11.13 TOPS/W, demonstrating substantial hardware efficiency improvements for DSC-based applications.
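The merging of dequantization, BN, ReLU, and quantization described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the function names, the symmetric per-tensor scales, the unsigned 8-bit output with zero point 0, and the 16 fractional bits are all assumptions chosen for illustration. Under those assumptions the four steps fold into one scale `k` and one offset `b`, and ReLU is absorbed by the lower clamp at 0.

```python
import math

def fold_params(s_in, s_w, gamma, beta, mean, var, s_out,
                eps=1e-5, frac_bits=16):
    """Fold dequantization, BN, and requantization into fixed-point k and b.

    All scales are assumed symmetric and per-tensor (hypothetical setup).
    """
    inv_std = 1.0 / math.sqrt(var + eps)
    k = s_in * s_w * gamma * inv_std / s_out       # combined scale
    b = (beta - gamma * mean * inv_std) / s_out    # combined offset
    one = 1 << frac_bits
    return round(k * one), round(b * one)

def fused_unit(acc, k_fx, b_fx, frac_bits=16, qmax=255):
    """One integer multiply, one add, a rounding shift, and a clamp.

    The clamp at 0 plays the role of ReLU (valid because s_out > 0).
    """
    y = (k_fx * acc + b_fx + (1 << (frac_bits - 1))) >> frac_bits
    return max(0, min(qmax, y))

def reference(acc, s_in, s_w, gamma, beta, mean, var, s_out,
              eps=1e-5, qmax=255):
    """Unfused floating-point chain: dequantize, BN, ReLU, quantize."""
    x = s_in * s_w * acc
    y = gamma * (x - mean) / math.sqrt(var + eps) + beta
    y = max(0.0, y)
    return max(0, min(qmax, math.floor(y / s_out + 0.5)))

# Hypothetical parameters for a single channel.
params = (0.02, 0.01, 1.2, 0.1, 0.05, 0.25, 0.03)
k_fx, b_fx = fold_params(*params)
for acc in (-5000, 0, 1234, 9000, 10**6):
    print(acc, fused_unit(acc, k_fx, b_fx), reference(acc, *params))
```

The fused and reference results can differ by at most one quantization step near rounding boundaries, since `k` and `b` are themselves rounded to fixed point.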
Problem

Research questions and friction points this paper is trying to address.

Low PE utilization when DSC layers are accelerated on resource-constrained devices.
Intermediate data access between the DWC and PWC stages, which adds latency and energy cost.
Selecting the dataflow, data reuse, and configuration of a DSC accelerator.
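As background for why DSC suits resource-constrained devices, and why its two stages carry very different workloads, the following sketch counts multiply-accumulate operations (MACs) for a standard convolution and for its depthwise separable counterpart. The layer dimensions are hypothetical and not taken from the paper; stride 1 and an output of size `h` by `w` are assumed.

```python
def standard_conv_macs(h, w, c_in, c_out, k=3):
    """MACs for a standard k x k convolution producing an h x w output."""
    return h * w * c_in * c_out * k * k

def dsc_macs(h, w, c_in, c_out, k=3):
    """MACs for depthwise (k x k per channel) and pointwise (1 x 1) stages."""
    dwc = h * w * c_in * k * k
    pwc = h * w * c_in * c_out
    return dwc, pwc

# Hypothetical layer shape, chosen only for illustration.
h, w, c_in, c_out = 32, 32, 64, 128
standard = standard_conv_macs(h, w, c_in, c_out)
dwc, pwc = dsc_macs(h, w, c_in, c_out)
ratio = (dwc + pwc) / standard  # equals 1/c_out + 1/k**2

print("standard MACs:", standard)
print("DWC MACs:", dwc, "PWC MACs:", pwc)
print("DSC / standard:", ratio)
print("PWC / DWC:", pwc / dwc)
```

For a 3x3 kernel the PWC stage needs `c_out / 9` times as many MACs as the DWC stage, which is one way to see why a single shared array is hard to keep busy across both stages.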
Innovation

Methods, ideas, or system contributions that make the work stand out.

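Dual-engine architecture with dedicated DWC and PWC engines whose PEs are fully utilized in all DSC layers.
Non-convolutional unit merges dequantization, BN, ReLU, and quantization into one fixed-point multiply-add, reducing intermediate data access.
22 nm FDSOI implementation: 0.58 mm², 1 GHz, peak 13.43 TOPS/W and 973.55 GOPS at 8-bit precision.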
Yi Chen
Chair of Integrated Digital Systems and Circuit Design, RWTH Aachen University, Germany
Jie Lou
Xiaohongshu
AlignmentRLHF
Malte Wabnitz
Chair of Integrated Digital Systems and Circuit Design, RWTH Aachen University, Germany
Johnson Loh
Chair of Integrated Digital Systems and Circuit Design, RWTH Aachen University, Germany
Tobias Gemmeke
iDS, RWTH Aachen University
VLSI Design