🤖 AI Summary
Transposed convolution (TCONV) on FPGA-based edge devices suffers from complex output mapping, computational redundancy, and low energy efficiency under the conventional Input-Oriented Mapping (IOM) method.
Method: This paper proposes MM2IM, a hardware-software co-designed acceleration framework that combines matrix multiplication (MatMul) with col2im to restructure the TCONV computation flow, eliminating both ineffectual computations and overlapping-sum accumulation.
Contribution/Results: Implemented via the SECDA-TFLite toolkit, MM2IM is a configurable accelerator achieving a 1.9× average speedup across 261 TCONV configurations against a dual-thread ARM Neon optimized CPU baseline, up to 4.2× on TCONV layers from well-known generative models, and up to 3× speedup with 2.4× energy reduction on the DCGAN and pix2pix GAN models. It attains at least 2× higher GOPs/DSP than comparable resource-constrained TCONV accelerators. The authors position this as efficient, low-overhead TCONV acceleration on resource-constrained edge FPGAs, improving the deployment efficiency of generative AI models.
📝 Abstract
Transposed Convolutions (TCONV) enable the up-scaling mechanism within generative Artificial Intelligence (AI) models. However, the predominant Input-Oriented Mapping (IOM) method for implementing TCONV has complex output mapping, overlapping sums, and ineffectual computations. These inefficiencies further exacerbate the performance bottleneck of TCONV and generative models on resource-constrained edge devices. To address this problem, in this paper we propose MM2IM, a hardware-software co-designed accelerator that combines Matrix Multiplication (MatMul) with col2im to process TCONV layers on resource-constrained edge devices efficiently. Using the SECDA-TFLite design toolkit, we implement MM2IM and evaluate its performance across 261 TCONV problem configurations, achieving an average speedup of 1.9x against a dual-thread ARM Neon optimized CPU baseline. We then evaluate the performance of MM2IM on a range of TCONV layers from well-known generative models, achieving up to 4.2x speedup, and compare it against similar resource-constrained TCONV accelerators, outperforming them by at least 2x in GOPs/DSP. Finally, we evaluate MM2IM on the DCGAN and pix2pix GAN models, achieving up to 3x speedup and 2.4x energy reduction against the CPU baseline.
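To make the MatMul + col2im flow concrete, below is a minimal NumPy sketch of the general technique the paper builds on: each input pixel multiplies the full kernel in one matrix multiplication, and a col2im step scatter-accumulates the resulting patches into the output, summing the overlaps. This is our own illustration (function name, shapes, and the no-padding assumption are ours), not the paper's accelerator implementation.

```python
import numpy as np

def tconv_mm2im(x, w, stride=2):
    """Transposed convolution expressed as MatMul followed by col2im.

    x: input feature map, shape (C_in, H, W)
    w: weights, shape (C_in, C_out, K, K)
    Returns output of shape (C_out, H_out, W_out), where
    H_out = (H - 1) * stride + K (no padding assumed).
    """
    C_in, H, W = x.shape
    _, C_out, K, _ = w.shape
    H_out = (H - 1) * stride + K
    W_out = (W - 1) * stride + K

    # 1) MatMul: (C_out*K*K, C_in) @ (C_in, H*W) -> one K x K output
    #    patch per (output channel, input pixel) pair.
    w_mat = w.transpose(1, 2, 3, 0).reshape(C_out * K * K, C_in)
    cols = w_mat @ x.reshape(C_in, H * W)

    # 2) col2im: scatter-accumulate each patch into the output,
    #    summing wherever strided patches overlap.
    y = np.zeros((C_out, H_out, W_out))
    patches = cols.reshape(C_out, K, K, H, W)
    for i in range(H):
        for j in range(W):
            y[:, i * stride:i * stride + K,
                 j * stride:j * stride + K] += patches[:, :, :, i, j]
    return y
```

The IOM inefficiencies the paper targets are visible here: the scatter step must resolve overlapping sums in the output, which MM2IM instead handles in hardware during accumulation.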