Unified Kernel-Segregated Transpose Convolution Operation

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address redundant computation and thread underutilization in transposed convolution caused by conventional kernel decomposition when output feature map dimensions are odd, this paper proposes a unified kernel-driven paradigm coordinating four sub-kernels. A single fused CUDA kernel orchestrates the four logical sub-kernels, restructuring the transposed convolution operator to eliminate invalid computations and spurious memory accesses for odd-sized outputs. By integrating kernel fusion, lightweight memory optimization, and a novel scheduling architecture, the approach significantly improves computational density and memory efficiency without sacrificing accuracy. Experimental results demonstrate an average speedup of 2.03× on an RTX 2070 GPU (3.89× on CPU), a 3.5× acceleration for transposed convolution layers in GANs, and a 35 MB reduction in GPU memory footprint during EB-GAN deployment.

📝 Abstract
The optimization of the transpose convolution layer for deep learning applications is achieved with the kernel segregation mechanism. However, kernel segregation has disadvantages, such as computing extra elements when producing an output feature map with odd dimensions, which wastes launched threads. To mitigate this problem, we introduce a unified kernel segregation approach that limits the usage of memory and computational resources by employing one unified kernel to execute four sub-kernels. The findings reveal that the suggested approach achieves an average computational speedup of 2.03× (3.89×) when tested on specific datasets with an RTX 2070 GPU (Intel Xeon CPU). The ablation study shows an average computational speedup of 3.5× when evaluating the transpose convolution layers from well-known Generative Adversarial Networks (GANs). The implementation of the proposed method for the transpose convolution layers in the EB-GAN model demonstrates significant memory savings of up to 35 MB.
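The sub-kernel idea behind kernel segregation can be illustrated in one dimension: with stride 2, every output position of a given parity depends only on the kernel taps of that same parity, so one transposed convolution splits into independent ordinary convolutions with parity-subsampled sub-kernels (two in 1-D, four in 2-D). The following is a minimal NumPy sketch of that equivalence, not the paper's fused CUDA implementation; function names are illustrative.

```python
import numpy as np

def transpose_conv1d_naive(x, w, stride=2):
    """Scatter-style transposed convolution: out[i*stride + k] += x[i] * w[k]."""
    out = np.zeros((len(x) - 1) * stride + len(w))
    for i, xi in enumerate(x):
        for k, wk in enumerate(w):
            out[i * stride + k] += xi * wk
    return out

def transpose_conv1d_segregated(x, w):
    """Stride-2 case via kernel segregation: since o = 2i + k, the parity of an
    output index o equals the parity of k, so even and odd outputs are plain
    full convolutions with the even- and odd-indexed sub-kernels, respectively."""
    even = np.convolve(x, w[0::2])          # fills out[0], out[2], out[4], ...
    odd = np.convolve(x, w[1::2])           # fills out[1], out[3], out[5], ...
    out = np.zeros((len(x) - 1) * 2 + len(w))
    out[0::2] = even
    out[1::2] = odd
    return out

x = np.array([1.0, 2.0, -1.0, 3.0])
w = np.array([0.5, -1.0, 2.0])
assert np.allclose(transpose_conv1d_naive(x, w), transpose_conv1d_segregated(x, w))
```

Each sub-convolution touches only valid output positions, which is the source of the eliminated redundant work; the paper's contribution is executing all such sub-kernels from one fused GPU kernel rather than as separate launches.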
Problem

Research questions and friction points this paper is trying to address.

Optimizes transpose convolution for deep learning efficiency.
Reduces memory and computational resource usage.
Improves speed in GANs and other models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified kernel segregation reduces memory usage.
Single kernel executes four sub-kernels efficiently.
Achieves significant computational speedup and memory savings.
Vijay Srinivas Tida
Assistant Professor
Machine Learning · Deep Learning · VLSI · Natural Language Processing · Differential Privacy
Md. Imran Hossen
University of Louisiana at Lafayette
Liqun Shan
University of Louisiana at Lafayette
AI Security · Cyber-Physical System Security · Computer Vision · Data Analysis
Sai Venkatesh Chilukoti
University of Louisiana at Lafayette
Sonya Hsu
University of Louisiana at Lafayette
X. Hei
University of Louisiana at Lafayette