Fast inference with Kronecker-sparse matrices

📅 2024-05-23

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Kronecker-sparse (KS) matrix multiplication on GPUs suffers from high data-movement overhead: existing specialized implementations spend up to 50% of their runtime on memory rewriting operations. This work introduces a tiling strategy adapted to the KS structure that reduces reads and writes between levels of the GPU memory hierarchy, and implements it in a new CUDA kernel. It also presents the first benchmarks that jointly measure energy consumption and execution time for KS matrix multiplication. The kernel achieves a median 1.4× speed-up and a 15% reduction in energy consumption, and the authors apply it to accelerate transformer inference. The two main contributions are (i) a tiling design guided by the Kronecker-sparsity structure and (ii) a joint time and energy benchmark for KS operators.

📝 Abstract

This paper benchmarks and improves existing GPU matrix multiplication algorithms specialized for Kronecker-sparse matrices, whose sparsity patterns are described by Kronecker products. These matrices have recently gained popularity as replacements for dense matrices in neural networks because they preserve accuracy while using fewer parameters. We present the first energy and time benchmarks for multiplication with such matrices, helping users identify scenarios where Kronecker-sparse matrices are more time- and energy-efficient than their dense counterparts. Our benchmark also reveals that specialized implementations spend up to 50% of their total runtime on memory rewriting operations. To reduce these memory transfers, we introduce a new tiling strategy adapted to the Kronecker-sparsity structure, which reduces reads and writes between levels of GPU memory. We implement this tiling strategy in a new CUDA kernel that achieves a median speed-up of 1.4× while also cutting energy consumption by 15%. We further demonstrate the broader impact of our results by applying the new kernel to accelerate transformer inference.
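
To make "sparsity patterns described by Kronecker products" concrete, the following is a minimal NumPy sketch. It assumes the support has the form I_a ⊗ 1_{b×c} ⊗ I_d for integers (a, b, c, d); this parametrization and the helper names `ks_support` and `ks_density` are illustrative assumptions, not taken from the text above.

```python
import numpy as np


def ks_support(a, b, c, d):
    """Boolean mask of shape (a*b*d, a*c*d).

    Assumed pattern: kron(kron(I_a, ones(b, c)), I_d).
    """
    block = np.kron(np.eye(a), np.ones((b, c)))  # shape (a*b, a*c)
    return np.kron(block, np.eye(d)).astype(bool)


def ks_density(a, b, c, d):
    """Fraction of entries allowed to be nonzero under the assumed pattern."""
    nnz = a * b * c * d                  # one b-by-c block per (i, k) pair
    rows, cols = a * b * d, a * c * d
    return nnz / (rows * cols)           # simplifies to 1 / (a * d)


mask = ks_support(2, 2, 3, 2)
print(mask.shape)                # (8, 12)
print(int(mask.sum()))           # 24
print(ks_density(2, 2, 3, 2))    # 0.25
```

Under this assumption, the number of stored parameters shrinks by a factor of a·d relative to a dense matrix of the same shape, which is the source of the parameter savings the abstract mentions.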

Problem

Research questions and friction points this paper is trying to address.

Reducing high data movement costs in KS matrix multiplication

Improving GPU kernel efficiency for Kronecker-sparse matrices

Enhancing speed and energy efficiency in deep learning models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fused output-stationary GPU kernel

Reduces global memory traffic threefold

Heuristic predicts performance improvements
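
As a reference for the operation such a kernel computes, here is a hedged NumPy sketch of multiplying a batch of inputs by a Kronecker-sparse matrix while touching only its nonzero entries. It is not the paper's CUDA kernel and says nothing about its tiling. The support I_a ⊗ 1_{b×c} ⊗ I_d, the weight layout `W[i, j, l, k]`, and the function names are assumptions made for illustration.

```python
import numpy as np


def ks_to_dense(W):
    """Expand nonzeros W of shape (a, b, c, d) into the full (a*b*d, a*c*d) matrix."""
    a, b, c, d = W.shape
    K = np.zeros((a, b, d, a, c, d), dtype=W.dtype)
    for i in range(a):
        for k in range(d):
            # Entry at row (i, j, k) and column (i, l, k) is W[i, j, l, k].
            K[i, :, k, i, :, k] = W[i, :, :, k]
    return K.reshape(a * b * d, a * c * d)


def ks_matmul(X, W):
    """Compute X @ K.T without materializing the zeros of K.

    X has shape (n, a*c*d); the result has shape (n, a*b*d).
    """
    a, b, c, d = W.shape
    n = X.shape[0]
    Xr = X.reshape(n, a, c, d)
    # For each output (i, j, k), reduce over l only.
    Y = np.einsum("ijlk,nilk->nijk", W, Xr)
    return Y.reshape(n, a * b * d)


rng = np.random.default_rng(0)
W = rng.standard_normal((2, 2, 3, 2))
X = rng.standard_normal((4, 12))
assert np.allclose(ks_matmul(X, W), X @ ks_to_dense(W).T)
```

In this sketch each output element is produced by a single reduction over `l` and written once, which is loosely the idea behind an output-stationary schedule; how the actual kernel maps this onto GPU memory levels is not described here.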
