🤖 AI Summary
To address the high computational cost of matrix multiplication (MatMul) in deep neural networks (DNNs), and the trade-offs of existing GPU-accelerated alternatives (reduced parameter count or degraded accuracy), this paper proposes Strassen-Tile (STL), a novel learnable bilinear operator. STL applies a differentiable blockwise change of basis to weight and activation tiles, performs TensorCore-friendly elementwise multiplications between the encoded tiles, and maps the products back through a learned decoding transformation, an algebraic pipeline inspired by fast matrix and polynomial multiplication. As the first GPU-native MatMul replacement that does not reduce (and in fact increases) the number of trainable parameters, STL offers higher representational capacity at lower FLOPs. On T2T-ViT trained from scratch on ImageNet-1K, replacing all linear layers with STL reduces FLOPs by 2.7× while *improving* Top-1 accuracy by 0.5%. When fine-tuning TinyLlama, STL matches the accuracy of 2:4 structured sparsity with a 2.2× FLOP speedup, versus 1.7× for the 2:4 baseline. The paper also discusses a group-theoretic approach to discovering universal STL encoders, which could enable fast black-box approximate matrix multiplication (AMM).
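Concretely, the pipeline above can be sketched as follows (the notation here is illustrative, not taken from the paper): split $X$ and $W$ into $t \times t$ tiles, apply learnable encoders $E_X, E_W$ that map each tile to $r$ coefficients, multiply the encodings elementwise, and decode with a learnable map $D$:

$$\mathrm{STL}(X, W)_{i,j} \;=\; D\!\Big(\sum_{k} E_X\big(X_{i,k}\big) \odot E_W\big(W_{k,j}\big)\Big),$$

where $\odot$ is the elementwise (Hadamard) product and the per-tile multiplication cost is governed by $r$ rather than the $t^3$ products of a plain tile MatMul.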
📝 Abstract
We propose a cheaper alternative bilinear operator to matrix multiplication in deep neural networks (DNNs). Unlike many stubborn attempts to accelerate MatMuls in DNN inference, this operator is supported by capabilities of existing GPU hardware, most notably NVIDIA TensorCores. To our knowledge, this is the first GPU-native acceleration technique which *does not decrease* (in fact, increases) the number of trainable parameters of the network, mitigating the accuracy loss of compression-based techniques. Hence, this operator is at the same time more expressive than MatMul, yet requires substantially *fewer* FLOPs to evaluate. We term this new operator *Strassen-Tile* (STL). The main idea behind STL$(X,W)$ is a *local* change of basis (learnable encoder) on weight and activation *tiles*, after which we perform batched *elementwise* products between tiles, and a final decoding transformation (inspired by algebraic pipelines from fast matrix and polynomial multiplication). We compare STL against two benchmarks. The first is the SoTA T2T-ViT on ImageNet-1K. Here we show that replacing *all* linear layers with STL and training from scratch results in a 2.7× reduction in FLOPs with a 0.5% *accuracy improvement*. Our second speed-accuracy comparison benchmark, for pretrained LLMs, is the most practical GPU-acceleration technique, 2:4 structured sparsity. Finetuning TinyLlama [tinyllama24] with STL layers on the SlimPajama dataset achieves accuracy similar to 2:4, with a 2.2× FLOP speedup compared to 1.7× for the latter. Finally, we discuss a group-theoretic approach for discovering *universal* encoders for STL, which could lead to fast *black-box* acceleration via approximate matrix multiplication (AMM).
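For concreteness, the sketch below shows one way to realize this encode-multiply-decode pipeline in PyTorch. The tile size `t`, number of products `r`, and the encoder/decoder matrices `E_x`, `E_w`, `D` are illustrative assumptions, not the authors' implementation.

```python
import torch

def strassen_tile_matmul(X, W, E_x, E_w, D, t):
    """Strassen-Tile-style approximate MatMul (illustrative sketch, not the paper's code).

    X: (n, m) activations, W: (m, p) weights, partitioned into t x t tiles.
    E_x, E_w: (r, t*t) learnable encoders (local change of basis per tile).
    D: (t*t, r) learnable decoder mapping r products back to an output tile.
    """
    n, m = X.shape
    _, p = W.shape
    assert n % t == 0 and m % t == 0 and p % t == 0

    # Partition into tiles: X_tiles[i, k] is the (t, t) block at block-row i, block-col k.
    X_tiles = X.reshape(n // t, t, m // t, t).permute(0, 2, 1, 3)  # (n/t, m/t, t, t)
    W_tiles = W.reshape(m // t, t, p // t, t).permute(0, 2, 1, 3)  # (m/t, p/t, t, t)

    # Encode every tile into r coefficients (the learnable change of basis).
    X_enc = X_tiles.reshape(n // t, m // t, t * t) @ E_x.T  # (n/t, m/t, r)
    W_enc = W_tiles.reshape(m // t, p // t, t * t) @ E_w.T  # (m/t, p/t, r)

    # Batched elementwise products between encoded tiles,
    # summed over the shared block index k.
    prod = torch.einsum('ikr,kjr->ijr', X_enc, W_enc)  # (n/t, p/t, r)

    # Decode the r products of each block pair back into a (t, t) output tile.
    out_tiles = (prod @ D.T).reshape(n // t, p // t, t, t)
    return out_tiles.permute(0, 2, 1, 3).reshape(n, p)


# Example usage with hypothetical sizes: 2x2 tiles and r = 7 products per tile pair.
t, r = 2, 7
E_x = torch.randn(r, t * t, requires_grad=True)
E_w = torch.randn(r, t * t, requires_grad=True)
D = torch.randn(t * t, r, requires_grad=True)
X, W = torch.randn(8, 16), torch.randn(16, 4)
Y = strassen_tile_matmul(X, W, E_x, E_w, D, t)  # (8, 4) approximate product
```

In this bilinear framing, classical Strassen corresponds to 2×2 tiles with r = 7 and fixed {0, ±1} encoders and decoder; STL instead learns these maps end to end and can choose r below the t³ multiplications of a plain tile product to trade FLOPs against accuracy.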