🤖 AI Summary
To address the high computational cost of matrix multiplication (MatMul) in deep neural networks (DNNs), and the trade-offs of existing GPU-accelerated alternatives (reduced parameter count or degraded accuracy), this paper proposes Strassen-Tile (STL), a novel learnable bilinear operator. STL applies a differentiable blockwise change of basis to weight and activation tiles, performs TensorCore-friendly elementwise multiplications between the encoded tiles, and maps the products back through a learned decoding transformation, an algebraic pipeline inspired by fast matrix and polynomial multiplication. As the first GPU-native MatMul replacement that does not reduce (and in fact increases) the number of trainable parameters, STL offers higher representational capacity at lower FLOPs. On T2T-ViT trained from scratch on ImageNet-1K, replacing all linear layers with STL reduces FLOPs by 2.7× while *improving* Top-1 accuracy by 0.5%. When fine-tuning TinyLlama, STL matches the accuracy of 2:4 structured sparsity with a 2.2× FLOP speedup, versus 1.7× for the 2:4 baseline. The paper also discusses a group-theoretic approach to discovering universal STL encoders, which could enable fast black-box approximate matrix multiplication (AMM).
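Concretely, the pipeline above can be sketched as follows (the notation here is illustrative, not taken from the paper): split $X$ and $W$ into $t \times t$ tiles, apply learnable encoders $E_X, E_W$ that map each tile to $r$ coefficients, multiply the encodings elementwise, and decode with a learnable map $D$:

$$\mathrm{STL}(X, W)_{i,j} \;=\; D\!\Big(\sum_{k} E_X\big(X_{i,k}\big) \odot E_W\big(W_{k,j}\big)\Big),$$

where $\odot$ is the elementwise (Hadamard) product and the per-tile multiplication cost is governed by $r$ rather than the $t^3$ products of a plain tile MatMul.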
📝 Abstract
We propose a cheaper alternative bilinear operator to matrix multiplication in deep neural networks (DNNs). Unlike many stubborn attempts to accelerate MatMuls in DNN inference, this operator is supported by capabilities of existing GPU hardware, most notably NVIDIA TensorCores. To our knowledge, this is the first GPU-native acceleration technique which *does not decrease* (in fact, increases) the number of trainable parameters of the network, mitigating the accuracy loss of compression-based techniques. Hence, this operator is at the same time more expressive than MatMul, yet requires substantially *fewer* FLOPs to evaluate. We term this new operator *Strassen-Tile* (STL). The main idea behind STL$(X,W)$ is a *local* change of basis (learnable encoder) on weight and activation *tiles*, after which we perform batched *elementwise* products between tiles, and a final decoding transformation (inspired by algebraic pipelines from fast matrix and polynomial multiplication). We compare STL against two benchmarks. The first is the SoTA T2T-ViT on ImageNet-1K. Here we show that replacing *all* linear layers with STL and training from scratch results in a 2.7× reduction in FLOPs with a 0.5% *accuracy improvement*. Our second speed-accuracy comparison benchmark, for pretrained LLMs, is the most practical GPU-acceleration technique, 2:4 structured sparsity. Finetuning TinyLlama [tinyllama24] with STL layers on the SlimPajama dataset achieves accuracy similar to 2:4, with a 2.2× FLOP speedup compared to 1.7× for the latter. Finally, we discuss a group-theoretic approach for discovering *universal* encoders for STL, which could lead to fast *black-box* acceleration via approximate matrix multiplication (AMM).
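For concreteness, the sketch below shows one way to realize this encode-multiply-decode pipeline in PyTorch. The tile size `t`, number of products `r`, and the encoder/decoder matrices `E_x`, `E_w`, `D` are illustrative assumptions, not the authors' implementation.

```python
import torch

def strassen_tile_matmul(X, W, E_x, E_w, D, t):
    """Strassen-Tile-style approximate MatMul (illustrative sketch, not the paper's code).

    X: (n, m) activations, W: (m, p) weights, partitioned into t x t tiles.
    E_x, E_w: (r, t*t) learnable encoders (local change of basis per tile).
    D: (t*t, r) learnable decoder mapping r products back to an output tile.
    """
    n, m = X.shape
    _, p = W.shape
    assert n % t == 0 and m % t == 0 and p % t == 0

    # Partition into tiles: X_tiles[i, k] is the (t, t) block at block-row i, block-col k.
    X_tiles = X.reshape(n // t, t, m // t, t).permute(0, 2, 1, 3)  # (n/t, m/t, t, t)
    W_tiles = W.reshape(m // t, t, p // t, t).permute(0, 2, 1, 3)  # (m/t, p/t, t, t)

    # Encode every tile into r coefficients (the learnable change of basis).
    X_enc = X_tiles.reshape(n // t, m // t, t * t) @ E_x.T  # (n/t, m/t, r)
    W_enc = W_tiles.reshape(m // t, p // t, t * t) @ E_w.T  # (m/t, p/t, r)

    # Batched elementwise products between encoded tiles,
    # summed over the shared block index k.
    prod = torch.einsum('ikr,kjr->ijr', X_enc, W_enc)  # (n/t, p/t, r)

    # Decode the r products of each block pair back into a (t, t) output tile.
    out_tiles = (prod @ D.T).reshape(n // t, p // t, t, t)
    return out_tiles.permute(0, 2, 1, 3).reshape(n, p)


# Example usage with hypothetical sizes: 2x2 tiles and r = 7 products per tile pair.
t, r = 2, 7
E_x = torch.randn(r, t * t, requires_grad=True)
E_w = torch.randn(r, t * t, requires_grad=True)
D = torch.randn(t * t, r, requires_grad=True)
X, W = torch.randn(8, 16), torch.randn(16, 4)
Y = strassen_tile_matmul(X, W, E_x, E_w, D, t)  # (8, 4) approximate product
```

In this bilinear framing, classical Strassen corresponds to 2×2 tiles with r = 7 and fixed {0, ±1} encoders and decoder; STL instead learns these maps end to end and can choose r below the t³ multiplications of a plain tile product to trade FLOPs against accuracy.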