FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication

📅 2026-05-07

🤖 AI Summary

This work proposes FalconGEMM, a cross-platform framework that makes lower-complexity matrix multiplication algorithms (LCMAs) practical to deploy on heterogeneous hardware, with the goal of exceeding hardware peak performance in large language model training and inference. The framework has three core components: cross-platform code generation for GPUs (H20, A100) and CPUs (ARM, x86); execution optimization based on group-level parallelism and on-chip data reuse; and a lightweight analytical performance model that selects the best strategy for a given matrix shape and hardware profile. In the reported experiments, FalconGEMM outperforms state-of-the-art GEMM libraries by 7.59%–17.85% across the tested hardware platforms and data types, and outperforms existing LCMA approaches such as AlphaTensor by 12.41%–55.61%, narrowing the gap between theoretical algorithms and production use.
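To make the term LCMA concrete, the following is a minimal sketch of one recursion level of Strassen's algorithm, the classic lower-complexity scheme that replaces the 8 block products of a 2×2 blocked multiplication with 7. It is an illustration only, not FalconGEMM's implementation; the helper names (`matmul`, `split`, `strassen_one_level`) are hypothetical, and the sketch assumes square matrices of even order stored as lists of rows.

```python
def matmul(a, b):
    """Textbook product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    """Element-wise difference of two equally shaped matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split(m):
    """Split a square matrix of even order into its four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen_one_level(a, b):
    """One level of Strassen: 7 block products and 18 block additions."""
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)

    # The seven block products (a standard blocked GEMM would need eight).
    m1 = matmul(add(a11, a22), add(b11, b22))
    m2 = matmul(add(a21, a22), b11)
    m3 = matmul(a11, sub(b12, b22))
    m4 = matmul(a22, sub(b21, b11))
    m5 = matmul(add(a11, a12), b22)
    m6 = matmul(sub(a21, a11), add(b11, b12))
    m7 = matmul(sub(a12, a22), add(b21, b22))

    # Recombine the products into the four output quadrants.
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)

    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom
```

The saving in multiplications is paid for with extra block additions and intermediate buffers, which is why the data-reuse and bandwidth optimizations mentioned in the summary matter for turning the lower operation count into a real speedup.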

📝 Abstract

Peak-breaking matrix multiplication is a promising technique for improving the performance of deep learning (DL), especially in LLM training and inference. We present FalconGEMM, a cross-platform framework that automates the deployment, optimization, and selection of Lower-Complexity Matrix Multiplication Algorithms (LCMAs) across diverse hardware. It makes three key contributions: (1) a Deployment Module that enables portable execution across various hardware and input configurations through code generation; (2) an Execution Module with Group-Parallel Optimizations that maximizes on-chip data reuse, utilizes parallel resources, and reduces bandwidth overhead; and (3) a Decision Module featuring a lightweight analytical performance model that selects the optimal strategy based on matrix shapes and hardware profiles. We evaluate FalconGEMM extensively on LLM workloads across GPU (H20, A100) and CPU (ARM, x86) architectures with multiple data types. FalconGEMM delivers peak-breaking performance, outperforming GEMM libraries (e.g., cuBLAS, CUTLASS, and Intel MKL) by 7.59%–17.85% and LCMA competitors such as AlphaTensor by 12.41%–55.61%. Our framework makes the theoretical promise of LCMAs practical for production deployment across the heterogeneous landscape of modern hardware.
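The idea behind a shape- and hardware-aware Decision Module can be sketched with a simple roofline-style estimate. This is not the paper's performance model, whose details the abstract does not give; it is a simplified stand-in that compares a standard GEMM against one unfused level of Strassen. The names `HardwareProfile`, `standard_cost`, `strassen_cost`, and `choose_strategy` are hypothetical, and the hardware numbers in the usage note are illustrative values, not measured specifications of any real device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareProfile:
    """Hypothetical hardware description used by the estimate."""
    peak_flops: float  # sustained FLOP/s for the chosen data type
    bandwidth: float   # main-memory bandwidth in bytes/s


def roofline_time(flops, bytes_moved, hw):
    """Roofline estimate: the slower of compute and memory traffic."""
    return max(flops / hw.peak_flops, bytes_moved / hw.bandwidth)


def standard_cost(m, n, k, elem_bytes):
    """FLOPs and bytes for C = A @ B, each matrix touched once."""
    flops = 2 * m * n * k
    traffic = elem_bytes * (m * k + k * n + m * n)
    return flops, traffic


def strassen_cost(m, n, k, elem_bytes):
    """FLOPs and bytes for one unfused Strassen level (assumed model)."""
    hm, hn, hk = m / 2, n / 2, k / 2
    a_blk, b_blk, c_blk = hm * hk, hk * hn, hm * hn
    # 18 block additions: 5 on A blocks, 5 on B blocks, 8 on C blocks.
    add_elems = 5 * a_blk + 5 * b_blk + 8 * c_blk
    flops = 7 * 2 * hm * hn * hk + add_elems
    # Each product reads two blocks and writes one; each addition likewise.
    traffic = elem_bytes * (7 * (a_blk + b_blk + c_blk) + 3 * add_elems)
    return flops, traffic


def choose_strategy(m, n, k, elem_bytes, hw):
    """Return the strategy with the lower estimated run time."""
    t_std = roofline_time(*standard_cost(m, n, k, elem_bytes), hw)
    t_lcma = roofline_time(*strassen_cost(m, n, k, elem_bytes), hw)
    return "strassen" if t_lcma < t_std else "standard"
```

With an illustrative profile such as `HardwareProfile(peak_flops=1e14, bandwidth=1e12)` and 2-byte elements, this estimate prefers `"strassen"` for a compute-bound 8192×8192×8192 problem and `"standard"` for a memory-bound 16×4096×4096 problem, where the extra addition traffic outweighs the saved multiplications.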

Problem

Research questions and friction points this paper is trying to address.

Matrix Multiplication

Lower-Complexity Algorithms

Deep Learning

LLM

Hardware Acceleration

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lower-Complexity Matrix Multiplication

Cross-platform Optimization

Group-Parallel Execution