FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication

📅 2026-05-07

🤖 AI Summary
This work proposes FalconGEMM, a cross-platform framework designed to overcome hardware peak performance limitations and enhance the efficiency of large language model training and inference by enabling the first practical deployment of low-complexity matrix multiplication algorithms (LCMAs) on heterogeneous hardware. The framework integrates three core components: cross-platform code generation supporting GPUs (H20, A100) and CPUs (ARM, x86), execution optimization leveraging group-level parallelism and on-chip data reuse, and a lightweight performance model guiding optimal strategy selection. Experimental results demonstrate that FalconGEMM outperforms state-of-the-art GEMM libraries by 7.59%–17.85% across diverse hardware platforms and data types, and achieves speedups of 12.41%–55.61% over existing LCMA approaches such as AlphaTensor, thereby bridging the gap between theoretical algorithms and production-grade applications.
📝 Abstract
Peak-breaking matrix multiplication is a promising technique for improving deep learning (DL) performance, especially in LLM training and inference. We present FalconGEMM, a cross-platform framework that automates the deployment, optimization, and selection of Lower-Complexity Matrix Multiplication Algorithms (LCMAs) across diverse hardware. It introduces three key innovations: (1) a Deployment Module that enables portable execution across various hardware and input configurations through code generation; (2) an Execution Module with Group-Parallel Optimizations that maximizes on-chip data reuse, exploits parallel resources, and reduces bandwidth overhead; and (3) a Decision Module featuring a lightweight analytical performance model that selects the optimal strategy based on matrix shapes and hardware profiles. We evaluate FalconGEMM on LLM workloads across GPU (H20, A100) and CPU (ARM, x86) architectures with multiple data types. FalconGEMM delivers peak-breaking performance, outperforming GEMM libraries (e.g., cuBLAS, CUTLASS, Intel MKL) by 7.59%–17.85% and LCMA competitors such as AlphaTensor by 12.41%–55.61%. Our framework makes the theoretical promise of LCMAs practical for production deployment across the heterogeneous landscape of modern hardware.
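To make the idea of a lower-complexity matrix multiplication algorithm concrete, here is a minimal sketch of one level of Strassen's algorithm, the classic LCMA, in plain Python. The abstract does not say which LCMAs FalconGEMM deploys, so Strassen is an assumption chosen purely for illustration, and the helper names (`matmul`, `split`, `strassen_one_level`) are hypothetical and not part of FalconGEMM.

```python
def matmul(a, b):
    """Reference O(n^3) product of two list-of-lists matrices."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for i in range(rows)]


def add(a, b):
    """Elementwise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    """Elementwise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split(a):
    """Split an even-sized square matrix into its four quadrants."""
    h = len(a) // 2
    return ([r[:h] for r in a[:h]], [r[h:] for r in a[:h]],
            [r[:h] for r in a[h:]], [r[h:] for r in a[h:]])


def strassen_one_level(a, b):
    """One level of Strassen: 7 block products instead of the usual 8."""
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)

    # Seven block multiplications (each could itself be a tuned GEMM call).
    m1 = matmul(add(a11, a22), add(b11, b22))
    m2 = matmul(add(a21, a22), b11)
    m3 = matmul(a11, sub(b12, b22))
    m4 = matmul(a22, sub(b21, b11))
    m5 = matmul(add(a11, a12), b22)
    m6 = matmul(sub(a21, a11), add(b11, b12))
    m7 = matmul(sub(a12, a22), add(b21, b22))

    # Recombine the products into the four quadrants of C.
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)

    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom


if __name__ == "__main__":
    a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    b = [[2, 0, 1, 3], [1, 4, 0, 2], [3, 1, 2, 0], [0, 2, 5, 1]]
    print(strassen_one_level(a, b) == matmul(a, b))
```

The trade-off is visible in the code: one block multiplication is saved at the cost of 18 extra block additions and subtractions, which add memory traffic. That extra traffic is the kind of overhead the abstract's on-chip data reuse and bandwidth reduction aim to contain. With integer inputs the result matches the reference product exactly; with floating-point inputs the different order of operations changes rounding.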
Problem

Research questions and friction points this paper is trying to address.

Matrix Multiplication
Lower-Complexity Algorithms
Deep Learning
LLM
Hardware Acceleration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lower-Complexity Matrix Multiplication
Cross-platform Optimization
Group-Parallel Execution
Analytical Performance Modeling
Peak-Breaking GEMM
Authors
Honglin Zhu
Tencent, Shenzhen, China
Jiaping Cao
Tencent, Shenzhen, China; The Hong Kong Polytechnic University, Hong Kong
Jiang Shao
NVIDIA, Beijing, China
Siyuan Feng
Shanghai Innovation Institute
Machine Learning Systems
Qian Qiu
Tencent, Shenzhen, China
Peng Chen
RIKEN Center for Computational Science (R-CCS)
HPC, GPGPU, Machine Learning, Image Processing
Xu Zhang
Southern University of Science and Technology, Shenzhen, China; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Yixian Zhou
Tencent, Shenzhen, China
Man Lung Yiu
Professor, Hong Kong Polytechnic University
Database
Guang Ji
NVIDIA, Beijing, China
Minwen Deng
Tencent, Shenzhen, China
Wenxi Zhu
Tencent
High Performance Computing, Compiler
Jintao Meng
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China