Library Liberation: Competitive Performance Matmul Through Compiler-composed Nanokernels

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
The growing demand for AI/ML workloads has widened the gap between high-level operator abstractions and efficient low-level hardware utilization. Current approaches rely heavily on hand-tuned microkernels or domain-specific libraries to reach near-peak performance, which limits scalability and developer productivity. This paper proposes an MLIR-based compiler framework built around a novel "nanokernel composition" mechanism: it automatically synthesizes architecture-specific microkernels directly in low-level IR through hierarchical abstraction and vectorized tiling, with no manual intervention or external libraries. The method improves register utilization and instruction-level parallelism, producing nanokernels of production quality. On mainstream CPUs, its matrix multiplication performance matches that of state-of-the-art hand-optimized libraries (e.g., Intel MKL, OpenBLAS). This work provides the first systematic validation of compiler-driven, fully automatic generation of high-performance microkernels, demonstrating feasibility, generality, and competitiveness.

📝 Abstract
The rapidly evolving landscape of AI and machine learning workloads has widened the gap between high-level domain operations and efficient hardware utilization. Achieving near-peak performance still demands deep hardware expertise: experts either handcraft target-specific kernels (e.g., DeepSeek) or rely on specialized libraries (e.g., CUTLASS), both of which add complexity and limit scalability for most ML practitioners. This paper introduces a compilation scheme that automatically generates scalable, high-performance microkernels by leveraging MLIR dialects to bridge domain-level operations and processor capabilities. Our approach removes dependence on low-level libraries by enabling the compiler to auto-generate near-optimal code directly. At its core is a mechanism for composing nanokernels from low-level IR constructs with near-optimal register utilization, forming efficient microkernels tailored to each target. We implement this technique in an MLIR-based compiler supporting both vector and tile-based CPU instructions. Experiments show that the generated nanokernels are of production quality and competitive with state-of-the-art microkernel libraries.
Problem

Research questions and friction points this paper is trying to address.

Automatically generates scalable, high-performance matrix multiplication microkernels
Bridges domain operations and processor capabilities using MLIR dialects
Eliminates dependence on handcrafted kernels and specialized libraries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compiler-generated nanokernels replace handcrafted libraries
MLIR dialects bridge domain operations with hardware capabilities
Auto-generated code achieves near-optimal register utilization
Arun Thangamani
Intel Advanced Technologies Group, Intel Corporation, India
Md Asghar Ahmad Shahid
Intel Advanced Technologies Group, Intel Corporation, India
Adam Siemieniuk
Intel Advanced Technologies Group, Intel Corporation, Switzerland
Rolf Morel
Intel Advanced Technologies Group, Intel Corporation, UK
Renato Golin
Intel Advanced Technologies Group, Intel Corporation, UK
Alexander Heinecke
Intel Fellow at Intel Labs
AI/HPC and Parallel Computing