Library Liberation: Competitive Performance Matmul Through Compiler-composed Nanokernels

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
The growing demand for AI/ML workloads has widened the gap between high-level operator abstractions and efficient low-level hardware utilization. Current approaches rely heavily on hand-tuned microkernels or domain-specific libraries to reach near-peak performance, which limits scalability and developer productivity. This paper proposes an MLIR-based compiler framework built around a novel "nanokernel composition" mechanism: it automatically synthesizes architecture-specific microkernels directly in low-level IR through hierarchical abstraction and vectorized tiling, with no manual intervention or external libraries. The method improves register utilization and instruction-level parallelism, producing nanokernels of production quality. On mainstream CPUs, its matrix multiplication performance matches that of state-of-the-art hand-optimized libraries (e.g., Intel MKL, OpenBLAS). This work provides the first systematic validation of compiler-driven, fully automatic generation of high-performance microkernels, demonstrating feasibility, generality, and competitiveness.

📝 Abstract
The rapidly evolving landscape of AI and machine learning workloads has widened the gap between high-level domain operations and efficient hardware utilization. Achieving near-peak performance still demands deep hardware expertise: experts either handcraft target-specific kernels (e.g., DeepSeek) or rely on specialized libraries (e.g., CUTLASS), both of which add complexity and limit scalability for most ML practitioners. This paper introduces a compilation scheme that automatically generates scalable, high-performance microkernels by leveraging MLIR dialects to bridge domain-level operations and processor capabilities. Our approach removes dependence on low-level libraries by enabling the compiler to auto-generate near-optimal code directly. At its core is a mechanism for composing nanokernels from low-level IR constructs with near-optimal register utilization, forming efficient microkernels tailored to each target. We implement this technique in an MLIR-based compiler supporting both vector and tile-based CPU instructions. Experiments show that the generated nanokernels are of production quality and competitive with state-of-the-art microkernel libraries.
Problem

Research questions and friction points this paper is trying to address.

Automatically generates scalable, high-performance matrix multiplication microkernels
Bridges domain operations and processor capabilities using MLIR dialects
Eliminates dependence on handcrafted kernels and specialized libraries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compiler-generated nanokernels replace handcrafted libraries
MLIR dialects bridge domain operations with hardware capabilities
Auto-generated code achieves near-optimal register utilization
Arun Thangamani
Intel Advanced Technologies Group, Intel Corporation, India
Md Asghar Ahmad Shahid
Intel Advanced Technologies Group, Intel Corporation, India
Adam Siemieniuk
Intel Advanced Technologies Group, Intel Corporation, Switzerland
Rolf Morel
Intel Advanced Technologies Group, Intel Corporation, UK
Renato Golin
Intel Advanced Technologies Group, Intel Corporation, UK
Alexander Heinecke
Intel Fellow at Intel Labs
AI/HPC and Parallel Computing