🤖 AI Summary
The growing demand for AI/ML workloads has widened the gap between high-level operator abstractions and efficient low-level hardware utilization. Current approaches rely heavily on hand-tuned microkernels or domain-specific libraries to reach near-peak performance, which limits scalability and developer productivity. This paper proposes an MLIR-based compiler framework built around a “nanokernel composition” mechanism: it automatically synthesizes architecture-specific microkernels directly in low-level IR through hierarchical abstraction and vectorized tile scheduling, with no manual intervention or external libraries. The method improves register utilization and instruction-level parallelism and generates nanokernels of production quality. On mainstream CPUs, its matrix multiplication performance matches that of state-of-the-art hand-optimized libraries (e.g., Intel MKL, OpenBLAS). This work provides the first systematic validation of compiler-driven, fully automatic generation of high-performance microkernels, and demonstrates its feasibility, generality, and competitiveness.
📝 Abstract
The rapidly evolving landscape of AI and machine learning workloads has widened the gap between high-level domain operations and efficient hardware utilization. Achieving near-peak performance still demands deep hardware expertise: experts either handcraft target-specific kernels (e.g., DeepSeek) or rely on specialized libraries (e.g., CUTLASS), both of which add complexity and limit scalability for most ML practitioners.
This paper introduces a compilation scheme that automatically generates scalable, high-performance microkernels by leveraging MLIR dialects to bridge domain-level operations and processor capabilities. Our approach removes the dependence on low-level libraries by enabling the compiler to generate near-optimal code directly. At its core is a mechanism for composing nanokernels from low-level IR constructs with near-optimal register utilization, forming efficient microkernels tailored to each target. We implement this technique in an MLIR-based compiler that supports both vector and tile-based CPU instructions. Experiments show that the generated nanokernels are of production quality and competitive with state-of-the-art microkernel libraries.
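To make the nanokernel/microkernel layering concrete, here is a minimal sketch in plain C of a register-tiled matrix multiply. It is an illustration of the general idea only, not the paper's compiler output or its MLIR lowering. The tile shape `MR` x `NR`, the function names `nanokernel_rank1`, `microkernel`, and `matmul_tiled`, and the row-major layout are all assumptions chosen for the example. The small accumulator array stands in for a register tile, and a real code generator would map the inner update onto vector or tile instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical register-tile shape; a real compiler would pick this per target. */
enum { MR = 4, NR = 4 };

/* "Nanokernel" (illustrative): one rank-1 update of the MR x NR accumulator
 * tile, acc += a_col * b_row. The accumulator stands in for registers. */
static inline void nanokernel_rank1(float acc[MR][NR],
                                    const float *a_col, const float *b_row) {
    for (int i = 0; i < MR; ++i)
        for (int j = 0; j < NR; ++j)
            acc[i][j] += a_col[i] * b_row[j];
}

/* "Microkernel" (illustrative): C_tile += A_panel * B_panel, built by
 * repeating the nanokernel along K. All matrices are row-major;
 * lda/ldb/ldc are the row strides in elements. */
static void microkernel(size_t K, const float *A, size_t lda,
                        const float *B, size_t ldb, float *C, size_t ldc) {
    float acc[MR][NR] = {{0}};
    for (size_t k = 0; k < K; ++k) {
        float a_col[MR];
        for (int i = 0; i < MR; ++i)
            a_col[i] = A[i * lda + k];      /* gather column k of the A panel */
        nanokernel_rank1(acc, a_col, &B[k * ldb]);
    }
    for (int i = 0; i < MR; ++i)            /* write the tile back once */
        for (int j = 0; j < NR; ++j)
            C[i * ldc + j] += acc[i][j];
}

/* Driver: C (M x N) += A (M x K) * B (K x N), tiled over the microkernel.
 * For brevity this sketch requires M and N to be multiples of MR and NR. */
void matmul_tiled(size_t M, size_t N, size_t K,
                  const float *A, const float *B, float *C) {
    assert(M % MR == 0 && N % NR == 0);
    for (size_t i = 0; i < M; i += MR)
        for (size_t j = 0; j < N; j += NR)
            microkernel(K, &A[i * K], K, &B[j], N, &C[i * N + j], N);
}
```

Calling `matmul_tiled(8, 8, 5, A, B, C)` on a zero-initialized `C` computes the ordinary product of an 8 x 5 and a 5 x 8 matrix. The design point the sketch tries to show is that the accumulator tile stays live across the whole K loop and is written back once, which is the kind of register reuse the abstract refers to.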