🤖 AI Summary
This paper addresses the combinatorial explosion of optimization decisions in high-performance neural network inference, including parallel tiling, microkernel selection, and data layout, by proposing a dynamic-programming-based program synthesis framework. Key contributions are: (1) a dynamic programming decomposition strategy tailored to affine cost models, balancing search efficiency and solution quality; (2) a compressed memoization-table representation that indexes specifications by nonnegative-integer ($\mathbb{Z}_{\geq 0}$) coordinates, significantly expanding the tractable program search space; and (3) an end-to-end lowering from XLA-style computation graphs to x86 assembly. The framework automatically generates high-throughput bfloat16-to-float32 vector-matrix multiplication kernels for Zen 1 CPUs that outperform hand-tuned implementations, and the resulting compiler infrastructure has been integrated into Google's gemma.cpp.
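The compressed memoization table described above can be illustrated with a minimal sketch. The class name `CompressedTable` and the one-dimensional, append-only design are illustrative assumptions, not the paper's actual data structure; the sketch only shows the core idea of indexing solutions by nonnegative integer coordinates and storing runs of identical adjacent solutions once.

```python
# Hypothetical sketch of a run-length-compressed memoization table:
# solutions are indexed by nonnegative integer coordinates, and runs of
# adjacent coordinates mapping to the same solution are stored as one entry.
# All names here are illustrative, not taken from the paper.
from bisect import bisect_right

class CompressedTable:
    """1-D run-length-compressed map from coordinate -> solution."""
    def __init__(self):
        self.starts = []  # run start coordinates, kept sorted
        self.values = []  # the solution shared by each run

    def put(self, coord, value):
        # Simplifying assumption: coordinates are inserted in increasing order.
        if self.values and self.values[-1] == value:
            return  # identical adjacent solution: the previous run covers it
        self.starts.append(coord)
        self.values.append(value)

    def get(self, coord):
        # Find the run whose start is the largest one <= coord.
        i = bisect_right(self.starts, coord) - 1
        return self.values[i]

table = CompressedTable()
for size in range(8):
    # Toy "solution": one kernel choice below a size threshold, another above.
    table.put(size, "microkernel_A" if size < 5 else "microkernel_B")

assert table.get(3) == "microkernel_A"
assert table.get(6) == "microkernel_B"
assert len(table.values) == 2  # eight coordinates stored as two runs
```

When many neighboring problem sizes share an optimal sub-solution, this kind of compression shrinks the table dramatically, which is what makes a much larger search space tractable.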
📝 Abstract
High-throughput neural network inference requires coordinating many optimization decisions, including parallel tiling, microkernel selection, and data layout. The product of these decisions forms a search space of programs that is typically intractably large. Existing approaches (e.g., auto-schedulers) often address this problem by sampling the space heuristically. In contrast, we introduce a dynamic-programming-based approach that explores more of the search space by iteratively decomposing large program specifications into smaller specifications reachable from a set of rewrites, then composing a final program from the rewrite at each step that minimizes an affine cost model. To reduce memory requirements, we employ a novel memoization table representation, which indexes specifications by coordinates in $\mathbb{Z}_{\geq 0}$ and compresses identical, adjacent solutions. This approach can visit a much larger set of programs than prior work. To evaluate the approach, we developed Morello, a compiler which lowers specifications roughly equivalent to a few-node XLA computation graph to x86. Notably, we found that an affine cost model is sufficient to surface high-throughput programs. For example, Morello synthesized a collection of matrix multiplication benchmarks targeting a Zen 1 CPU, including a 1×2048×16384 bfloat16-to-float32 vector-matrix multiply, which was integrated into Google's gemma.cpp.
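The decompose-and-compose loop the abstract describes can be sketched as follows. This is a toy, assuming a single rewrite (splitting the m dimension of a matrix multiply) and invented affine cost coefficients; the real system's specifications, rewrite set, and cost model are far richer.

```python
# Hedged sketch of the dynamic-programming search: a specification is
# recursively decomposed by rewrites into smaller specifications, and the
# cheapest composition under an affine cost model is memoized.
# The rewrite ("split m in half") and cost coefficients are invented toys.
from functools import lru_cache

def affine_cost(m, k, n):
    # Affine in the problem dimensions: c0 + c1*m + c2*k + c3*n (toy values).
    return 1 + 2 * m + 3 * k + 1 * n

@lru_cache(maxsize=None)
def best(m, k, n):
    """Return (cost, program) minimizing total cost over applicable rewrites."""
    # Rewrite 1: emit a single microkernel covering the whole spec.
    candidates = [(affine_cost(m, k, n), f"microkernel({m}x{k}x{n})")]
    # Rewrite 2: split the m dimension and compose the two sub-programs.
    if m > 1:
        half = m // 2
        c1, p1 = best(half, k, n)
        c2, p2 = best(m - half, k, n)
        candidates.append((c1 + c2, f"split_m({p1}; {p2})"))
    return min(candidates)

cost, program = best(4, 8, 8)
print(cost, program)  # with these toy coefficients: 41 microkernel(4x8x8)
```

Because each sub-specification's best solution is memoized (here via `lru_cache`), the search visits every reachable specification once rather than re-enumerating overlapping subproblems, which is what distinguishes this from heuristic sampling.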