🤖 AI Summary
This paper addresses the combinatorial explosion of optimization decisions in high-performance neural network inference, including parallel tiling, microkernel selection, and data layout, by proposing a dynamic-programming-based program synthesis framework. Key contributions are: (1) a dynamic programming decomposition strategy tailored to affine cost models, balancing search efficiency and solution quality; (2) a compressed memoization-table representation that indexes specifications by nonnegative-integer ($\mathbb{Z}_{\geq 0}$) coordinates, significantly expanding the tractable program search space; and (3) an end-to-end lowering from XLA-style computation graphs to x86 assembly. The framework automatically generates high-throughput bfloat16-to-float32 vector-matrix multiplication kernels for Zen 1 CPUs that outperform hand-tuned implementations, and the resulting compiler infrastructure has been integrated into Google's gemma.cpp.
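The compressed memoization table described above can be illustrated with a minimal sketch. The class name `CompressedTable` and the one-dimensional, append-only design are illustrative assumptions, not the paper's actual data structure; the sketch only shows the core idea of indexing solutions by nonnegative integer coordinates and storing runs of identical adjacent solutions once.

```python
# Hypothetical sketch of a run-length-compressed memoization table:
# solutions are indexed by nonnegative integer coordinates, and runs of
# adjacent coordinates mapping to the same solution are stored as one entry.
# All names here are illustrative, not taken from the paper.
from bisect import bisect_right

class CompressedTable:
    """1-D run-length-compressed map from coordinate -> solution."""
    def __init__(self):
        self.starts = []  # run start coordinates, kept sorted
        self.values = []  # the solution shared by each run

    def put(self, coord, value):
        # Simplifying assumption: coordinates are inserted in increasing order.
        if self.values and self.values[-1] == value:
            return  # identical adjacent solution: the previous run covers it
        self.starts.append(coord)
        self.values.append(value)

    def get(self, coord):
        # Find the run whose start is the largest one <= coord.
        i = bisect_right(self.starts, coord) - 1
        return self.values[i]

table = CompressedTable()
for size in range(8):
    # Toy "solution": one kernel choice below a size threshold, another above.
    table.put(size, "microkernel_A" if size < 5 else "microkernel_B")

assert table.get(3) == "microkernel_A"
assert table.get(6) == "microkernel_B"
assert len(table.values) == 2  # eight coordinates stored as two runs
```

When many neighboring problem sizes share an optimal sub-solution, this kind of compression shrinks the table dramatically, which is what makes a much larger search space tractable.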
📝 Abstract
High-throughput neural network inference requires coordinating many optimization decisions, including parallel tiling, microkernel selection, and data layout. The product of these decisions forms a search space of programs that is typically intractably large. Existing approaches (e.g., auto-schedulers) often address this problem by sampling the space heuristically. In contrast, we introduce a dynamic-programming-based approach that explores more of the search space by iteratively decomposing large program specifications into smaller specifications reachable from a set of rewrites, then composing a final program from the rewrite at each step that minimizes an affine cost model. To reduce memory requirements, we employ a novel memoization table representation, which indexes specifications by coordinates in $\mathbb{Z}_{\geq 0}$ and compresses identical, adjacent solutions. This approach can visit a much larger set of programs than prior work. To evaluate the approach, we developed Morello, a compiler which lowers specifications roughly equivalent to a few-node XLA computation graph to x86. Notably, we found that an affine cost model is sufficient to surface high-throughput programs. For example, Morello synthesized a collection of matrix multiplication benchmarks targeting a Zen 1 CPU, including a 1×2048×16384 bfloat16-to-float32 vector-matrix multiply, which was integrated into Google's gemma.cpp.
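The decompose-and-compose loop the abstract describes can be sketched as follows. This is a toy, assuming a single rewrite (splitting the m dimension of a matrix multiply) and invented affine cost coefficients; the real system's specifications, rewrite set, and cost model are far richer.

```python
# Hedged sketch of the dynamic-programming search: a specification is
# recursively decomposed by rewrites into smaller specifications, and the
# cheapest composition under an affine cost model is memoized.
# The rewrite ("split m in half") and cost coefficients are invented toys.
from functools import lru_cache

def affine_cost(m, k, n):
    # Affine in the problem dimensions: c0 + c1*m + c2*k + c3*n (toy values).
    return 1 + 2 * m + 3 * k + 1 * n

@lru_cache(maxsize=None)
def best(m, k, n):
    """Return (cost, program) minimizing total cost over applicable rewrites."""
    # Rewrite 1: emit a single microkernel covering the whole spec.
    candidates = [(affine_cost(m, k, n), f"microkernel({m}x{k}x{n})")]
    # Rewrite 2: split the m dimension and compose the two sub-programs.
    if m > 1:
        half = m // 2
        c1, p1 = best(half, k, n)
        c2, p2 = best(m - half, k, n)
        candidates.append((c1 + c2, f"split_m({p1}; {p2})"))
    return min(candidates)

cost, program = best(4, 8, 8)
print(cost, program)  # with these toy coefficients: 41 microkernel(4x8x8)
```

Because each sub-specification's best solution is memoized (here via `lru_cache`), the search visits every reachable specification once rather than re-enumerating overlapping subproblems, which is what distinguishes this from heuristic sampling.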