AI Summary
Traditional CGRA compilation approaches struggle to effectively exploit parallelism in computational kernels containing implicit matrix multiplication (mmul) operations, leading to suboptimal performance. This work proposes a novel compilation framework that integrates the polyhedral model with parameterized mmul scheduling. By applying loop permutation, tiling, and polyhedral transformations, the framework automatically identifies hidden mmul patterns in source code and replaces them with pre-optimized kernels, while the remaining code is compiled independently. This hybrid strategy significantly improves CGRA resource utilization, achieving up to a 9.1× speedup on various benchmarks featuring implicit mmul operations. Moreover, the approach demonstrates scalability across CGRA architectures of different sizes.
Abstract
Modern computing workloads commonly involve matrix-matrix multiplication (mmul) as a core computing pattern. Coarse-Grained Reconfigurable Arrays (CGRAs) can flexibly and efficiently support it, since they combine operation-level reconfigurability and high energy efficiency. However, mapping computational kernels that include mmul with state-of-the-art compilation strategies often leads to suboptimal results, since its multi-dimensional structure hampers the uncovering of its inherent parallelism and, ultimately, runtime performance. Here, we take a different position: we introduce a specialized mmul CGRA kernel schedule, parametrizable across different CGRA sizes. Then, we describe a novel compilation methodology that adapts program representations to effectively leverage it, employing polyhedral transformations to analyze complex computational patterns and expose hidden mmul operations through loop reordering and splitting. The identified patterns are then substituted with optimized assembly, while the remaining program sections are compiled independently. CGRA configurations are then generated, encompassing pre-compiled and compiled parts. Our strategy maximizes resource utilization and ultimately run-time performance, even when mmul is not directly apparent in the source code. The experimental results show speedups up to 9.1x across different benchmarks that contain hidden mmuls and CGRA instances of various sizes.
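As an illustration of what a "hidden" mmul can look like, the sketch below shows a loop nest whose reduction loop is outermost, followed by a permuted and tiled version in the canonical form that a pre-optimized kernel would expect. This is only a minimal sketch under stated assumptions: the function names `kernel_original` and `mmul_tiled`, the `tile` parameter (standing in for a CGRA dimension), and the choice of Python are all hypothetical and are not taken from the paper, whose framework performs such rewrites automatically via polyhedral analysis and emits CGRA assembly.

```python
def kernel_original(A, B, n, m, p):
    """Loop nest with an implicit mmul: the reduction loop k is outermost,
    so the (i, j, k) matrix-multiply structure is not directly apparent."""
    C = [[0] * p for _ in range(n)]
    for k in range(m):
        for j in range(p):
            for i in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C


def mmul_tiled(A, B, n, m, p, tile):
    """Hypothetical stand-in for a pre-optimized mmul kernel.

    Loops are permuted to (i, j, k) order and the i and j loops are split
    into tiles of size `tile` (an assumed proxy for the CGRA array size).
    """
    C = [[0] * p for _ in range(n)]
    for ii in range(0, n, tile):          # tile loop over rows of C
        for jj in range(0, p, tile):      # tile loop over columns of C
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, p)):
                    acc = 0
                    for k in range(m):    # reduction is now innermost
                        acc += A[i][k] * B[k][j]
                    C[i][j] = acc
    return C


# Both versions compute the same product on a small 3x2 by 2x3 example.
A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8, 9], [10, 11, 12]]
assert kernel_original(A, B, 3, 2, 3) == mmul_tiled(A, B, 3, 2, 3, tile=2)
```

Because the two loop nests compute the same values, a compiler that recognizes the first form can, in principle, substitute a call to a kernel shaped like the second; the permutation is legal here because the only loop-carried dependence is the accumulation into `C[i][j]`, and the integer additions can be reordered freely.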