🤖 AI Summary
In high-performance computing (HPC) accelerator design, high-level hardware design tools such as Chisel and High-Level Synthesis (HLS) compilers can produce circuits that are less efficient than hand-designed hardware.
Method: This paper proposes a hierarchical decomposition methodology paired with an automated evaluation flow. Mathematical kernels (e.g., Fourier transforms, matrix multiplication, QR factorization) are broken down into reusable building blocks, each block is implemented in every programming environment under comparison (RTL, Chisel, C++ HLS), and the resulting designs are evaluated automatically for achievable frequency and resource utilization.
Contribution/Results: Comparing designs at each level of the hierarchy, rather than only at the level of the complete kernel, gives fairer comparisons across abstraction levels and helps localize the inefficiencies introduced by high-level tools. The findings are intended as guidance for both tool developers and hardware designers.
📝 Abstract
Developing efficient hardware accelerators for mathematical kernels used in scientific applications and machine learning has traditionally been a labor-intensive task. These accelerators typically require low-level programming in Verilog or other hardware description languages, along with significant manual optimization effort. To alleviate this challenge, high-level hardware design tools such as Chisel and High-Level Synthesis have recently emerged. However, as with any compiler, some of the generated hardware may be suboptimal compared to expert-crafted designs. Understanding where these inefficiencies arise is crucial, as it provides valuable insights for both users and tool developers. In this paper, we propose a methodology to hierarchically decompose mathematical kernels (such as Fourier transforms, matrix multiplication, and QR factorization) into a set of common building blocks, or primitives. The primitives are then implemented in each of the programming environments, and the larger algorithms are assembled from them. Furthermore, we employ an automatic approach to determine the achievable frequency and required resources of each design. Performing this experimentation at each level of the hierarchy provides fairer comparisons between designs and offers guidance for both tool developers and hardware designers to adopt better practices.
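To make the two ideas in the abstract concrete, the following Python sketch models them in software. It is only an illustration under stated assumptions, not the authors' framework: the primitive names (`mac`, `dot`), the call counter, and the `meets_timing` callback are all hypothetical. The first part assembles matrix multiplication from smaller primitives; the second part shows one plausible way to automate a frequency search, assuming timing closure is monotonic in the target frequency.

```python
from collections import Counter

# Hypothetical bookkeeping: how often each building block is instantiated.
primitive_calls = Counter()

def mac(acc, a, b):
    """Level 0 primitive: multiply-accumulate."""
    primitive_calls["mac"] += 1
    return acc + a * b

def dot(xs, ys):
    """Level 1 block: dot product assembled from MAC primitives."""
    primitive_calls["dot"] += 1
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)
    return acc

def matmul(A, B):
    """Level 2 kernel: matrix multiplication assembled from dot products."""
    cols = list(zip(*B))  # columns of B
    return [[dot(row, col) for col in cols] for row in A]

def max_frequency(meets_timing, lo_mhz, hi_mhz, tol_mhz=1.0):
    """Binary-search the highest passing frequency in [lo_mhz, hi_mhz].

    `meets_timing(freq_mhz)` is a hypothetical stand-in for one
    synthesis/place-and-route run; it is assumed to be monotonic
    (if a frequency fails, every higher frequency fails too).
    Returns None if even lo_mhz fails.
    """
    if not meets_timing(lo_mhz):
        return None
    if meets_timing(hi_mhz):
        return hi_mhz
    while hi_mhz - lo_mhz > tol_mhz:
        mid = (lo_mhz + hi_mhz) / 2
        if meets_timing(mid):
            lo_mhz = mid   # mid passes: search higher
        else:
            hi_mhz = mid   # mid fails: search lower
    return lo_mhz

# Toy stand-in for a design whose critical path is 4 ns (250 MHz limit).
def fake_design(freq_mhz):
    return 1000.0 / freq_mhz >= 4.0
```

In a real flow, the same search would be run once per level of the hierarchy (primitive, intermediate block, full kernel) and once per programming environment, so that a gap appearing at the kernel level can be traced back to the level where it first shows up.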