🤖 AI Summary
Integrating custom hardware accelerators—particularly GEMM-based ones—into mainstream ML compilers remains challenging due to tight coupling between accelerator-specific optimizations and compiler internals. Method: This paper proposes a high-level, low-intrusion integration methodology for TVM that abstracts hardware scheduling interfaces to decouple accelerator characteristics from compiler implementation. Leveraging the CoSA design-space exploration framework, it automates hardware-aware scheduling optimizations—including tensor tiling, non-uniform mapping, and double buffering—without modifying TVM’s core infrastructure. Contribution/Results: Evaluated on the Gemmini accelerator, the approach achieves performance on par with hand-optimized toolchains while significantly improving developer productivity and cross-model/cross-architecture portability. The abstraction enables seamless reuse of scheduling policies across diverse accelerator microarchitectures and neural network workloads, reducing integration effort from weeks to hours.
📝 Abstract
The growing adoption of domain-specific architectures in edge computing platforms for deep learning has underscored the efficiency benefits of hardware accelerators. However, integrating custom accelerators into modern machine learning (ML) compilers remains a complex challenge, as it requires significant modifications to compilation layers and specialized scheduling techniques. Existing frameworks offer only partial solutions and force users to navigate intricate compiler internals.
In this paper, we introduce a TVM-based compilation approach targeting GEMM-based deep learning accelerators. Our approach abstracts away the complexities of compiler integration, enabling accelerators to be integrated seamlessly without in-depth knowledge of the underlying compiler. Furthermore, we extend and incorporate design-space exploration tools, specifically CoSA, to automate efficient tensor scheduling, accounting for factors such as uneven mapping and double buffering. Our framework is benchmarked on the Gemmini accelerator, demonstrating performance comparable to that of its specialized, manually implemented toolchain.
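To make the scheduling concepts concrete, the sketch below illustrates in plain NumPy what tensor tiling means for a GEMM workload: the output is computed block by block from tile-sized slices of the inputs, which is the loop structure a scheduler maps onto an accelerator's scratchpad. This is purely illustrative and not code from the paper or from TVM; the function name and tile size are hypothetical, and the double-buffering aspect is only indicated in a comment, since overlapping loads with compute requires actual hardware queues.

```python
import numpy as np

def gemm_tiled(A, B, tile=16):
    """GEMM with output tiling (illustrative sketch, not the paper's code).

    Each (tile x tile) block of C is accumulated from tile-sized slices
    of A and B, mirroring how a hardware-aware schedule maps tiles onto
    an accelerator's local scratchpad memory.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            for k0 in range(0, K, tile):
                # In a double-buffered schedule, the DMA load of the
                # *next* A/B tiles would overlap with this tile's compute.
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, k0:k0 + tile]
                    @ B[k0:k0 + tile, j0:j0 + tile]
                )
    return C

# Tiled result matches the untiled reference GEMM.
A = np.arange(32 * 32, dtype=np.float64).reshape(32, 32)
B = np.ones((32, 32))
assert np.allclose(gemm_tiled(A, B, tile=16), A @ B)
```

Choosing the tile size (and handling uneven mappings, where tile sizes do not divide the tensor dimensions evenly) is exactly the kind of decision the CoSA-based design-space exploration automates.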