🤖 AI Summary
Existing scientific computing codes are difficult to port efficiently to specialized architectures such as AMD's AI Engines (AIEs), often requiring extensive manual refactoring. This work proposes a tensor-abstraction-based compilation approach that automatically lifts general-purpose loops annotated with lightweight OpenMP directives into tensor semantics, and builds an end-to-end compilation pipeline that maps the computation onto the AIE execution model. Because only minimal OpenMP directives are required, the method significantly reduces programming complexity, and it also supports cooperative CPU–NPU execution. Across six scientific and AI kernel benchmarks at float32 precision, the NPU delivers performance comparable to a multi-core CPU at lower energy to solution in every case; for two scientific computing kernels, running across the CPU and NPU together yields up to a 40% performance improvement and a 15% energy reduction compared to the CPU alone.
📝 Abstract
It has been demonstrated that specialised architectures, such as FPGAs and AMD's AI Engines (AIEs), have the potential to deliver energy and performance advantages for scientific computing. Given the integration of AIEs into AMD's CPUs, this is an interesting avenue to explore, especially when executing on the edge or making better use of local, compute-constrained resources. However, a major challenge is enabling existing codes to run on this architecture without extensive modification. Put simply, porting codes to the AIE's execution model requires significant expertise and time.
In this paper we explore a compilation pipeline for efficiently mapping loops in general-purpose scientific codes to AIEs. By lifting the semantics of an application into tensors, we demonstrate that this representation captures the intention of general-purpose loops annotated with OpenMP, and that such high-level tensor information provides a richness that is effective when mapping to the AIEs. Requiring only an OpenMP-decorated loop, our approach significantly reduces code complexity when targeting the architecture. For six kernel benchmarks, representing AI and scientific computing, the NPU performs comparably to the multicore CPU for float32 when using our approach, in all cases at reduced energy to solution. For two scientific computing kernels, running across both the CPU and NPU together delivers up to a 40% improvement in performance and a 15% reduction in energy usage compared to the CPU alone.
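To give a sense of what an "OpenMP-decorated loop" might look like as input, the following is a minimal sketch in C. The kernel (`saxpy`), its signature, and the choice of `#pragma omp parallel for` are illustrative assumptions; the abstract does not state which directives or clauses the pipeline actually expects.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example kernel: y[i] = a * x[i] + y[i].
 * The single pragma is the only annotation on an otherwise plain loop.
 * The specific directive is an assumption, not taken from the paper. */
void saxpy(int n, float a, const float *x, float *y) {
    assert(x != NULL && y != NULL);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Each iteration is independent and touches `x` and `y` elementwise, so a loop of this shape can plausibly be read as a whole-array (tensor) operation rather than a sequence of scalar updates. Compiled without OpenMP support, the pragma is ignored and the loop runs serially with the same result.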