AI Summary
This work addresses the challenge of deploying low-latency, small-batch neural networks with fully on-chip weight storage in extreme-edge scientific computing, where traditional programmable logic falls short. The study systematically evaluates the inference performance gap between AI Engines and programmable logic, proposing spatiotemporal dataflow optimizations tailored for AI Engines. It introduces a novel metric, Latency-Adjusted Resource Equivalence (LARE), to rigorously delineate, for the first time, the operational regimes where AI Engines outperform programmable logic. Through comprehensive architectural analysis, microbenchmarking, spatial and API-level optimizations, and integration into the hls4ml toolchain, the authors successfully deploy multiple end-to-end neural networks, demonstrating the scalability and performance advantages of AI Engines under stringent resource constraints.
Abstract
Extreme-edge scientific applications use machine learning models to analyze sensor data and make real-time decisions. Their stringent latency and throughput requirements demand small batch sizes and require that model weights remain fully on-chip. Spatial dataflow implementations are common for extreme-edge applications; they work well for small networks but fail to scale to larger models due to inherent resource scaling limitations. AI Engines on modern FPGA SoCs offer a promising alternative with high compute density and additional on-chip memory. However, the architecture, programming model, and performance-scaling behavior of AI Engines differ fundamentally from those of programmable logic, making direct comparison non-trivial and the benefits of using AI Engines unclear. This work addresses how and when extreme-edge scientific neural networks should be implemented on AI Engines versus programmable logic. We provide systematic architectural characterization and micro-benchmarking and introduce a latency-adjusted resource equivalence (LARE) metric that identifies when AI Engine implementations outperform programmable logic designs. We further propose spatial and API-level dataflow optimizations tailored to low-latency scientific inference. Finally, we demonstrate the successful deployment of end-to-end neural networks on AI Engines that cannot fit on programmable logic when using the hls4ml toolchain.