🤖 AI Summary
Current edge AI inference architectures overemphasize peak TOPS, leading to inflated silicon cost, distorted real-world performance, and low compute utilization. To address this, we propose a co-design methodology integrating NPU microarchitecture and compiler optimization. Our approach pairs a data-driven, flexible NPU microarchitecture with a constraint-programming compiler framework that enables workload-aware dataflow scheduling, fine-grained resource allocation, and coordinated memory-hierarchy management. Evaluated under identical peak TOPS and memory-capacity constraints, our method achieves an average 1.8× speedup in inference latency (up to 4× peak) over conventional NPUs. Notably, it outperforms traditional NPUs with double the hardware resources by up to 3.3×, while significantly improving energy efficiency and hardware utilization. The design thus delivers high performance, architectural flexibility, and low implementation overhead, enabling cost-effective, scalable edge AI acceleration.
📝 Abstract
Neural Processing Units (NPUs) are key to enabling efficient AI inference in resource-constrained edge environments. While peak tera operations per second (TOPS) is often used to gauge performance, it poorly reflects real-world performance and instead correlates primarily with higher silicon cost. To address this, architects must focus on maximizing compute utilization without sacrificing flexibility. This paper presents the eIQ Neutron efficient-NPU, integrated into a commercial flagship MPU, alongside co-designed compiler algorithms. The architecture employs a flexible, data-driven design, while the compiler uses a constraint-programming approach to optimize compute and data movement based on workload characteristics. Compared to the leading embedded NPU and compiler stack, our solution achieves an average speedup of 1.8x (4x peak) at equal TOPS and memory resources across standard AI benchmarks. Even against NPUs with double the compute and memory resources, Neutron delivers up to 3.3x higher performance.
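To make the constraint-programming idea concrete, the sketch below shows, in grossly simplified form, what a workload-aware scheduler optimizes: it enumerates tilings of a matrix multiply, keeps only those whose working set fits the on-chip buffer (the constraint), and picks the one with the least off-chip traffic (the objective). The workload shape, SRAM budget, and cost model are all hypothetical illustrations, not the paper's actual formulation; a production compiler like the one described would hand a far richer model to a real CP solver rather than enumerate.

```python
# Toy illustration of dataflow scheduling as constraint optimization.
# Hypothetical setup: pick output-tile sizes (tm, tn) for an
# M x K @ K x N matmul so the tiles fit in on-chip SRAM while
# minimizing the off-chip (DRAM) traffic implied by the tiling.
from itertools import product

M, N, K = 256, 256, 256   # assumed workload shape (elements)
SRAM = 32 * 1024          # assumed on-chip buffer budget (elements)

def divisors(x):
    return [d for d in range(1, x + 1) if x % d == 0]

def sram_use(tm, tn):
    # One A-tile (tm x K), one B-tile (K x tn), one output tile (tm x tn).
    return tm * K + K * tn + tm * tn

def dram_traffic(tm, tn):
    # A is re-read once per column-tile, B once per row-tile, C written once.
    return (N // tn) * M * K + (M // tm) * K * N + M * N

# Feasible set = all tilings satisfying the memory constraint.
feasible = [(tm, tn) for tm, tn in product(divisors(M), divisors(N))
            if sram_use(tm, tn) <= SRAM]

# Objective: minimize off-chip data movement over the feasible set.
best = min(feasible, key=lambda t: dram_traffic(*t))
print(best, dram_traffic(*best))
```

Even this toy version shows why peak TOPS alone is misleading: two schedules with identical compute can differ severalfold in data movement, and it is the constraint solver, not the MAC array, that closes that gap.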