🤖 AI Summary
This work addresses the high power consumption and redundant preprocessing of keyword spotting on edge devices by proposing the first end-to-end FPGA system that directly processes event streams from a neuromorphic auditory sensor. By integrating neuromorphic sensing, a graph neural network, a compute-near-memory architecture, and quantized inference on a single FPGA device, the system performs real-time keyword recognition without conventional audio preprocessing. Evaluated on the Google Speech Commands v2 dataset processed through the neuromorphic sensor, the quantized system achieves an accuracy of 87.43% with an end-to-end latency below 35 µs and an average power consumption of 1.12 W.
📝 Abstract
With the rapid growth of mobile robotics and embedded intelligence, there is an increasing demand for efficient on-device data processing on edge platforms. A promising research direction is the use of neuromorphic sensors inspired by human sensory systems, which generate sparse, event-based data encoding changes in the environment. In this work, we present the first end-to-end FPGA implementation of a keyword spotting system that integrates a Neuromorphic Auditory Sensor (NAS) and a graph neural network (GNN) on a single FPGA device, enabling real-time processing of raw audio data. The proposed architecture eliminates conventional signal preprocessing and operates directly on event-based audio streams. Leveraging a compute-near-memory network architecture, the system achieves efficient inference with low latency and low power consumption. Experimental results demonstrate an accuracy of 87.43% after quantization on the Google Speech Commands v2 dataset processed through the neuromorphic sensor, with end-to-end latency below 35 µs and average power consumption of 1.12 W. The processed datasets, software models, and hardware modules are available at https://github.com/vision-agh/NAS-GNN-KWS.
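To put the reported latency and power figures in perspective, the following minimal sketch computes a rough upper bound on the energy spent during one end-to-end latency window. It assumes that the average power of 1.12 W is drawn for the full 35 µs window, which the abstract does not state; the function name `energy_upper_bound_uj` is hypothetical and not part of the released code.

```python
def energy_upper_bound_uj(avg_power_w: float, latency_us: float) -> float:
    """Rough energy bound in microjoules for one latency window.

    Assumption (not from the paper): the average power is drawn
    for the whole latency window. Watts times microseconds gives
    microjoules directly, so no unit conversion is needed.
    """
    if avg_power_w < 0 or latency_us < 0:
        raise ValueError("power and latency must be non-negative")
    return avg_power_w * latency_us


if __name__ == "__main__":
    # Figures reported in the abstract: 1.12 W average power,
    # end-to-end latency below 35 microseconds.
    bound = energy_upper_bound_uj(1.12, 35.0)
    print(f"Energy per latency window: below about {bound:.1f} uJ")
```

Because the latency is reported only as an upper limit, the result is itself an upper limit under the stated assumption, not a measured energy per inference.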