🤖 AI Summary
This work addresses the urgent need for low-latency, low-power on-device processing of sparse event-based audio streams generated by neuromorphic sensors in edge devices. The authors propose a graph neural network–based hardware acceleration architecture that enables end-to-end keyword spotting by modeling cochlea-encoded event data through graph convolution and recurrent sequence modeling. The quantized model is deployed on a system-on-chip (SoC) FPGA, achieving, to the best of the authors’ knowledge, the first end-to-end event-based audio keyword recognition implementation on FPGA. It attains 92.7% accuracy on the SHD dataset and 66.9–71.0% on SSC, with up to 95% word-end detection accuracy for keyword spotting. The design reduces model parameters by 10–67×, achieves an inference latency of only 10.53 µs, and consumes 1.18 W, significantly outperforming existing spiking neural network approaches.
📝 Abstract
As the volume of data recorded by embedded edge sensors increases, particularly from neuromorphic devices producing discrete event streams, there is a growing need for hardware-aware neural architectures that enable efficient, low-latency, and energy-conscious local processing. We present an FPGA implementation of event-graph neural networks for audio processing. We utilise an artificial cochlea that converts time-series signals into sparse event data, reducing memory and computation costs. Our architecture was implemented on a SoC FPGA and evaluated on two open-source datasets. For the classification task, our baseline floating-point model achieves 92.7% accuracy on the SHD dataset - only 2.4% below the state of the art - while requiring 10x to 67x fewer parameters. On SSC, our models achieve 66.9-71.0% accuracy. Compared to FPGA-based spiking neural networks, our quantised model reaches 92.3% accuracy, outperforming them by up to 19.3% while reducing resource usage and latency. For SSC, we report the first hardware-accelerated evaluation. We further demonstrate the first end-to-end FPGA implementation of event-audio keyword spotting (KWS), combining graph convolutional layers with recurrent sequence modelling. The system achieves up to 95% word-end detection accuracy, with only 10.53 µs latency and 1.18 W power consumption, establishing a strong benchmark for energy-efficient event-driven KWS.
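To make the event-graph idea concrete, the sketch below builds a graph from hypothetical cochlea events by linking events that are close in time - the kind of node/edge structure a graph convolutional layer would then consume. This is a loose illustration only, not the authors' construction: the event format `(timestamp_us, channel)`, the function `build_event_graph`, and the 100 µs radius are all illustrative assumptions.

```python
# Hypothetical sketch: build a graph from sparse cochlea events.
# Each event is (timestamp_us, channel); edges connect event pairs
# whose timestamps differ by at most time_radius_us. The event
# format and radius are illustrative assumptions, not the paper's.

def build_event_graph(events, time_radius_us=100):
    """Return (nodes, edges): nodes are events sorted by time,
    edges are index pairs (i, j), i < j, with t_j - t_i <= radius."""
    nodes = sorted(events)  # sort by timestamp
    edges = []
    for i, (t_i, _) in enumerate(nodes):
        for j in range(i + 1, len(nodes)):
            t_j, _ = nodes[j]
            if t_j - t_i > time_radius_us:
                break  # nodes are time-sorted, so no later pair qualifies
            edges.append((i, j))
    return nodes, edges

# Example: four events on three cochlea channels
events = [(0, 3), (40, 5), (500, 3), (530, 1)]
nodes, edges = build_event_graph(events)
print(edges)  # → [(0, 1), (2, 3)]
```

Because the events are sorted, each node only scans forward until the radius is exceeded, so sparsity of the event stream directly bounds the work per node - the property that motivates event graphs over dense spectrogram frames.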