Towards efficient keyword spotting using spike-based time difference encoders

πŸ“… 2025-03-19
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the challenge of deploying keyword spotting on ultra-low-power edge devices, this paper proposes a spiking neural network (SNN) architecture built on Temporal Difference Encoder (TDE) neurons. The authors introduce TDE neurons to keyword spotting for the first time, combining frequency-domain temporal feature extraction with event-driven computation. Spoken digits are preprocessed with a formant decomposition and converted to spikes via rate-based encoding, then classified by three-layer SNNs whose hidden layer is either feedforward TDE, feedforward Current-Based Leaky Integrate-and-Fire (CuBa-LIF), or recurrent CuBa-LIF neurons. Evaluated on the TIdigits dataset, the TDE network achieves 89% accuracy, 18 percentage points higher than the feedforward CuBa-LIF network and close to the recurrent CuBa-LIF network (91%), while performing 92% fewer synaptic operations than the recurrent network with the same number of synapses. The model is also highly interpretable: its spiking responses directly reflect the spectral and temporal structure of the spoken keywords, pointing to substantial gains in energy efficiency from better temporal coding.
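The TDE principle summarized above, converting an input time difference into an output spike count, can be sketched as a minimal simulation: a spike on a facilitatory input charges a decaying gain trace, and a later spike on a trigger input injects current scaled by the remaining gain, so shorter gaps drive more output spikes. All parameter values, time constants, and the Euler-integration details below are illustrative assumptions, not the paper's implementation.

```python
import math

def tde_spike_count(fac_times, trig_times, tau_fac=0.02, tau_syn=0.01,
                    tau_mem=0.01, v_thresh=0.001, w=1.0, dt=0.001, t_max=0.2):
    """Toy Temporal Difference Encoder (TDE) neuron.

    A facilitatory spike arms a gain trace that decays with tau_fac; a
    trigger spike injects synaptic current proportional to the gain at
    that moment. The output spike count therefore encodes the time
    difference between the two inputs (shorter gap -> more spikes).
    """
    fac = {int(round(t / dt)) for t in fac_times}
    trig = {int(round(t / dt)) for t in trig_times}
    gain = i_syn = v = 0.0
    n_out = 0
    for k in range(int(t_max / dt)):
        gain *= math.exp(-dt / tau_fac)    # gain trace decay
        i_syn *= math.exp(-dt / tau_syn)   # synaptic current decay
        if k in fac:
            gain = 1.0                     # facilitatory input arms the gain
        if k in trig:
            i_syn += w * gain              # trigger input, scaled by gain
        v += dt * (i_syn - v / tau_mem)    # leaky membrane integration
        if v >= v_thresh:                  # threshold crossing: spike + reset
            n_out += 1
            v = 0.0
    return n_out
```

With these toy parameters, a 5 ms facilitatory-to-trigger gap produces several output spikes while a 50 ms gap produces none, illustrating the time-difference-to-rate conversion.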

πŸ“ Abstract
Keyword spotting in edge devices is becoming increasingly important as voice-activated assistants are widely used. However, its deployment is often limited by the extreme low-power constraints of the target embedded systems. Here, we explore the Temporal Difference Encoder (TDE) performance in keyword spotting. This recent neuron model encodes the time difference in instantaneous frequency and spike count to perform efficient keyword spotting with neuromorphic processors. We use the TIdigits dataset of spoken digits with a formant decomposition and rate-based encoding into spikes. We compare three Spiking Neural Network (SNN) architectures to learn and classify spatio-temporal signals. The proposed SNN architectures are made of three layers, with variations in the hidden layer, composed of either (1) feedforward TDE, (2) feedforward Current-Based Leaky Integrate-and-Fire (CuBa-LIF), or (3) recurrent CuBa-LIF neurons. We first show that the spike trains of the frequency-converted spoken digits carry a large amount of information in the temporal domain, reinforcing the importance of better exploiting temporal encoding for such a task. We then train the three SNNs with the same number of synaptic weights to quantify and compare their performance based on accuracy and synaptic operations. The resulting accuracy of the feedforward TDE network (89%) is higher than that of the feedforward CuBa-LIF network (71%) and close to that of the recurrent CuBa-LIF network (91%). However, the feedforward TDE-based network performs 92% fewer synaptic operations than the recurrent CuBa-LIF network with the same number of synapses. In addition, the results of the TDE network are highly interpretable and correlated with the frequency and timescale features of the spoken keywords in the dataset. Our findings suggest that the TDE is a promising neuron model for scalable event-driven processing of spatio-temporal patterns.
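For reference, the CuBa-LIF baseline neuron used in the comparison can be sketched in the same style: input spikes are low-pass filtered into a synaptic current, which a leaky membrane integrates until threshold. This is a minimal single-neuron illustration of the model class, not the paper's trained network; all parameter values and the regular-rate input used in the example are assumptions.

```python
import math

def cuba_lif_spike_count(in_times, w=0.5, tau_syn=0.005, tau_mem=0.01,
                         v_thresh=1.0, dt=0.001, t_max=0.1):
    """Toy Current-Based Leaky Integrate-and-Fire (CuBa-LIF) neuron.

    Each input spike adds weight w to an exponentially decaying synaptic
    current; the leaky membrane integrates that current, and crossing
    threshold emits an output spike and resets the membrane. Denser
    input spike rates therefore yield more output spikes.
    """
    spikes = {int(round(t / dt)) for t in in_times}
    i_syn = v = 0.0
    n_out = 0
    for k in range(int(t_max / dt)):
        i_syn *= math.exp(-dt / tau_syn)    # synaptic current decay
        if k in spikes:
            i_syn += w                      # weighted input spike
        v += (dt / tau_mem) * (i_syn - v)   # leaky membrane integration
        if v >= v_thresh:                   # threshold crossing: spike + reset
            n_out += 1
            v = 0.0
    return n_out
```

A 1 kHz input burst (one spike per millisecond for 20 ms) drives several output spikes, while a 100 Hz train never reaches threshold, which is the rate-coding behavior that the paper's rate-based spike encoding relies on.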
Problem

Research questions and friction points this paper is trying to address.

How to perform keyword spotting under the extreme low-power constraints of edge devices
How feedforward and recurrent Spiking Neural Network architectures compare on spatio-temporal signal classification
Whether Temporal Difference Encoder neurons can reduce synaptic operations without sacrificing accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

First use of the Temporal Difference Encoder neuron model for keyword spotting
Controlled comparison of three SNN architectures (feedforward TDE, feedforward CuBa-LIF, recurrent CuBa-LIF) with the same number of synaptic weights
Near-recurrent accuracy (89% vs. 91%) with 92% fewer synaptic operations
πŸ”Ž Similar Papers
No similar papers found.