🤖 AI Summary
Personalized keyword spotting (KWS) on ultra-low-power edge audio sensors (MCU + microphone) is hard to adapt after deployment because no labeled data is available. Method: This paper proposes a self-learning framework that incrementally fine-tunes a lightweight KWS model (e.g., DS-CNN) after deployment, starting from only a few user recordings. Newly recorded audio frames receive pseudo-labels based on their similarity to those recordings, and the pseudo-labeled frames are then used for fine-tuning. Contribution/Results: The always-on labeling task is demonstrated in real time on a sensor system with a low-power microphone and an energy-efficient MCU, exploiting the MCU's heterogeneous processing engines at an average power cost of up to 8.2 mW. On two public datasets, accuracy improves by up to 19.2% and 16.0% over models pretrained on a large set of generic keywords. The estimated on-device training energy is 10x lower than the labeling energy when a new utterance is sampled every 6.1 s (DS-CNN-S) or 18.8 s (DS-CNN-M), which points toward self-adaptive personalized KWS sensors at the extreme edge.
📝 Abstract
This paper proposes a self-learning method to incrementally train (fine-tune) a personalized Keyword Spotting (KWS) model after deployment on ultra-low-power smart audio sensors. We address the fundamental problem of the absence of labeled training data by assigning pseudo-labels to newly recorded audio frames based on a similarity score with respect to a few user recordings. By experimenting with multiple KWS models with up to 0.5M parameters on two public datasets, we show an accuracy improvement of up to +19.2% and +16.0% vs. the initial models pretrained on a large set of generic keywords. The labeling task is demonstrated on a sensor system composed of a low-power microphone and an energy-efficient Microcontroller (MCU). By efficiently exploiting the heterogeneous processing engines of the MCU, the always-on labeling task runs in real time with an average power cost of up to 8.2 mW. On the same platform, we estimate an energy cost for on-device training 10x lower than the labeling energy when sampling a new utterance every 6.1 s or 18.8 s with a DS-CNN-S or a DS-CNN-M model, respectively. Our empirical results pave the way to self-adaptive personalized KWS sensors at the extreme edge.
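The similarity-based pseudo-labeling idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract only states that pseudo-labels come from a similarity score with respect to a few user recordings. The use of cosine similarity, per-keyword mean prototypes, the two thresholds (`pos_threshold`, `neg_threshold`), the `"unknown"` negative class, and all function names are assumptions made here for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_embedding(embeddings):
    """Average a few user-recording embeddings into one prototype vector."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def pseudo_label(frame_embedding, prototypes, pos_threshold=0.8, neg_threshold=0.3):
    """Assign a pseudo-label to a new audio frame embedding.

    Assumed (hypothetical) rule:
      - similarity >= pos_threshold -> label with the closest keyword
      - similarity <= neg_threshold -> label as "unknown" (negative sample)
      - otherwise                   -> None (too ambiguous, frame is discarded)
    Returns (label_or_None, best_similarity).
    """
    best_keyword, best_score = None, -1.0
    for keyword, prototype in prototypes.items():
        score = cosine_similarity(frame_embedding, prototype)
        if score > best_score:
            best_keyword, best_score = keyword, score
    if best_score >= pos_threshold:
        return best_keyword, best_score
    if best_score <= neg_threshold:
        return "unknown", best_score
    return None, best_score


# Toy 2-D embeddings standing in for the output of a pretrained KWS feature extractor.
prototypes = {"lights_on": mean_embedding([[1.0, 0.0], [0.9, 0.1]])}
confident_label, _ = pseudo_label([1.0, 0.0], prototypes)
negative_label, _ = pseudo_label([0.0, 1.0], prototypes)
ambiguous_label, _ = pseudo_label([1.0, 1.0], prototypes)
```

In such a scheme, only frames that receive a label would be stored for incremental fine-tuning, while ambiguous frames are dropped to limit label noise. How the thresholds would be chosen is not specified in the abstract.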