Self-Learning for Personalized Keyword Spotting on Ultra-Low-Power Audio Sensors

📅 2024-08-22

🏛️ IEEE Internet of Things Journal

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Personalized keyword spotting (KWS) on ultra-low-power edge audio sensors (MCU + microphone) is hard to adapt after deployment because no labeled data is available. Method: This paper proposes a self-learning framework that incrementally fine-tunes lightweight KWS models (up to 0.5M parameters) after deployment. Newly recorded audio frames receive pseudo-labels based on a similarity score with respect to a few user recordings. Contribution/Results: The always-on labeling task is demonstrated in real time on a sensor system that pairs a low-power microphone with an energy-efficient MCU, exploiting the MCU's heterogeneous processing engines at an average power cost of up to 8.2 mW. On two public datasets, accuracy improves by up to 19.2% and 16.0% over models pretrained on a large set of generic keywords. On the same platform, the estimated energy cost of on-device training is 10x lower than the labeling energy when a new utterance is sampled every 6.1 s (DS-CNN-S) or 18.8 s (DS-CNN-M).

📝 Abstract

This paper proposes a self-learning method to incrementally train (fine-tune) a personalized Keyword Spotting (KWS) model after deployment on ultra-low-power smart audio sensors. We address the fundamental problem of the absence of labeled training data by assigning pseudo-labels to newly recorded audio frames based on a similarity score with respect to a few user recordings. By experimenting with multiple KWS models with up to 0.5M parameters on two public datasets, we show an accuracy improvement of up to +19.2% and +16.0% vs. the initial models pretrained on a large set of generic keywords. The labeling task is demonstrated on a sensor system composed of a low-power microphone and an energy-efficient Microcontroller (MCU). By efficiently exploiting the heterogeneous processing engines of the MCU, the always-on labeling task runs in real time with an average power cost of up to 8.2 mW. On the same platform, we estimate an energy cost for on-device training 10x lower than the labeling energy if sampling a new utterance every 6.1 s or 18.8 s with a DS-CNN-S or a DS-CNN-M model, respectively. Our empirical results pave the way to self-adaptive personalized KWS sensors at the extreme edge.
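The energy comparison in the abstract can be made concrete with a back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes the 8.2 mW upper bound on labeling power applies to both models (the abstract does not give per-model power), and the function names are hypothetical. It only shows how a "10x lower" training budget follows from power multiplied by sampling interval.

```python
def labeling_energy_mj(avg_power_mw, interval_s):
    """Energy spent on always-on labeling over one sampling interval (mW * s = mJ)."""
    return avg_power_mw * interval_s


def training_budget_mj(avg_power_mw, interval_s, ratio=10.0):
    """Training energy per sampled utterance if it is `ratio` times lower than labeling."""
    return labeling_energy_mj(avg_power_mw, interval_s) / ratio


# Assumption: the 8.2 mW figure is an upper bound reused here for both models.
LABELING_POWER_MW = 8.2

for model, interval_s in (("DS-CNN-S", 6.1), ("DS-CNN-M", 18.8)):
    label_mj = labeling_energy_mj(LABELING_POWER_MW, interval_s)
    train_mj = training_budget_mj(LABELING_POWER_MW, interval_s)
    print(f"{model}: labeling <= {label_mj:.1f} mJ, training budget <= {train_mj:.1f} mJ")
```

Under these assumptions the labeling energy per interval is at most roughly 50 mJ for DS-CNN-S and 154 mJ for DS-CNN-M, so the implied training budgets are upper bounds rather than measured values.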

Problem

Research questions and friction points this paper is trying to address.

Develops self-learning for personalized keyword spotting on low-power sensors.

Addresses the lack of labeled data by assigning pseudo-labels to newly recorded audio based on its similarity to a few user recordings.

Runs the always-on labeling task in real time on an MCU at an average power cost of up to 8.2 mW.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-learning method for personalized KWS models

Pseudo-labeling based on user audio similarity

Ultra-low-power real-time labeling on MCU
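The similarity-based pseudo-labeling listed above can be illustrated with a minimal sketch. The abstract states only that pseudo-labels come from a similarity score with respect to a few user recordings; the cosine similarity, the two thresholds, the "unknown" class, and all names below are assumptions made for illustration, not the authors' method.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pseudo_label(frame_embedding, user_embeddings, pos_threshold=0.8, neg_threshold=0.3):
    """Return (label, score) for a new frame, or (None, score) if it is too ambiguous.

    user_embeddings maps each keyword to the embeddings of a few user recordings.
    Both thresholds are hypothetical values chosen for this example.
    """
    best_keyword, best_score = None, -1.0
    for keyword, examples in user_embeddings.items():
        for example in examples:
            score = cosine_similarity(frame_embedding, example)
            if score > best_score:
                best_keyword, best_score = keyword, score
    if best_score >= pos_threshold:
        return best_keyword, best_score   # confident match: keep as a positive sample
    if best_score <= neg_threshold:
        return "unknown", best_score      # confident non-match: keep as a negative sample
    return None, best_score               # ambiguous: do not use for fine-tuning


# Toy 3-dimensional embeddings; a real system would use features from the KWS model.
user_embeddings = {"lights_on": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]}
print(pseudo_label([0.95, 0.05, 0.0], user_embeddings)[0])  # close to the user recordings
print(pseudo_label([0.0, 0.0, 1.0], user_embeddings)[0])    # orthogonal to them
print(pseudo_label([1.0, 1.0, 0.0], user_embeddings)[0])    # in between
```

Discarding ambiguous frames is one common way to limit label noise in self-training; whether the paper uses this exact rule is not stated in the abstract.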
