🤖 AI Summary
To address the challenges of deploying keyword spotting (KWS) on memory- and energy-constrained embedded microcontrollers (MCUs), this paper proposes an end-to-end co-optimization framework, the first to systematically integrate MFCC feature extraction, a lightweight neural network architecture, and MCU hardware acceleration. The authors design TKWS, a dedicated model built on MobileNet with residual connections, drawing on insights from DS-CNN and LiCoNet, and deploy it fully across STM32 N6, H7, and U5 platforms. With only 14.4k parameters, TKWS achieves a 92.4% F1-score on standard benchmarks. On the N6, the integrated neural accelerator yields the best energy-delay product and enables low-latency inference even with high-resolution MFCC features. The key contribution is identifying and quantifying the software-hardware co-design factors, beyond accuracy, that govern real-world KWS efficiency, establishing a reproducible, ultra-lightweight paradigm for edge KWS deployment.
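The summary's 14.4k-parameter figure reflects the depthwise-separable convolutions that MobileNet-style and DS-CNN architectures rely on. As a minimal sketch (standard parameter-counting formulas, not the paper's exact layer configuration), compare a regular convolution against its depthwise-separable equivalent:

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias terms omitted)."""
    return c_in * c_out * k * k

def ds_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise-separable convolution:
    a k x k depthwise filter per input channel, then a 1x1 pointwise conv."""
    return c_in * k * k + c_in * c_out

# Hypothetical layer shape for illustration: 64 -> 64 channels, 3x3 kernel.
standard = conv_params(64, 64, 3)   # 36,864 weights
separable = ds_conv_params(64, 64, 3)  # 4,672 weights, ~8x fewer
```

This roughly 8x reduction per layer is what lets such models fit in MCU flash and SRAM budgets.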
📝 Abstract
Keyword spotting (KWS) is a key enabling technology for hands-free interaction in embedded and IoT devices, where stringent memory and energy constraints challenge the deployment of AI-enabled models. In this work, we systematically evaluate and compare several state-of-the-art lightweight neural network architectures, including DS-CNN, LiCoNet, and TENet, alongside our proposed Typman-KWS (TKWS) architecture built upon MobileNet, specifically designed for efficient KWS on microcontroller units (MCUs). Unlike prior studies focused solely on model inference, our analysis encompasses the entire processing pipeline, from Mel-Frequency Cepstral Coefficient (MFCC) feature extraction to neural inference, and is benchmarked across three STM32 platforms (N6, H7, and U5). Our results show that TKWS with three residual blocks achieves up to 92.4% F1-score with only 14.4k parameters, reducing memory footprint without compromising accuracy. Moreover, the N6 MCU with integrated neural acceleration achieves the best energy-delay product (EDP), enabling efficient, low-latency operation even with high-resolution features. Our findings highlight that model accuracy alone does not determine real-world effectiveness; rather, optimal keyword spotting deployments require careful consideration of feature extraction parameters and hardware-specific optimization.
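The abstract ranks platforms by energy-delay product, which jointly penalizes energy per inference and latency. A minimal sketch of the metric follows; the per-platform numbers are hypothetical placeholders for illustration only, not measurements from the paper:

```python
def edp(energy_mj: float, latency_ms: float) -> float:
    """Energy-delay product: energy per inference (mJ) x latency (ms).
    Lower is better; it rewards being both fast and frugal."""
    return energy_mj * latency_ms

# Hypothetical measurements (NOT from the paper), shaped to show why an
# NPU-equipped part can win on EDP even against a lower-power core.
platforms = {
    "STM32N6 (NPU)": edp(0.5, 2.0),   # accelerator: low energy AND low latency
    "STM32H7":       edp(1.2, 15.0),  # fast CPU, no accelerator
    "STM32U5":       edp(0.8, 40.0),  # low-power core, but slow inference
}
best = min(platforms, key=platforms.get)
```

Because EDP multiplies the two costs, a platform that halves latency without raising energy halves its EDP, which is why hardware acceleration dominates this metric in the paper's comparison.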