🤖 AI Summary
To address the performance limitations of keyword spotting (KWS) on resource-constrained smart devices, this paper introduces the Kolmogorov–Arnold Network (KAN) to speech-based KWS for the first time, proposing a modeling framework that integrates KAN with a lightweight 1D CNN. The method leverages KAN's expressivity for low-dimensional, high-level semantic features while exploiting the CNN's efficiency at capturing local time-frequency patterns. The authors design multiple learnable ensemble strategies and perform end-to-end training on standard acoustic features such as MFCCs and log-Mel spectrograms. Evaluated on mainstream KWS benchmarks (e.g., Google Speech Commands), the proposed model significantly outperforms pure CNN baselines, achieving absolute accuracy gains of 2.3–4.1% and demonstrating improved robustness. This work validates KAN's effectiveness for lightweight, temporal speech modeling in edge-deployable KWS systems.
📝 Abstract
Keyword spotting (KWS) is an important speech processing component for smart devices with voice assistance capability. In this paper, we investigate whether Kolmogorov–Arnold Networks (KAN) can be used to enhance the performance of KWS. We explore various approaches to integrating KAN into a model architecture based on 1D Convolutional Neural Networks (CNN). We find that KAN is effective at modeling high-level features in lower-dimensional spaces, yielding improved KWS performance when integrated appropriately. These findings shed light on KAN for speech processing tasks and may inform its application to other modalities in future research.
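A KAN layer replaces fixed activations with a learnable univariate function on each input-output edge, which is what lets it model low-dimensional high-level features compactly. Below is a minimal sketch of one such layer's forward pass, using Gaussian radial bases as a stand-in for the B-splines of the original KAN formulation; all names, shapes, and the choice of basis are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def kan_layer_forward(x, centers, coeffs, sigma=0.5):
    """Forward pass of a simplified KAN layer.

    x:       (batch, d_in) input features, e.g. pooled 1D-CNN activations
    centers: (num_basis,) fixed centers of the radial basis functions
    coeffs:  (d_in, d_out, num_basis) learnable weights defining the
             univariate function on each (input, output) edge
    """
    # Expand each scalar input into a set of radial basis responses:
    # basis has shape (batch, d_in, num_basis)
    basis = np.exp(-((x[..., None] - centers) ** 2) / (2 * sigma**2))
    # Each edge (i, j) applies its own learned univariate function
    # sum_k coeffs[i, j, k] * B_k(x_i); outputs sum over input dims.
    return np.einsum("bik,iok->bo", basis, coeffs)

# Toy usage with hypothetical dimensions (8 CNN features -> 3 keyword logits)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
centers = np.linspace(-2.0, 2.0, 6)
coeffs = 0.1 * rng.normal(size=(8, 3, 6))
y = kan_layer_forward(x, centers, coeffs)
```

In a hybrid architecture like the one described, such a layer would sit after the 1D CNN backbone, where the feature dimension is small enough that the per-edge functions stay cheap.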