🤖 AI Summary
This work addresses a limitation of traditional open-vocabulary keyword spotting: reliance on fixed-dimensional embeddings, which struggle to balance representational capacity against efficiency. To overcome this, we propose MATE, a framework that, to our knowledge, is the first to bring Matryoshka embeddings to this task. MATE encodes multi-granular audio-text representations within a single vector via nested sub-embeddings (prefixes), and employs a PCA-guided prefix alignment strategy so that lower-dimensional prefixes capture the core semantic cues of keywords while higher-dimensional ones enrich fine-grained details. Built on a dual-encoder architecture trained with deep metric learning, MATE learns multi-granular embeddings without additional inference overhead, supporting flexible open-vocabulary triggering. The method achieves state-of-the-art performance on the WSJ and LibriPhrase datasets.
📝 Abstract
Open-vocabulary keyword spotting (KWS) with text-based enrollment has emerged as a flexible alternative to fixed-phrase triggers. From an embedding-learning standpoint, prior utterance-level matching methods learn embeddings at a single fixed dimensionality. We depart from this design and propose Matryoshka Audio-Text Embeddings (MATE), a dual-encoder framework that encodes multiple embedding granularities within a single vector via nested sub-embeddings ("prefixes"). Specifically, we introduce a PCA-guided prefix alignment: for each prefix size, a PCA-compressed version of the full text embedding serves as a teacher target to which both the audio and text prefixes are aligned. This alignment concentrates salient keyword cues in the lower-dimensional prefixes, while higher dimensions add fine-grained detail. MATE is trained with standard deep metric learning objectives for audio-text KWS and is loss-agnostic. To our knowledge, this is the first application of matryoshka-style embeddings to KWS, achieving state-of-the-art results on WSJ and LibriPhrase without additional inference overhead.