MATE: Matryoshka Audio-Text Embeddings for Open-Vocabulary Keyword Spotting

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of traditional open-vocabulary keyword spotting, which relies on fixed-dimensional embeddings that struggle to balance representational capacity and efficiency. To overcome this, we propose MATE, a novel framework that introduces Matryoshka embeddings to this task for the first time. MATE encodes multi-granular audio-text representations within a single vector via nested sub-embeddings (prefixes), and employs a PCA-guided prefix alignment strategy to ensure that lower-dimensional prefixes capture core semantic cues of keywords while higher-dimensional ones enrich fine-grained details. Built upon a dual-encoder architecture with deep metric learning, MATE enables multi-granular embedding learning without additional inference overhead, supporting flexible open-vocabulary triggering. The method achieves state-of-the-art performance on the WSJ and LibriPhrase datasets.

📝 Abstract
Open-vocabulary keyword spotting (KWS) with text-based enrollment has emerged as a flexible alternative to fixed-phrase triggers. From an embedding-learning standpoint, prior utterance-level matching methods learn embeddings at a single fixed dimensionality. We depart from this design and propose Matryoshka Audio-Text Embeddings (MATE), a dual-encoder framework that encodes multiple embedding granularities within a single vector via nested sub-embeddings ("prefixes"). Specifically, we introduce a PCA-guided prefix alignment: PCA-compressed versions of the full text embedding at each prefix size serve as teacher targets to align both audio and text prefixes. This alignment concentrates salient keyword cues in the lower-dimensional prefixes, while higher dimensions add detail. MATE is trained with standard deep metric learning objectives for audio-text KWS, and is loss-agnostic. To our knowledge, this is the first application of matryoshka-style embeddings to KWS, achieving state-of-the-art results on WSJ and LibriPhrase without any inference overhead.
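The PCA-guided prefix alignment described in the abstract can be sketched as follows. This is a minimal illustrative reconstruction under stated assumptions, not the authors' implementation: the function names, per-batch PCA, prefix sizes, and the cosine-based alignment loss are all assumptions made for the example.

```python
import numpy as np

def pca_targets(text_emb, prefix_sizes):
    # Compress the full-dimensional text embeddings to each prefix size via PCA.
    # In MATE, such PCA-compressed vectors serve as teacher targets; computing
    # PCA on a single batch here is an illustrative simplification.
    X = text_emb - text_emb.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)  # rows of Vt: principal axes
    return {k: X @ Vt[:k].T for k in prefix_sizes}    # (N, k) teacher per size

def prefix_alignment_loss(audio_emb, text_emb, prefix_sizes):
    # Pull each nested audio/text prefix toward its PCA teacher (cosine alignment,
    # an assumed choice of distance for this sketch).
    targets = pca_targets(text_emb, prefix_sizes)
    loss, terms = 0.0, 0
    for k in prefix_sizes:
        t = targets[k]
        for emb in (audio_emb, text_emb):
            p = emb[:, :k]  # nested "prefix" sub-embedding of size k
            cos = (p * t).sum(axis=1) / (
                np.linalg.norm(p, axis=1) * np.linalg.norm(t, axis=1) + 1e-8
            )
            loss += (1.0 - cos).mean()
            terms += 1
    return loss / terms
```

Because the abstract describes the method as loss-agnostic, a training setup would add this alignment term to whatever standard deep metric learning objective is used on the full-dimensional audio-text embeddings.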
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary keyword spotting
embedding dimensionality
audio-text matching
keyword detection
matryoshka embeddings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Matryoshka embeddings
open-vocabulary keyword spotting
audio-text embedding
PCA-guided alignment
dual-encoder framework
Youngmoon Jung
Samsung Research
Deep learning · speaker recognition · voice activity detection · speech enhancement · speech synthesis
Myunghun Jung
Samsung Research
Deep Metric Learning · Keyword Spotting · Speech Recognition
Joon-Young Yang
AI Solution Team, Samsung Research, Seoul, South Korea
Yong-Hyeok Lee
AI Solution Team, Samsung Research, Seoul, South Korea
Jaeyoung Roh
AI Solution Team, Samsung Research, Seoul, South Korea
Hoon-Young Cho
AI Solution Team, Samsung Research, Seoul, South Korea