Assessing the Impact of Anisotropy in Neural Representations of Speech: A Case Study on Keyword Spotting

📅 2025-06-06
🤖 AI Summary
This work investigates how the strong anisotropy of speech pre-trained representations (wav2vec 2.0, HuBERT) affects unsupervised cross-speaker keyword spotting (KWS), motivated by text-free retrieval needs in computational linguistics. Method: We propose the first systematic framework to evaluate the role of representation anisotropy in unsupervised keyword localization, integrating dynamic time warping (DTW) with representation similarity matching. Contribution/Results: Experiments show that despite pronounced anisotropy, these models robustly encode phonemic structure and speaker-invariant features. Notably, wav2vec 2.0’s cosine similarity enables direct, annotation-free keyword localization in unseen speech, exhibiting strong generalization and robustness. Crucially, anisotropy is not a fundamental bottleneck for downstream KWS performance. Our findings advance understanding of the geometric properties of speech representations and their task-specific adaptability, offering new insights into representation geometry–task alignment in spoken language processing.

📝 Abstract
Pretrained speech representations like wav2vec2 and HuBERT exhibit strong anisotropy, leading to high similarity between random embeddings. While widely observed, the impact of this property on downstream tasks remains unclear. This work evaluates anisotropy in keyword spotting for computational documentary linguistics. Using Dynamic Time Warping, we show that despite anisotropy, wav2vec2 similarity measures effectively identify words without transcription. Our results highlight the robustness of these representations, which capture phonetic structures and generalize across speakers. Our results underscore the importance of pretraining in learning rich and invariant speech representations.
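The abstract characterizes anisotropy as high similarity between random embeddings. One common way to quantify this is the mean cosine similarity over randomly drawn pairs of vectors. The sketch below illustrates that measurement on synthetic data only; the function name `mean_pairwise_cosine`, the sample sizes, and the constant offset used to simulate anisotropy are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def mean_pairwise_cosine(embeddings, n_pairs=10_000, seed=0):
    """Mean cosine similarity between randomly drawn pairs of distinct rows."""
    rng = np.random.default_rng(seed)
    x = np.asarray(embeddings, dtype=float)
    # Normalize rows so that a dot product equals a cosine similarity.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    i = rng.integers(0, len(x), size=n_pairs)
    j = rng.integers(0, len(x), size=n_pairs)
    keep = i != j  # drop pairs that compare a vector with itself
    return float(np.mean(np.sum(x[i[keep]] * x[j[keep]], axis=1)))


# Synthetic stand-ins for frame embeddings (assumption: no real model is used).
rng = np.random.default_rng(0)
isotropic = rng.standard_normal((2000, 64))
# A shared offset pushes every vector into the same narrow cone.
anisotropic = isotropic + 5.0

print("isotropic:  ", mean_pairwise_cosine(isotropic))
print("anisotropic:", mean_pairwise_cosine(anisotropic))
```

A value near 0 indicates directions spread evenly over the space, while a value near 1 indicates that most vectors point the same way, which is the situation the abstract describes for wav2vec2 and HuBERT.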
Problem

Research questions and friction points this paper is trying to address.

Evaluates anisotropy impact on keyword spotting tasks
Assesses wav2vec2 similarity for word identification without transcription
Explores pretraining role in robust speech representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates anisotropy in keyword spotting
Uses Dynamic Time Warping for similarity
Leverages pretrained wav2vec2 for robustness
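To make the combination of Dynamic Time Warping and cosine similarity concrete, here is a minimal sketch of textbook subsequence DTW that locates a query word inside a longer utterance. It is not the authors' implementation: the function names, the step pattern, the length normalization, and the synthetic frames (random vectors with a shared offset to mimic anisotropy) are all assumptions. In practice the frames would come from a pretrained model such as wav2vec2.

```python
import numpy as np


def cosine_distance_matrix(query, utterance):
    """Pairwise cosine distance between query frames and utterance frames."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    u = utterance / np.linalg.norm(utterance, axis=1, keepdims=True)
    return 1.0 - q @ u.T


def subsequence_dtw(query, utterance):
    """Return (start, end, cost): utterance[start:end] best matches the query."""
    cost = cosine_distance_matrix(query, utterance)
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, :] = 0.0  # the match may begin at any utterance frame
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    end = int(np.argmin(acc[n, 1:])) + 1  # the match may end at any frame
    # Backtrack along the cheapest predecessors to recover the start frame.
    i, j = n, end
    start = j
    while i > 0:
        start = j
        candidates = ((i - 1, j - 1), (i - 1, j), (i, j - 1))
        i, j = min(candidates, key=lambda c: acc[c])
    return start - 1, end, float(acc[n, end] / n)


# Synthetic, deliberately anisotropic frames (assumption: toy data only).
rng = np.random.default_rng(0)
utterance = rng.standard_normal((50, 16)) + 3.0
query = utterance[20:30].copy()  # the "keyword" is a slice of the utterance

start, end, cost = subsequence_dtw(query, utterance)
print(start, end, cost)
```

The returned cost is the accumulated cosine distance divided by the query length, so a threshold on it could serve as a detection score. The threshold itself would need to be tuned on real data.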
Guillaume Wisniewski
Université de Paris
Statistical Machine Learning, Parsing, Machine Translation, Natural Language Processing
Séverine Guillaume
LACITO, CNRS, Université Sorbonne Nouvelle, F-94800 Villejuif, France
Clara Rosina Fernández
LACITO, CNRS, Université Sorbonne Nouvelle, F-94800 Villejuif, France