Eta-WavLM: Efficient Speaker Identity Removal in Self-Supervised Speech Representations Using a Simple Linear Equation

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing self-supervised speech representations (e.g., WavLM) struggle to fully disentangle speaker identity from linguistic content, which degrades performance on content-driven downstream tasks. To address this, the paper proposes a lightweight, interpretable framework that linearly decomposes WavLM features into speaker-dependent and speaker-independent components, using only a simple linear equation. Crucially, the method adds no architectural complexity and requires no auxiliary annotations, performing exact, purely linear speaker disentanglement. Evaluated on voice conversion, it substantially surpasses state-of-the-art methods: speaker similarity to the source speaker decreases by 62%, while speech quality (MOS) and content accuracy (WER) both improve significantly. Inference overhead is negligible.

📝 Abstract
Self-supervised learning (SSL) has reduced the reliance on expensive labeling in speech technologies by learning meaningful representations from unannotated data. Since most SSL-based downstream tasks prioritize content information in speech, ideal representations should disentangle content from unwanted variations like speaker characteristics in the SSL representations. However, removing speaker information often degrades other speech components, and existing methods either fail to fully disentangle speaker identity or require resource-intensive models. In this paper, we propose a novel disentanglement method that linearly decomposes SSL representations into speaker-specific and speaker-independent components, effectively generating speaker disentangled representations. Comprehensive experiments show that our approach achieves speaker independence and as such, when applied to content-driven tasks such as voice conversion, our representations yield significant improvements over state-of-the-art methods.
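The abstract describes linearly decomposing SSL representations into speaker-specific and speaker-independent components. A minimal sketch of one way such a linear decomposition could look: fit, by ordinary least squares, the part of each SSL feature vector that is linearly predictable from a speaker embedding, then subtract it. The function names, shapes, and the use of NumPy's `lstsq` here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def fit_speaker_projection(features, speaker_embs):
    """Fit W, b so that features ~= speaker_embs @ W + b (least squares).

    features:     (N, d) SSL feature vectors (hypothetical shapes).
    speaker_embs: (N, k) speaker embeddings for the same frames/utterances.
    """
    # Append a constant column so the bias b is fitted jointly with W.
    X = np.hstack([speaker_embs, np.ones((speaker_embs.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X, features, rcond=None)
    W, b = coef[:-1], coef[-1]
    return W, b

def remove_speaker(features, speaker_embs, W, b):
    """Speaker-independent residual: features minus the fitted speaker part."""
    return features - (speaker_embs @ W + b)
```

In this sketch, if the features were exactly a linear function of the speaker embeddings, the residual would be zero; in practice the residual retains whatever (e.g., content) information is not linearly predictable from the speaker embedding.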
Problem

Research questions and friction points this paper is trying to address.

Remove speaker identity from speech representations efficiently
Disentangle content from speaker characteristics in SSL models
Improve performance in content-driven tasks like voice conversion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear decomposition of SSL representations into speaker-specific and speaker-independent components
Effective speaker identity removal without resource-intensive models
Improves performance on content-driven tasks such as voice conversion