🤖 AI Summary
To address the limited robustness of discrete speech representations for automatic speech recognition (ASR) in noisy environments, this paper proposes explicitly disentangling semantic content from noise in the latent space. Keeping the Whisper encoder frozen, the model separates discrete codebook tokens, which carry semantic content, from interpretable quantized residual noise vectors, and a lightweight noise classifier supervises the disentanglement. This yields an interpretable, semantically grounded separation of linguistic content from background noise, improving text alignment accuracy and noise robustness. On the VBDemand test set, the method reduces word error rate by 82% relative to the Whisper baseline and outperforms existing disentanglement approaches by 35%. It also generalizes well across both seen and unseen acoustic conditions.
📝 Abstract
Discrete audio representations are gaining traction in speech modeling due to their interpretability and compatibility with large language models, but they are not always optimized for noisy or real-world environments. Building on existing work that quantizes Whisper embeddings for speech-to-unit modeling, we propose disentangling semantic speech content from background noise in the latent space. Our end-to-end model separates clean speech in the form of codebook tokens, while extracting interpretable noise vectors as the quantization residue, which are supervised via a lightweight classifier. We show that our approach improves alignment between clean/noisy speech and text, produces speech tokens that display a high degree of noise invariance, and improves ASR performance. Keeping Whisper frozen, we show an 82% reduction in error rate compared to Whisper, and a 35% improvement over baseline methods on the VBDemand test set. Further analyses show that the learned token space generalizes well to both seen and unseen acoustic conditions.
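The core mechanism the abstract describes, quantizing a frozen encoder's embedding to a codebook token (semantic content) and treating the quantization residue as a noise vector fed to a lightweight classifier, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the codebook size, embedding dimension, number of noise classes, and the random (untrained) codebook and classifier weights are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): 8 codebook entries, 16-dim embeddings.
K, D = 8, 16
codebook = rng.normal(size=(K, D))  # stands in for a learned semantic codebook

def disentangle(embedding: np.ndarray):
    """Quantize an encoder embedding to its nearest codebook entry
    (the discrete semantic token) and return the quantization residue,
    which plays the role of the interpretable noise vector."""
    dists = np.linalg.norm(codebook - embedding, axis=1)
    token = int(np.argmin(dists))           # discrete semantic token
    residual = embedding - codebook[token]  # noise vector (quantization residue)
    return token, residual

# A lightweight linear classifier over the residual, standing in for the
# paper's noise classifier; 4 noise classes are a hypothetical choice.
W = rng.normal(size=(D, 4))

def classify_noise(residual: np.ndarray) -> int:
    return int(np.argmax(residual @ W))

emb = rng.normal(size=D)                    # stands in for a Whisper embedding
token, residual = disentangle(emb)
noise_class = classify_noise(residual)
# By construction, token code + residual reconstructs the embedding exactly.
assert np.allclose(codebook[token] + residual, emb)
```

In a trained system the codebook and classifier would be learned jointly so that the token captures linguistic content and the residue captures acoustic noise; the sketch only shows the decomposition itself.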