kNN For Whisper And Its Effect On Bias And Speaker Adaptation

📅 2024-10-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Speech recognition models exhibit uneven performance across languages, domains, and speaker attributes (e.g., accent, gender, age), while fine-tuning often induces catastrophic forgetting. To address this, we propose a training-free, inference-time adaptation method: the first application of token-level k-nearest-neighbor (kNN) retrieval to end-to-end speech recognition. During Whisper's decoding phase, our approach retrieves from an external key-value datastore of speech features and fuses the retrieved distribution with the model's predictions, enabling speaker-aware, non-parametric adaptation. It incorporates voice-feature alignment for retrieval and speaker-attribute-aware grouping for evaluation. Experiments demonstrate substantial improvements for underrepresented speaker groups, reducing word error rate (WER) by over 30% for minority accents, genders, and age groups, and effectively mitigating systemic bias without degrading general-domain performance. The method preserves model robustness and fairness, establishing a novel paradigm for equitable, adaptable speech recognition.
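The fusion step described above follows the standard kNN-LM recipe: retrieved neighbors are turned into a token distribution via a softmax over negative distances, then interpolated with the decoder's own distribution. A minimal sketch in plain Python, where the interpolation weight `lam` and temperature `temp` are illustrative defaults, not values from the paper:

```python
import math

def knn_interpolate(model_probs, neighbors, lam=0.5, temp=10.0):
    """Blend the base decoder's token distribution with a kNN distribution.

    model_probs: dict mapping token -> probability from the base decoder
    neighbors: list of (distance, token) pairs retrieved from the datastore
    lam: kNN interpolation weight (hypothetical default, tuned in practice)
    temp: softmax temperature over negative distances (hypothetical default)
    """
    # Softmax over negative distances, aggregated per retrieved token.
    weights = [math.exp(-d / temp) for d, _ in neighbors]
    z = sum(weights)
    knn_probs = {}
    for w, (_, tok) in zip(weights, neighbors):
        knn_probs[tok] = knn_probs.get(tok, 0.0) + w / z
    # p(token) = lam * p_knn(token) + (1 - lam) * p_model(token)
    return {tok: lam * knn_probs.get(tok, 0.0) + (1 - lam) * p
            for tok, p in model_probs.items()}
```

Because no parameters are updated, the adaptation can be switched off simply by setting `lam = 0`, recovering the base model exactly.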

📝 Abstract
Speech recognition performance varies by language, domain, and speaker characteristics such as accent, but fine-tuning a model on any of these categories may lead to catastrophic forgetting. Token-level $k$ nearest neighbor search ($k$NN), first proposed for neural sequence decoders for natural language generation (NLG) and machine translation (MT), is a non-parametric method that instead adapts using inference-time search in an external datastore, without training the underlying model. We show that Whisper, a transformer end-to-end speech model, benefits from $k$NN. We investigate the differences between the speech and text setups. We discuss implications for speaker adaptation, and analyze improvements by gender, accent, and age.
Problem

Research questions and friction points this paper is trying to address.

Uneven speaker adaptation in speech recognition
Bias in speech recognition models against underrepresented groups
Performance gaps across diverse speaker characteristics
Innovation

Methods, ideas, or system contributions that make the work stand out.

kNN for speech adaptation
External datastore without training
Improves Whisper model performance
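The "external datastore without training" idea amounts to caching decoder hidden states as keys and the tokens emitted at those states as values, then searching that store at inference time. A minimal brute-force sketch under assumed names (a real system would use an approximate index such as FAISS):

```python
import math

def build_datastore(examples):
    """Split (hidden_state, next_token) pairs into parallel key/value lists.

    examples: list of (vector, token) pairs; vectors stand in for the
    decoder's hidden states, tokens for the references emitted there.
    Names and layout are illustrative, not the paper's code.
    """
    keys = [vec for vec, _ in examples]
    values = [tok for _, tok in examples]
    return keys, values

def retrieve(query, keys, values, k=3):
    """Brute-force L2 nearest-neighbor search over the datastore."""
    scored = []
    for key, val in zip(keys, values):
        dist = math.sqrt(sum((q - x) ** 2 for q, x in zip(query, key)))
        scored.append((dist, val))
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]  # (distance, token) pairs, nearest first
```

Since the store is just key/value pairs, adapting to a new speaker means appending that speaker's examples; the underlying Whisper weights never change.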
Maya K. Nachesa
Language Technology Lab, University of Amsterdam
Vlad Niculae
University of Amsterdam
Structured Prediction · Natural Language Processing · Machine Learning