🤖 AI Summary
Speech recognition models exhibit uneven performance across languages, domains, and speaker attributes (e.g., accent, gender, age), and fine-tuning on any one of these often induces catastrophic forgetting. To address this, we propose a training-free, inference-time adaptation method: the first application of token-level $k$-nearest-neighbor ($k$NN) retrieval to end-to-end speech recognition. During Whisper's decoding, the approach retrieves from an external key-value datastore of speech features and fuses the retrieved token distribution with the model's predictions, enabling speaker-aware, non-parametric adaptation. It incorporates voice-feature alignment for retrieval and speaker-attribute-aware grouping for evaluation. Experiments demonstrate substantial improvements for underrepresented speaker groups, reducing word error rate (WER) by over 30% for minority accents, genders, and age groups, and thereby mitigating systemic bias without degrading general-domain performance. The method preserves model robustness and fairness, establishing a novel paradigm for equitable, adaptable speech recognition.
📝 Abstract
Speech recognition performance varies by language, domain, and speaker characteristics such as accent, but fine-tuning a model on any of these categories may lead to catastrophic forgetting. Token-level $k$-nearest-neighbor search ($k$NN), first proposed for neural sequence decoders in natural language generation (NLG) and machine translation (MT), is a non-parametric method that instead adapts via inference-time search in an external datastore, without training the underlying model. We show that Whisper, a transformer-based end-to-end speech recognition model, benefits from $k$NN. We investigate the differences between the speech and text setups, discuss implications for speaker adaptation, and analyze improvements by gender, accent, and age.
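The core mechanism, adopted from the $k$NN-LM line of work, can be sketched in a few lines: at each decoding step, the current hidden state queries a datastore of (hidden-state key, token-id value) pairs, the retrieved neighbors are softmaxed over negative distances into a token distribution, and that distribution is linearly interpolated with the model's own prediction. The sketch below is illustrative only; function and parameter names (`knn_interpolate`, `lam`, `temperature`) are assumptions, not the paper's actual API, and a real system would use an approximate-search index rather than brute-force distances.

```python
import numpy as np

def knn_interpolate(model_logprobs, keys, values, query,
                    k=4, temperature=10.0, lam=0.5):
    """Minimal kNN-LM-style fusion sketch (names are illustrative).

    model_logprobs: (V,)  log-probabilities from the decoder at this step.
    keys:           (N, D) datastore keys (e.g., decoder hidden states).
    values:         (N,)   token ids paired with each key.
    query:          (D,)   hidden state at the current decoding step.
    Returns a fused probability distribution over the vocabulary.
    """
    # Brute-force squared L2 distance from the query to every key.
    dists = np.sum((keys - query) ** 2, axis=1)
    nn = np.argsort(dists)[:k]                # indices of the k nearest keys
    # Softmax over negative distances gives neighbor weights.
    w = np.exp(-dists[nn] / temperature)
    w /= w.sum()
    # Scatter neighbor weights onto their vocabulary entries
    # (np.add.at accumulates when several neighbors share a token).
    p_knn = np.zeros_like(model_logprobs)
    np.add.at(p_knn, values[nn], w)
    # Linear interpolation of kNN and model distributions.
    return lam * p_knn + (1 - lam) * np.exp(model_logprobs)
```

In the speech setting described above, the keys would be built from Whisper decoder states over the external adaptation data, so retrieval naturally favors entries from acoustically similar (e.g., same-accent) speakers.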