🤖 AI Summary
Speech recognition models exhibit uneven performance across languages, domains, and speaker attributes (e.g., accent, gender, age), and fine-tuning on any one of these often induces catastrophic forgetting. To address this, we propose a training-free, inference-time adaptation method: the first application of token-level $k$-nearest-neighbor ($k$NN) retrieval to end-to-end speech recognition. During Whisper's decoding, the approach retrieves from an external key-value datastore of speech features and fuses the retrieved token distribution with the model's predictions, enabling speaker-aware, non-parametric adaptation. It incorporates voice-feature alignment for retrieval and speaker-attribute-aware grouping for evaluation. Experiments demonstrate substantial improvements for underrepresented speaker groups, reducing word error rate (WER) by over 30% for minority accents, genders, and age groups, and thereby mitigating systemic bias without degrading general-domain performance. The method preserves model robustness and fairness, establishing a novel paradigm for equitable, adaptable speech recognition.
📝 Abstract
Speech recognition performance varies by language, domain, and speaker characteristics such as accent, but fine-tuning a model on any of these categories may lead to catastrophic forgetting. Token-level $k$-nearest-neighbor search ($k$NN), first proposed for neural sequence decoders in natural language generation (NLG) and machine translation (MT), is a non-parametric method that instead adapts via inference-time search in an external datastore, without training the underlying model. We show that Whisper, a transformer-based end-to-end speech recognition model, benefits from $k$NN. We investigate the differences between the speech and text setups, discuss implications for speaker adaptation, and analyze improvements by gender, accent, and age.
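The core mechanism, adopted from the $k$NN-LM line of work, can be sketched in a few lines: at each decoding step, the current hidden state queries a datastore of (hidden-state key, token-id value) pairs, the retrieved neighbors are softmaxed over negative distances into a token distribution, and that distribution is linearly interpolated with the model's own prediction. The sketch below is illustrative only; function and parameter names (`knn_interpolate`, `lam`, `temperature`) are assumptions, not the paper's actual API, and a real system would use an approximate-search index rather than brute-force distances.

```python
import numpy as np

def knn_interpolate(model_logprobs, keys, values, query,
                    k=4, temperature=10.0, lam=0.5):
    """Minimal kNN-LM-style fusion sketch (names are illustrative).

    model_logprobs: (V,)  log-probabilities from the decoder at this step.
    keys:           (N, D) datastore keys (e.g., decoder hidden states).
    values:         (N,)   token ids paired with each key.
    query:          (D,)   hidden state at the current decoding step.
    Returns a fused probability distribution over the vocabulary.
    """
    # Brute-force squared L2 distance from the query to every key.
    dists = np.sum((keys - query) ** 2, axis=1)
    nn = np.argsort(dists)[:k]                # indices of the k nearest keys
    # Softmax over negative distances gives neighbor weights.
    w = np.exp(-dists[nn] / temperature)
    w /= w.sum()
    # Scatter neighbor weights onto their vocabulary entries
    # (np.add.at accumulates when several neighbors share a token).
    p_knn = np.zeros_like(model_logprobs)
    np.add.at(p_knn, values[nn], w)
    # Linear interpolation of kNN and model distributions.
    return lam * p_knn + (1 - lam) * np.exp(model_logprobs)
```

In the speech setting described above, the keys would be built from Whisper decoder states over the external adaptation data, so retrieval naturally favors entries from acoustically similar (e.g., same-accent) speakers.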