🤖 AI Summary
Whisper's accuracy degrades on low-resource subdialects due to accent-induced phoneme confusions and syntactic mismatches. To address this, the authors propose M2R-whisper, a fine-tuning-free, multi-stage, multi-scale retrieval-augmented framework. In the preprocessing stage, sentence-level in-context learning (ICL) captures dialect-specific contextual patterns; in the postprocessing stage, token-level k-nearest-neighbor (kNN) retrieval refines the output probability distribution. A dual-scale coordination mechanism combines the sentence- and token-level retrieval evidence. The method requires no parameter updates, preserving full deployment compatibility with the original Whisper model. Evaluated on the Mandarin and subdialect benchmarks AISHELL-1 and KeSpeech, it achieves an average 18.7% relative reduction in word error rate (WER), markedly improving recognition robustness. This work establishes an efficient, lightweight, plug-and-play enhancement paradigm for low-resource dialect ASR.
📝 Abstract
State-of-the-art models like OpenAI's Whisper exhibit strong performance in multilingual automatic speech recognition (ASR), but they still face challenges in accurately recognizing diverse subdialects. In this paper, we propose M2R-whisper, a novel multi-stage and multi-scale retrieval augmentation approach designed to enhance ASR performance in low-resource settings. Building on the principles of in-context learning (ICL) and retrieval-augmented techniques, our method employs sentence-level ICL in the pre-processing stage to harness contextual information, while integrating token-level k-Nearest Neighbors (kNN) retrieval as a post-processing step to further refine the final output distribution. By synergistically combining sentence-level and token-level retrieval strategies, M2R-whisper effectively mitigates various types of recognition errors. Experiments conducted on Mandarin and subdialect datasets, including AISHELL-1 and KeSpeech, demonstrate substantial improvements in ASR accuracy, all achieved without any parameter updates.
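The token-level post-processing step can be pictured with a kNN-LM-style interpolation: cached decoder states act as datastore keys, the tokens emitted at those states act as values, and the retrieval distribution is mixed into the model's output distribution. The sketch below is illustrative only; the function name, hyperparameters (`k`, `temperature`, `lam`), and the use of L2 distance are assumptions, not details taken from the paper.

```python
import numpy as np

def knn_interpolate(p_model, query, keys, values, vocab_size,
                    k=4, temperature=10.0, lam=0.3):
    """Mix the ASR model's token distribution with a kNN retrieval
    distribution (kNN-LM style; all hyperparameters are illustrative)."""
    # L2 distances from the current decoder hidden state to all datastore keys.
    dists = np.linalg.norm(keys - query, axis=1)
    nn = np.argsort(dists)[:k]                 # indices of the k nearest neighbors
    # Softmax over negative distances -> retrieval weights.
    logits = -dists[nn] / temperature
    w = np.exp(logits - logits.max())
    w /= w.sum()
    # Scatter neighbor weights onto the tokens stored with those states.
    p_knn = np.zeros(vocab_size)
    for weight, tok in zip(w, values[nn]):
        p_knn[tok] += weight
    # Final distribution: convex combination of model and retrieval estimates.
    return lam * p_knn + (1.0 - lam) * p_model

# Toy example: 5-token vocabulary, 3-dim hidden states, tiny datastore.
rng = np.random.default_rng(0)
keys = rng.normal(size=(100, 3))               # cached decoder states
values = rng.integers(0, 5, size=100)          # tokens emitted at those states
p_model = np.full(5, 0.2)                      # uniform model distribution
p = knn_interpolate(p_model, keys[0], keys, values, vocab_size=5)
assert np.isclose(p.sum(), 1.0)
```

Because the datastore is built from dialect speech offline and only the output distribution is adjusted at decode time, no Whisper parameters change, matching the paper's zero-update claim.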