M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper

📅 2024-09-18
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Whisper exhibits limited accuracy in low-resource dialect ASR due to accent-induced phoneme confusions and syntactic misalignment. To address this, we propose a fine-tuning-free, multi-stage, multi-scale retrieval-augmented framework. In the preprocessing stage, sentence-level in-context learning (ICL) models dialect-specific contextual patterns; in the postprocessing stage, token-level k-nearest-neighbor (kNN) retrieval refines the output probability distribution. Crucially, we introduce a novel dual-scale retrieval coordination mechanism that synergistically integrates sentence- and token-level evidence. The method requires zero parameter updates, preserving full deployment compatibility with the original Whisper model. Evaluated on AISHELL-1 and KeSpeech dialect benchmarks, it achieves an average 18.7% relative reduction in word error rate (WER), markedly improving fine-grained recognition robustness. This work establishes an efficient, lightweight, plug-and-play enhancement paradigm for low-resource dialect ASR.

📝 Abstract
State-of-the-art models like OpenAI's Whisper exhibit strong performance in multilingual automatic speech recognition (ASR), but they still face challenges in accurately recognizing diverse subdialects. In this paper, we propose M2R-whisper, a novel multi-stage and multi-scale retrieval augmentation approach designed to enhance ASR performance in low-resource settings. Building on the principles of in-context learning (ICL) and retrieval-augmented techniques, our method employs sentence-level ICL in the pre-processing stage to harness contextual information, while integrating token-level k-Nearest Neighbors (kNN) retrieval as a post-processing step to further refine the final output distribution. By synergistically combining sentence-level and token-level retrieval strategies, M2R-whisper effectively mitigates various types of recognition errors. Experiments conducted on Mandarin and subdialect datasets, including AISHELL-1 and KeSpeech, demonstrate substantial improvements in ASR accuracy, all achieved without any parameter updates.
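The token-level post-processing step described above follows the general kNN-LM recipe: cache (hidden state, next token) pairs in a datastore, retrieve the k nearest neighbors of the current decoder state, and interpolate the resulting retrieval distribution with the model's own output distribution. A minimal sketch of that interpolation, with illustrative names and hyperparameters not taken from the paper:

```python
import numpy as np

def knn_interpolate(model_probs, hidden, datastore_keys, datastore_vals,
                    vocab_size, k=8, temperature=10.0, lam=0.3):
    """Hypothetical sketch of token-level kNN post-processing in the
    kNN-LM style: retrieve the k nearest cached hidden states, convert
    their distances into a distribution over their target tokens, and
    interpolate with the model's own distribution. All names and
    default values here are illustrative assumptions."""
    # L2 distance from the current hidden state to every datastore key
    dists = np.linalg.norm(datastore_keys - hidden, axis=1)
    nn = np.argsort(dists)[:k]
    # Softmax over negative distances of the k retrieved neighbors
    logits = -dists[nn] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Scatter neighbor weights onto their recorded target tokens
    knn_probs = np.zeros(vocab_size)
    for w, tok in zip(weights, datastore_vals[nn]):
        knn_probs[tok] += w
    # p = (1 - lam) * p_model + lam * p_knn
    return (1.0 - lam) * model_probs + lam * knn_probs
```

Because the interpolation only reshapes the output distribution at decode time, it requires no parameter updates, which matches the paper's fine-tuning-free setting.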
Problem

Research questions and friction points this paper is trying to address.

Whisper Model
Accent Recognition
Speech Accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

M2R-Whisper
Multi-stage and Multi-scale Retrieval Augmentation
Dialect Accent Recognition
Jiaming Zhou
TMCC, College of Computer Science, Nankai University, Tianjin, China
Shiwan Zhao
Independent Researcher; formerly Research Scientist at IBM Research - China (2000-2020)
AGI · Large Language Model · NLP · Speech · Recommender System
Jiabei He
TMCC, College of Computer Science, Nankai University, Tianjin, China
Hui Wang
TMCC, College of Computer Science, Nankai University, Tianjin, China
Wenjia Zeng
Lingxi (Beijing) Technology Co., Ltd.
Yong Chen
Lingxi (Beijing) Technology Co., Ltd.
Haoqin Sun
Nankai University
Affective computing · Speech signal processing · Audio understanding
Aobo Kong
Nankai University
NLP · LLM
Yong Qin
TMCC, College of Computer Science, Nankai University, Tianjin, China