🤖 AI Summary
Whisper's accuracy degrades on low-resource subdialects due to accent-induced phoneme confusions and syntactic mismatches. To address this, the authors propose M2R-whisper, a fine-tuning-free, multi-stage, multi-scale retrieval-augmented framework. In the preprocessing stage, sentence-level in-context learning (ICL) captures dialect-specific contextual patterns; in the postprocessing stage, token-level k-nearest-neighbor (kNN) retrieval refines the output probability distribution. A dual-scale coordination mechanism combines the sentence- and token-level retrieval evidence. The method requires no parameter updates, preserving full deployment compatibility with the original Whisper model. Evaluated on the Mandarin and subdialect benchmarks AISHELL-1 and KeSpeech, it achieves an average 18.7% relative reduction in word error rate (WER), markedly improving recognition robustness. This work establishes an efficient, lightweight, plug-and-play enhancement paradigm for low-resource dialect ASR.
📝 Abstract
State-of-the-art models like OpenAI's Whisper exhibit strong performance in multilingual automatic speech recognition (ASR), but they still face challenges in accurately recognizing diverse subdialects. In this paper, we propose M2R-whisper, a novel multi-stage and multi-scale retrieval augmentation approach designed to enhance ASR performance in low-resource settings. Building on the principles of in-context learning (ICL) and retrieval-augmented techniques, our method employs sentence-level ICL in the pre-processing stage to harness contextual information, while integrating token-level k-Nearest Neighbors (kNN) retrieval as a post-processing step to further refine the final output distribution. By synergistically combining sentence-level and token-level retrieval strategies, M2R-whisper effectively mitigates various types of recognition errors. Experiments conducted on Mandarin and subdialect datasets, including AISHELL-1 and KeSpeech, demonstrate substantial improvements in ASR accuracy, all achieved without any parameter updates.
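The token-level post-processing step can be pictured with a kNN-LM-style interpolation: cached decoder states act as datastore keys, the tokens emitted at those states act as values, and the retrieval distribution is mixed into the model's output distribution. The sketch below is illustrative only; the function name, hyperparameters (`k`, `temperature`, `lam`), and the use of L2 distance are assumptions, not details taken from the paper.

```python
import numpy as np

def knn_interpolate(p_model, query, keys, values, vocab_size,
                    k=4, temperature=10.0, lam=0.3):
    """Mix the ASR model's token distribution with a kNN retrieval
    distribution (kNN-LM style; all hyperparameters are illustrative)."""
    # L2 distances from the current decoder hidden state to all datastore keys.
    dists = np.linalg.norm(keys - query, axis=1)
    nn = np.argsort(dists)[:k]                 # indices of the k nearest neighbors
    # Softmax over negative distances -> retrieval weights.
    logits = -dists[nn] / temperature
    w = np.exp(logits - logits.max())
    w /= w.sum()
    # Scatter neighbor weights onto the tokens stored with those states.
    p_knn = np.zeros(vocab_size)
    for weight, tok in zip(w, values[nn]):
        p_knn[tok] += weight
    # Final distribution: convex combination of model and retrieval estimates.
    return lam * p_knn + (1.0 - lam) * p_model

# Toy example: 5-token vocabulary, 3-dim hidden states, tiny datastore.
rng = np.random.default_rng(0)
keys = rng.normal(size=(100, 3))               # cached decoder states
values = rng.integers(0, 5, size=100)          # tokens emitted at those states
p_model = np.full(5, 0.2)                      # uniform model distribution
p = knn_interpolate(p_model, keys[0], keys, values, vocab_size=5)
assert np.isclose(p.sum(), 1.0)
```

Because the datastore is built from dialect speech offline and only the output distribution is adjusted at decode time, no Whisper parameters change, matching the paper's zero-update claim.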