Mixture of LoRA Experts with Multi-Modal and Multi-Granularity LLM Generative Error Correction for Accented Speech Recognition

📅 2025-07-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the degradation of automatic speech recognition (ASR) performance on accented speech, this paper proposes a generative error correction (GER) framework that jointly leverages phonetic and semantic information. Methodologically, it introduces a novel multimodal (acoustic–phoneme–text) and multi-granularity (phoneme-level hypothesis generation and N-best rescoring) collaborative modeling mechanism. Furthermore, it designs a LoRA-based Mixture-of-Experts (MoE) architecture with hierarchical routing and dynamic thresholding to enable accent-aware fusion and precise error correction. A three-stage training strategy integrates LoRA fine-tuning with the generative capabilities of large language models. Evaluated on multiple accented English datasets, the proposed method achieves a 67.35% relative reduction in word error rate (WER) over the Whisper-large-v3 baseline, demonstrating substantial improvements in accent robustness.
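The LoRA-based Mixture-of-Experts with dynamic thresholding described above can be sketched roughly as follows. This is a minimal illustration, not the paper's HDMoLE implementation: the layer names, the uniform 1/K threshold, and the flat (non-hierarchical) router are all simplifying assumptions.

```python
import torch
import torch.nn as nn

class LoRAExpert(nn.Module):
    """One mono-accent low-rank adapter: x -> B(A(x)) with rank r."""
    def __init__(self, d_in, d_out, r=4):
        super().__init__()
        self.A = nn.Linear(d_in, r, bias=False)
        self.B = nn.Linear(r, d_out, bias=False)
        nn.init.zeros_(self.B.weight)  # standard LoRA init: adapter starts at zero

    def forward(self, x):
        return self.B(self.A(x))

class DynamicThresholdMoLoRA(nn.Module):
    """Mixture of LoRA experts (sketch): a router scores all experts per
    token, and only experts whose normalized score clears a threshold
    (here simply 1/K, a stand-in for the paper's dynamic threshold)
    contribute to the output alongside the frozen base projection."""
    def __init__(self, d_model, num_experts=4, r=4):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)  # stand-in for a frozen pretrained weight
        self.experts = nn.ModuleList(
            LoRAExpert(d_model, d_model, r) for _ in range(num_experts)
        )
        self.router = nn.Linear(d_model, num_experts)
        self.num_experts = num_experts

    def forward(self, x):
        scores = torch.softmax(self.router(x), dim=-1)     # (..., K)
        threshold = 1.0 / self.num_experts                 # simplified dynamic threshold
        gated = scores * (scores >= threshold).float()     # drop low-scoring experts
        gated = gated / gated.sum(dim=-1, keepdim=True).clamp_min(1e-9)
        out = self.base(x)
        for k, expert in enumerate(self.experts):
            out = out + gated[..., k:k+1] * expert(x)
        return out
```

Because the softmax maximum is always at least the mean 1/K, at least one expert survives the threshold, so the renormalization is always well defined.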

📝 Abstract
Despite substantial improvements in ASR, performance tends to degrade when faced with adverse conditions such as speaker accents. Generative error correction (GER) leverages the rich linguistic knowledge and exceptional reasoning ability of LLMs, significantly outperforming typical LM methods. However, it lacks specificity in accented speech scenarios. In this study, we leverage GER to improve the accuracy of transcription predictions by addressing the two primary features of accented speech recognition. To fully leverage pronunciation information, we propose the multi-modal GER, which integrates pronunciation information from the speech modality, and the multi-granularity GER, which incorporates fine-grained phoneme-level information related to pronunciation. These two methods enable the LLM to utilize the pronunciation information of accented speech and the semantic information from word-level hypotheses for accurate transcription predictions through LoRA fine-tuning. On the one hand, we employ a three-stage training strategy to train separate multi-modal GER models for each accent to obtain mono-accent LoRA experts. By adopting our proposed HDMoLE method, which incorporates hierarchical routing and dynamic thresholds within the mixture of LoRA experts, we effectively merge multiple mono-accent LoRA experts within a single multi-modal GER to overcome the challenges posed by accent diversity. On the other hand, multi-granularity GER leverages the N-best word-level and phoneme-level hypotheses generated by the HDMoLE model to predict the final accented speech transcriptions. Experimental results on the multi-accent English dataset demonstrate the efficacy of our proposed methods. Our methods achieve a remarkable relative WER reduction of 67.35% compared to the Whisper-large-v3 baseline.
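The multi-granularity step in the abstract feeds both N-best word-level hypotheses and phoneme-level hypotheses to the LLM, which then predicts the final transcription. A minimal sketch of how such a prompt could be assembled is shown below; the template wording and function name are hypothetical, not the paper's actual prompt.

```python
def build_ger_prompt(word_hyps, phoneme_hyps):
    """Compose a multi-granularity GER prompt (sketch): N-best word-level
    ASR hypotheses plus their phoneme-level counterparts, asking the LLM
    to predict the true transcription."""
    lines = ["Below are N-best hypotheses for an accented utterance."]
    lines.append("Word-level hypotheses:")
    for i, hyp in enumerate(word_hyps, 1):
        lines.append(f"  {i}. {hyp}")
    lines.append("Phoneme-level hypotheses:")
    for i, phones in enumerate(phoneme_hyps, 1):
        lines.append(f"  {i}. {' '.join(phones)}")
    lines.append("Predict the correct transcription:")
    return "\n".join(lines)
```

The word-level list carries the semantic evidence while the phoneme-level list preserves pronunciation detail that word hypotheses may have already misrendered, which is the core idea behind combining the two granularities.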
Problem

Research questions and friction points this paper is trying to address.

Improves accented speech recognition accuracy using LLMs
Integrates multi-modal pronunciation and phoneme-level information
Combines mono-accent experts via hierarchical dynamic routing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal GER integrates pronunciation from speech
Multi-granularity GER uses phoneme-level pronunciation details
HDMoLE merges mono-accent LoRA experts dynamically
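The hierarchical routing named in the bullets above can be pictured as two stacked routers: a global one weighing accent groups and a local one weighing experts within each group, with the joint expert weight being the product of the two levels. The sketch below is an assumed two-level structure for illustration, not HDMoLE's exact routing.

```python
import torch
import torch.nn as nn

class HierarchicalRouter(nn.Module):
    """Two-level routing sketch: a group router distributes weight over
    accent groups, then a per-group local router distributes it over
    that group's LoRA experts; joint weights sum to 1 across all experts."""
    def __init__(self, d_model, num_groups=3, experts_per_group=2):
        super().__init__()
        self.group_router = nn.Linear(d_model, num_groups)
        self.local_routers = nn.ModuleList(
            nn.Linear(d_model, experts_per_group) for _ in range(num_groups)
        )

    def forward(self, x):
        g = torch.softmax(self.group_router(x), dim=-1)  # (..., G)
        weights = []
        for i, local in enumerate(self.local_routers):
            l = torch.softmax(local(x), dim=-1)          # (..., E) within group i
            weights.append(g[..., i:i + 1] * l)          # joint group-and-expert weight
        return torch.cat(weights, dim=-1)                # (..., G * E), rows sum to 1
```

Splitting the decision this way lets the first level specialize in coarse accent discrimination while each second level only has to separate the few experts inside its group.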