🤖 AI Summary
Existing ASR post-processing methods suffer from poor correction performance on rare and domain-specific words, and often over-correct. Method: This paper proposes a generative LLM-based correction framework that integrates synthetically generated data with speech context. It jointly models N-best ASR hypotheses and phoneme-level contextual information, and introduces a rule-guided, LLM-augmented synthetic data construction strategy to enhance robustness on low-frequency words while mitigating over-correction. The approach incorporates phoneme embedding representations, LLM fine-tuning, and N-best hypothesis rescoring. Contribution/Results: Evaluated on English and Japanese ASR benchmarks, the method achieves significant improvements in rare-word correction accuracy, alongside consistent reductions in both word error rate (WER) and character error rate (CER), demonstrating strong cross-lingual generalization.
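The rule-guided synthetic data construction described above could look roughly like the following sketch: rare words are inserted into template sentences, and ASR-style errors are simulated by substituting phonetically similar words. The templates, rare-word list, and confusion map here are illustrative assumptions, not the paper's actual data or rules.

```python
# Hypothetical sketch of rule-guided synthetic data construction:
# plant a rare word in a template sentence (the clean reference),
# then simulate an ASR error by swapping in a phonetically similar
# confusion to form the noisy hypothesis. All data below is invented
# for illustration only.
import random

TEMPLATES = [
    "the doctor noted a {w} in the report",
    "results indicate {w} in the sample",
]
RARE_WORDS = ["femoral", "tachycardia"]
# Phonetically similar substitutions a recognizer might plausibly produce.
CONFUSIONS = {"femoral": "feral", "tachycardia": "tacky cardia"}

def make_pair(word, rng):
    """Return one (noisy hypothesis, clean reference) training pair."""
    ref = rng.choice(TEMPLATES).format(w=word)
    hyp = ref.replace(word, CONFUSIONS[word])
    return hyp, ref

rng = random.Random(0)
pairs = [make_pair(w, rng) for w in RARE_WORDS]
```

Fine-tuning the GER model on such pairs exposes it to rare words it would otherwise rarely see in natural training text.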
📝 Abstract
Generative error correction (GER) with large language models (LLMs) has emerged as an effective post-processing approach for improving automatic speech recognition (ASR) performance. However, it often struggles with rare or domain-specific words due to limited training data. Furthermore, existing LLM-based GER approaches rely primarily on textual information and neglect phonetic cues, which leads to over-correction. To address these issues, we propose a novel LLM-based GER approach that targets rare words and incorporates phonetic information. First, we generate synthetic data containing rare words for fine-tuning the GER model. Second, we integrate the ASR system's N-best hypotheses along with phonetic context to mitigate over-correction. Experimental results show that our method not only improves the correction of rare words but also reduces the word error rate (WER) and character error rate (CER) across both English and Japanese datasets.
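One plausible way to combine N-best hypotheses with phonetic context, as the abstract describes, is to serialize both into the correction prompt fed to the fine-tuned LLM. The prompt format, function name, and example data below are illustrative assumptions; the paper's actual input representation (e.g., its phoneme embeddings) may differ.

```python
# Hypothetical sketch: build a GER prompt that presents the ASR N-best
# hypotheses together with their phoneme sequences, so the LLM can use
# phonetic similarity to avoid over-correcting. Format is illustrative.
def build_ger_prompt(nbest, phonemes):
    """Combine N-best hypotheses and phoneme sequences into one prompt."""
    lines = ["Correct the ASR transcription using the hypotheses below."]
    for i, (hyp, phn) in enumerate(zip(nbest, phonemes), start=1):
        lines.append(f"Hypothesis {i}: {hyp}")
        lines.append(f"Phonemes {i}: {phn}")
    lines.append("Corrected transcription:")
    return "\n".join(lines)

nbest = [
    "the patient has a feral fracture",
    "the patient has a femoral fracture",
]
phonemes = [
    "DH AH P EY SH AH N T HH AE Z AH F EH R AH L ...",
    "DH AH P EY SH AH N T HH AE Z AH F EH M ER AH L ...",
]
prompt = build_ger_prompt(nbest, phonemes)
```

Presenting phoneme sequences alongside the text gives the model evidence that, for instance, "feral" and "femoral" are acoustically close, which helps it pick the rare but correct word rather than rewriting fluent-looking errors.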