🤖 AI Summary
To address the low recognition accuracy of ASR systems on dysarthric speech, this paper proposes a two-stage Generative Error Rectification (GER) framework. First, an end-to-end ASR model (e.g., Whisper or Conformer) generates initial transcription hypotheses; second, a fine-tuned large language model (LLM) performs generative correction guided by semantic coherence and phonetic consistency. The key contribution is the first systematic integration of generative LLM-based correction into dysarthric ASR, complemented by a novel hypothesis selection strategy that improves robustness against phonetic variability and reveals the complementary roles of acoustic and linguistic modeling. Evaluated on the Speech Accessibility Project dataset, GER significantly improves word-level accuracy on both structured and spontaneous utterances, correcting substitution, insertion, and deletion errors; single-word recognition, however, remains challenging.
📝 Abstract
Despite the remarkable progress of end-to-end Automatic Speech Recognition (ASR) engines, accurately transcribing dysarthric speech remains a major challenge. In this work, we propose a two-stage framework for the Speech Accessibility Project Challenge at INTERSPEECH 2025, which combines cutting-edge speech recognition models with LLM-based generative error correction (GER). We assess different configurations of model scales and training strategies, incorporating a dedicated hypothesis selection step to improve transcription accuracy. Experiments on the Speech Accessibility Project dataset demonstrate the strength of our approach on structured and spontaneous speech, while highlighting challenges in single-word recognition. Through comprehensive analysis, we provide insights into the complementary roles of acoustic and linguistic modeling in dysarthric speech recognition.
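The hypothesis selection idea described above can be sketched as a simple rerank over the ASR's N-best list. This is a minimal illustration, not the paper's method: the phonetic-consistency term here is approximated by character-level edit distance to the top-1 hypothesis, and the weight `alpha` is an assumed hyperparameter.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def select_hypothesis(nbest, alpha=0.5):
    """Rerank N-best (text, asr_logprob) pairs.

    Score = asr_logprob - alpha * length-normalized edit distance to the
    top-1 hypothesis, so candidates that drift phonetically far from the
    acoustically best transcription are penalized. `alpha` balances the
    acoustic and consistency terms (an illustrative choice, not from the
    paper).
    """
    anchor = nbest[0][0]  # top-1 ASR hypothesis as the phonetic anchor

    def score(item):
        text, logprob = item
        dist = edit_distance(text, anchor) / max(len(anchor), 1)
        return logprob - alpha * dist

    return max(nbest, key=score)[0]
```

With `alpha=0` the selection reduces to picking the highest-probability hypothesis; larger `alpha` increasingly favors candidates close to the top-1 transcription, which matters when dysarthric phonetic variability makes isolated high-scoring hypotheses unreliable.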