🤖 AI Summary
To address the low automatic speech recognition (ASR) accuracy on severely dysarthric speech and the scarcity of authentic dysarthric speech data, this paper presents a case study that customizes an ASR system for one Hungarian speaker using personalized speech synthesis and controllable dysarthria modeling. Synthetic dysarthric speech with continuously varying severity is generated by a personalized text-to-speech (TTS) system that draws on the speaker's premorbid recordings and speaker embedding interpolation. The monolingual Hungarian ASR model FastConformer_Hu is then fine-tuned on a small amount of real dysarthric speech together with this synthetic speech. Fine-tuning reduces the character error rate (CER) from 36–51% in the zero-shot setting to 7.3%. FastConformer_Hu significantly outperforms Whisper-turbo fine-tuned on the same data, and the inclusion of synthetic speech contributes an 18% relative CER reduction. The two core contributions are (1) speaker embedding interpolation for severity-controlled dysarthric speech synthesis and (2) synthesis-driven customization of an ASR model with minimal authentic data.
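The severity control described above can be pictured as moving along a line between two speaker embeddings. The following is a minimal sketch only: it assumes linear interpolation between one embedding from the speaker's premorbid recordings and one from their dysarthric speech, which the summary does not confirm. The names `interpolate_embeddings`, `severity_continuum`, and `alpha` are hypothetical, and the two-dimensional vectors are toy values, not real embeddings.

```python
from typing import List, Sequence


def interpolate_embeddings(
    premorbid: Sequence[float],
    dysarthric: Sequence[float],
    alpha: float,
) -> List[float]:
    """Linearly blend two speaker embeddings (assumed interpolation scheme).

    alpha = 0.0 returns the premorbid embedding, alpha = 1.0 the dysarthric one.
    """
    if len(premorbid) != len(dysarthric):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * p + alpha * d for p, d in zip(premorbid, dysarthric)]


def severity_continuum(
    premorbid: Sequence[float],
    dysarthric: Sequence[float],
    steps: int,
) -> List[List[float]]:
    """Return `steps` evenly spaced embeddings from premorbid to dysarthric."""
    if steps < 2:
        raise ValueError("need at least two steps to span both endpoints")
    return [
        interpolate_embeddings(premorbid, dysarthric, i / (steps - 1))
        for i in range(steps)
    ]


# Toy 2-dimensional vectors stand in for real speaker embeddings.
continuum = severity_continuum([0.0, 2.0], [1.0, 0.0], steps=5)
```

In such a setup, each interpolated embedding would condition the TTS system to produce speech at an intermediate severity, giving the ASR model a graded set of training examples rather than only the two extremes.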
📝 Abstract
We present a case study on developing a customized speech-to-text system for a Hungarian speaker with severe dysarthria. State-of-the-art automatic speech recognition (ASR) models struggle with zero-shot transcription of dysarthric speech, yielding high error rates. To improve performance with limited real dysarthric data, we fine-tune an ASR model using synthetic speech generated via a personalized text-to-speech (TTS) system. We introduce a method for generating synthetic dysarthric speech with controlled severity by leveraging premorbid recordings of the speaker and speaker embedding interpolation, enabling ASR fine-tuning on a continuum of impairments. Fine-tuning on both real and synthetic dysarthric speech reduces the character error rate (CER) from 36–51% (zero-shot) to 7.3%. Our monolingual FastConformer_Hu ASR model significantly outperforms Whisper-turbo when fine-tuned on the same data, and the inclusion of synthetic speech contributes to an 18% relative CER reduction. These results highlight the potential of personalized ASR systems for improving accessibility for individuals with severe speech impairments.
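For readers unfamiliar with the metric, CER is commonly computed as the character-level edit distance between the reference and the hypothesis, divided by the reference length. The sketch below uses that standard definition; the abstract does not state the exact text normalization or scoring tool, so the helper names `edit_distance`, `cer`, and `relative_reduction` are illustrative assumptions. The last lines apply the relative-reduction formula to the zero-shot range and the final CER reported above.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative error reduction, e.g. 0.18 means an 18% relative reduction."""
    if before <= 0:
        raise ValueError("baseline error must be positive")
    return (before - after) / before


# Relative reduction from the reported zero-shot range (36% and 51%) to 7.3%.
low = relative_reduction(0.36, 0.073)
high = relative_reduction(0.51, 0.073)
```

With the reported figures, `low` and `high` come out at roughly 0.80 and 0.86. This is a different comparison from the 18% relative reduction in the abstract, which measures the effect of adding synthetic speech to the fine-tuning data.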