Improved Dysarthric Speech to Text Conversion via TTS Personalization

📅 2025-08-08
🤖 AI Summary
To address the low automatic speech recognition (ASR) accuracy and extreme scarcity of authentic speech data for speakers with severe dysarthria, this paper proposes an ASR enhancement method combining personalized speech synthesis with controllable dysarthria modeling. Dysarthric speech with continuously varying severity is synthesized by interpolating between speaker embeddings derived from the patient's pre-morbid recordings and from current dysarthric speech; the monolingual Hungarian ASR model FastConformer_Hu is then fine-tuned on this synthetic data together with a small amount of real dysarthric utterances. This approach overcomes the pathological-speech data bottleneck: character error rate (CER) drops from 36–51% (zero-shot) to 7.3%, the fine-tuned FastConformer_Hu significantly outperforms Whisper-turbo fine-tuned on the same data, and the inclusion of synthetic speech accounts for an 18% relative CER reduction. The core contributions are twofold: (1) the first application of speaker embedding interpolation for controllable dysarthric speech synthesis, and (2) synthesis-driven, ultra-low-resource customization of ASR models, enabling effective adaptation with minimal authentic data.
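The severity-control idea described above can be sketched as a linear interpolation between two speaker embeddings. This is a minimal illustration of the general technique, not the authors' implementation; the function name and the toy 3-dimensional embeddings are hypothetical (real speaker embeddings are typically hundreds of dimensions).

```python
def interpolate_embedding(e_premorbid, e_dysarthric, alpha):
    """Linear interpolation between two speaker embeddings.

    alpha = 0.0 reproduces the pre-morbid (healthy) voice,
    alpha = 1.0 the current dysarthric voice; intermediate values
    condition the TTS on intermediate severities.
    """
    return [(1.0 - alpha) * h + alpha * d
            for h, d in zip(e_premorbid, e_dysarthric)]

# Toy embeddings (hypothetical values, for illustration only).
e_healthy = [0.2, -0.1, 0.5]
e_severe = [0.8, 0.4, -0.2]

# A continuum of severities to condition synthetic training data on.
alphas = [i / 4 for i in range(5)]  # 0.0, 0.25, 0.5, 0.75, 1.0
severity_grid = [interpolate_embedding(e_healthy, e_severe, a) for a in alphas]
```

Each interpolated embedding would then be fed to the personalized TTS to synthesize training utterances at that severity level.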

📝 Abstract
We present a case study on developing a customized speech-to-text system for a Hungarian speaker with severe dysarthria. State-of-the-art automatic speech recognition (ASR) models struggle with zero-shot transcription of dysarthric speech, yielding high error rates. To improve performance with limited real dysarthric data, we fine-tune an ASR model using synthetic speech generated via a personalized text-to-speech (TTS) system. We introduce a method for generating synthetic dysarthric speech with controlled severity by leveraging premorbidity recordings of the given speaker and speaker embedding interpolation, enabling ASR fine-tuning on a continuum of impairments. Fine-tuning on both real and synthetic dysarthric speech reduces the character error rate (CER) from 36-51% (zero-shot) to 7.3%. Our monolingual FastConformer_Hu ASR model significantly outperforms Whisper-turbo when fine-tuned on the same data, and the inclusion of synthetic speech contributes to an 18% relative CER reduction. These results highlight the potential of personalized ASR systems for improving accessibility for individuals with severe speech impairments.
Problem

Research questions and friction points this paper is trying to address.

Develop an accurate ASR system for a Hungarian speaker with severe dysarthria
Generate synthetic dysarthric speech with a personalized TTS system
Reduce the character error rate significantly via fine-tuning on limited real data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Personalized TTS for synthetic dysarthric speech
Speaker embedding interpolation for severity control
Fine-tuning ASR with real and synthetic data
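The third point, mixing scarce real recordings with abundant synthetic ones for fine-tuning, can be sketched as building a combined training manifest. The mixing ratio, helper name, and file paths below are hypothetical illustrations; the paper does not specify this exact procedure.

```python
import random

def build_finetune_manifest(real_utts, synthetic_utts, synth_ratio=0.8, seed=0):
    """Combine real and TTS-generated utterances into one training list.

    Samples enough synthetic utterances so that roughly `synth_ratio`
    of the final set is synthetic (requires synth_ratio < 1.0).
    Utterances are (audio_path, transcript) pairs.
    """
    rng = random.Random(seed)  # fixed seed for reproducible experiments
    n_synth = int(round(len(real_utts) * synth_ratio / (1.0 - synth_ratio)))
    n_synth = min(n_synth, len(synthetic_utts))
    manifest = list(real_utts) + rng.sample(synthetic_utts, n_synth)
    rng.shuffle(manifest)
    return manifest

# Hypothetical example: 10 real utterances, a large pool of synthetic ones.
real = [(f"real_{i}.wav", "szoveg") for i in range(10)]
synth = [(f"synth_{i}.wav", "szoveg") for i in range(100)]
manifest = build_finetune_manifest(real, synth, synth_ratio=0.8, seed=0)
```

The resulting list would then be converted to the training-manifest format expected by the chosen ASR toolkit.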
Péter Mihajlik
Department of Telecommunications and Artificial Intelligence, Budapest University of Technology, Hungary
Éva Székely
Assistant Professor, KTH Royal Institute of Technology
speech technology, speech synthesis, deep learning, generative modelling, bias detection
Piroska Barta
Department of Telecommunications and Artificial Intelligence, Budapest University of Technology, Hungary
Máté Soma Kádár
Hungarian Research Centre for Linguistics, HUN-REN, Hungary
Gergely Dobsinszki
SpeechTex Ltd., Hungary
László Tóth
Institute of Informatics, University of Szeged, Hungary