Improved Dysarthric Speech to Text Conversion via TTS Personalization

📅 2025-08-08
🤖 AI Summary
To address the low automatic speech recognition (ASR) accuracy and extreme scarcity of authentic speech data for speakers with severe dysarthria, this paper proposes an ASR enhancement method combining personalized speech synthesis with controllable dysarthria modeling. Dysarthric speech with continuously varying severity is synthesized by interpolating between speaker embeddings derived from the patient's pre-morbid recordings and from current dysarthric speech; the monolingual Hungarian ASR model FastConformer_Hu is then fine-tuned on this synthetic data together with a small amount of real dysarthric utterances. This approach overcomes the pathological-speech data bottleneck: character error rate (CER) drops from 36–51% (zero-shot) to 7.3%, the fine-tuned FastConformer_Hu significantly outperforms Whisper-turbo fine-tuned on the same data, and the inclusion of synthetic speech accounts for an 18% relative CER reduction. The core contributions are twofold: (1) the first application of speaker embedding interpolation for controllable dysarthric speech synthesis, and (2) synthesis-driven, ultra-low-resource customization of ASR models, enabling effective adaptation with minimal authentic data.
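The severity-control idea described above can be sketched as a linear interpolation between two speaker embeddings. This is a minimal illustration of the general technique, not the authors' implementation; the function name and the toy 3-dimensional embeddings are hypothetical (real speaker embeddings are typically hundreds of dimensions).

```python
def interpolate_embedding(e_premorbid, e_dysarthric, alpha):
    """Linear interpolation between two speaker embeddings.

    alpha = 0.0 reproduces the pre-morbid (healthy) voice,
    alpha = 1.0 the current dysarthric voice; intermediate values
    condition the TTS on intermediate severities.
    """
    return [(1.0 - alpha) * h + alpha * d
            for h, d in zip(e_premorbid, e_dysarthric)]

# Toy embeddings (hypothetical values, for illustration only).
e_healthy = [0.2, -0.1, 0.5]
e_severe = [0.8, 0.4, -0.2]

# A continuum of severities to condition synthetic training data on.
alphas = [i / 4 for i in range(5)]  # 0.0, 0.25, 0.5, 0.75, 1.0
severity_grid = [interpolate_embedding(e_healthy, e_severe, a) for a in alphas]
```

Each interpolated embedding would then be fed to the personalized TTS to synthesize training utterances at that severity level.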

📝 Abstract
We present a case study on developing a customized speech-to-text system for a Hungarian speaker with severe dysarthria. State-of-the-art automatic speech recognition (ASR) models struggle with zero-shot transcription of dysarthric speech, yielding high error rates. To improve performance with limited real dysarthric data, we fine-tune an ASR model using synthetic speech generated via a personalized text-to-speech (TTS) system. We introduce a method for generating synthetic dysarthric speech with controlled severity by leveraging premorbidity recordings of the given speaker and speaker embedding interpolation, enabling ASR fine-tuning on a continuum of impairments. Fine-tuning on both real and synthetic dysarthric speech reduces the character error rate (CER) from 36-51% (zero-shot) to 7.3%. Our monolingual FastConformer_Hu ASR model significantly outperforms Whisper-turbo when fine-tuned on the same data, and the inclusion of synthetic speech contributes to an 18% relative CER reduction. These results highlight the potential of personalized ASR systems for improving accessibility for individuals with severe speech impairments.
Problem

Research questions and friction points this paper is trying to address.

Develop an accurate ASR system for a Hungarian speaker with severe dysarthria
Generate synthetic dysarthric speech with a personalized TTS system
Reduce the character error rate significantly via fine-tuning on limited real data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Personalized TTS for synthetic dysarthric speech
Speaker embedding interpolation for severity control
Fine-tuning ASR with real and synthetic data
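The third point, mixing scarce real recordings with abundant synthetic ones for fine-tuning, can be sketched as building a combined training manifest. The mixing ratio, helper name, and file paths below are hypothetical illustrations; the paper does not specify this exact procedure.

```python
import random

def build_finetune_manifest(real_utts, synthetic_utts, synth_ratio=0.8, seed=0):
    """Combine real and TTS-generated utterances into one training list.

    Samples enough synthetic utterances so that roughly `synth_ratio`
    of the final set is synthetic (requires synth_ratio < 1.0).
    Utterances are (audio_path, transcript) pairs.
    """
    rng = random.Random(seed)  # fixed seed for reproducible experiments
    n_synth = int(round(len(real_utts) * synth_ratio / (1.0 - synth_ratio)))
    n_synth = min(n_synth, len(synthetic_utts))
    manifest = list(real_utts) + rng.sample(synthetic_utts, n_synth)
    rng.shuffle(manifest)
    return manifest

# Hypothetical example: 10 real utterances, a large pool of synthetic ones.
real = [(f"real_{i}.wav", "szoveg") for i in range(10)]
synth = [(f"synth_{i}.wav", "szoveg") for i in range(100)]
manifest = build_finetune_manifest(real, synth, synth_ratio=0.8, seed=0)
```

The resulting list would then be converted to the training-manifest format expected by the chosen ASR toolkit.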
Péter Mihajlik
Department of Telecommunications and Artificial Intelligence, Budapest University of Technology, Hungary
Éva Székely
Assistant Professor, KTH Royal Institute of Technology
speech technology, speech synthesis, deep learning, generative modelling, bias detection
Piroska Barta
Department of Telecommunications and Artificial Intelligence, Budapest University of Technology, Hungary
Máté Soma Kádár
Hungarian Research Centre for Linguistics, HUN-REN, Hungary
Gergely Dobsinszki
SpeechTex Ltd., Hungary
László Tóth
Institute of Informatics, University of Szeged, Hungary