Facilitating Personalized TTS for Dysarthric Speakers Using Knowledge Anchoring and Curriculum Learning

📅 2025-08-14
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the scarcity of high-quality speech data, poor articulation, and labeling difficulty in personalized text-to-speech (TTS) for dysarthric speakers, this paper proposes a teacher–student framework that combines knowledge anchoring and curriculum learning. The framework transfers healthy-speech priors to dysarthric speech via a knowledge anchoring mechanism, applies progressive curriculum learning for low-resource adaptation, and integrates audio augmentation with zero-shot multi-speaker modeling. Experiments show that the method significantly reduces phoneme error rates under extremely low-resource conditions (under 10 minutes of clear speech) while preserving high speaker similarity and natural prosody, substantially improving speech intelligibility. This work establishes a scalable, robust paradigm for clinical-grade personalized TTS, bridging the gap between limited dysarthric data and high-fidelity, speaker-consistent voice generation.
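The knowledge-anchoring idea described above can be sketched as a distillation-style objective: the student model is trained against ground-truth targets while being anchored to the predictions of a teacher trained on healthy speech. The function names, the MSE choice, and the `alpha` weighting below are illustrative assumptions, not the paper's actual implementation.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def anchored_loss(student_out, teacher_out, target, alpha=0.5):
    """Hypothetical knowledge-anchoring loss: a task term against the
    ground-truth acoustic targets, plus a distillation term that pulls
    the student toward the healthy-speech teacher's predictions.
    `alpha` trades off the two terms and is an assumed hyperparameter."""
    task_loss = mse(student_out, target)       # fit the dysarthric speaker's data
    distill_loss = mse(student_out, teacher_out)  # stay anchored to healthy priors
    return alpha * task_loss + (1 - alpha) * distill_loss

# Toy usage: student output halfway between target and teacher.
loss = anchored_loss([0.5, 0.5], [1.0, 1.0], [0.0, 0.0], alpha=0.5)
```

In practice the vectors would be mel-spectrogram frames or hidden states, and the anchoring term is what lets very limited dysarthric data fine-tune the model without overwriting the teacher's articulation knowledge.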

📝 Abstract
Dysarthric speakers experience substantial communication challenges due to impaired motor control of the speech apparatus, which leads to reduced speech intelligibility. This creates significant obstacles in dataset curation since actual recording of long, articulate sentences for the objective of training personalized TTS models becomes infeasible. Thus, the limited availability of audio data, in addition to the articulation errors that are present within the audio, complicates personalized speech synthesis for target dysarthric speaker adaptation. To address this, we frame the issue as a domain transfer task and introduce a knowledge anchoring framework that leverages a teacher-student model, enhanced by curriculum learning through audio augmentation. Experimental results show that the proposed zero-shot multi-speaker TTS model effectively generates synthetic speech with markedly reduced articulation errors and high speaker fidelity, while maintaining prosodic naturalness.
Problem

Research questions and friction points this paper is trying to address.

Enabling personalized TTS for dysarthric speakers with limited data
Reducing articulation errors in synthetic dysarthric speech
Maintaining speaker fidelity and prosodic naturalness in TTS output
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge anchoring with teacher-student model
Curriculum learning via audio augmentation
Zero-shot multi-speaker TTS synthesis
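The curriculum-learning-via-augmentation contribution listed above can be sketched as a staged training schedule: the model first sees clean audio, then progressively stronger augmentations. The stage severities and the noise-based augmentation below are illustrative assumptions; the paper's actual perturbations are not specified here.

```python
import random

def augment(waveform, severity):
    """Toy augmentation: additive noise scaled by severity. A stand-in
    for real perturbations (e.g. tempo or spectral distortions) that
    approximate harder, less articulate speech."""
    rng = random.Random(0)  # fixed seed for reproducibility
    return [x + severity * (rng.random() - 0.5) for x in waveform]

def curriculum_stages(samples, severities=(0.0, 0.3, 0.6, 1.0)):
    """Yield (severity, augmented_samples) pairs in order of increasing
    difficulty, so training proceeds from clean audio toward heavily
    perturbed audio. The severity schedule is an assumed example."""
    for s in severities:
        yield s, [augment(x, s) for x in samples]

# Toy usage: four stages over one two-sample waveform.
stages = list(curriculum_stages([[0.1, 0.2]]))
```

A severity of 0.0 leaves the waveform untouched, so the first stage trains on the original recordings before any augmentation is introduced.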