🤖 AI Summary
Sentence-level dysarthric speech recognition (DSR) suffers from extreme scarcity of target-speaker data (zero- or one-shot) and from domain mismatch, which together render conventional data augmentation ineffective. To address this, the paper proposes a generative data augmentation method grounded in text-semantic matching. Unlike prior approaches, it requires no large-scale source-speaker data; instead, it uses a novel text-coverage strategy to align synthesized speech with the target speaker's pronunciation characteristics. Using only zero or one utterance from the target speaker, it generates high-fidelity, semantically consistent sentence-level augmented samples. Evaluated on low-resource DSR tasks, the method significantly improves recognition accuracy for unseen speakers, achieving a 12.6% relative word error rate (WER) reduction over baselines in zero-/one-shot settings. The work presents this as a deployable, generalizable, and lightweight data augmentation approach for real-world applications such as speech rehabilitation and daily communication.
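For readers unfamiliar with the metric, the sketch below shows how a word error rate and a relative WER reduction are conventionally computed. It is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, and the baseline and augmented WER values (50.0% and 43.7%) are made-up numbers chosen only so that the arithmetic lands on a 12.6% relative reduction.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values, for illustration only (not taken from the paper).
baseline_wer, augmented_wer = 0.500, 0.437
print(f"{relative_wer_reduction(baseline_wer, augmented_wer):.1%}")
```

Note that a relative reduction is measured against the baseline WER, so it is not the same as the absolute difference in percentage points (6.3 points in this made-up case).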
📝 Abstract
Dysarthric speech recognition (DSR) research has made notable progress in recent years, moving from recognizing isolated words to understanding sentence-level speech, driven by the pressing communication needs of individuals with dysarthria. Nevertheless, data scarcity remains a substantial obstacle to building effective sentence-level DSR systems. Dysarthric data augmentation (DDA) has emerged as a promising response. Generative models are frequently used to synthesize training data for automatic speech recognition, but their effectiveness hinges on how accurately the synthesized data represents the target domain. The wide variability in pronunciation among dysarthric speakers makes it extremely difficult for models trained on data from existing speakers to produce useful augmented data, especially in zero-shot or one-shot settings. To address this limitation, we propose a novel text-coverage strategy designed for text-matching data synthesis. This strategy enables efficient zero/one-shot DDA and substantially improves DSR performance on unseen dysarthric speakers. Such improvements matter in practical applications, including dysarthria rehabilitation programs and everyday common-sentence communication.