Zero- and One-Shot Data Augmentation for Sentence-Level Dysarthric Speech Recognition in Constrained Scenarios

📅 2025-10-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Sentence-level dysarthric speech recognition (DSR) often has little or no data from the target speaker (zero- or one-shot settings), and domain mismatch renders conventional data augmentation ineffective. To address this, the paper proposes a generative data augmentation method grounded in text-semantic matching. Unlike prior approaches, it requires no large-scale source-speaker data; instead, it uses a novel text-coverage strategy to align synthesized speech with the target speaker's pronunciation characteristics. Using zero or one utterance from the target speaker, it generates high-fidelity, semantically consistent sentence-level augmented samples. Evaluated on low-resource DSR tasks, the method improves recognition accuracy for unseen speakers, achieving a 12.6% relative WER reduction over baselines in zero-/one-shot settings. The work presents this as a deployable, generalizable, and lightweight data augmentation paradigm for real-world applications such as speech rehabilitation and daily communication.
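For readers unfamiliar with the metric, the sketch below shows how word error rate (WER) and a relative WER reduction are conventionally computed. It is a generic illustration, not the paper's evaluation code: the function names `word_error_rate` and `relative_wer_reduction` are hypothetical, and the numbers in the usage lines are made up for the example rather than taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that the new system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative values only (not results from the paper).
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
reduction = relative_wer_reduction(baseline_wer=0.50, new_wer=0.40)
```

A relative reduction is measured against the baseline WER rather than in absolute percentage points, so the same relative figure can correspond to very different absolute changes depending on how high the baseline is.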

📝 Abstract

Dysarthric speech recognition (DSR) research has progressed in recent years from recognizing individual words to understanding sentence-level expressions, driven by the communication needs of individuals with dysarthria. Nevertheless, data scarcity remains a substantial obstacle to developing effective sentence-level DSR systems. Dysarthric data augmentation (DDA) has emerged as a promising response. Generative models are frequently used to produce training data for automatic speech recognition, but their effectiveness depends on how accurately the synthesized data represents the target domain. The wide variability in pronunciation among dysarthric speakers makes it extremely difficult for models trained on existing speakers to produce useful augmented data, especially in zero-shot or one-shot settings. To address this limitation, we propose a novel text-coverage strategy designed for text-matching data synthesis. The strategy enables efficient zero- and one-shot DDA and substantially improves DSR performance on unseen dysarthric speakers. These improvements matter in practical applications such as dysarthria rehabilitation programs and everyday communication with common sentences.

Problem

Research questions and friction points this paper is trying to address.

Addressing data scarcity in dysarthric speech recognition

Improving zero-shot and one-shot data augmentation methods

Improving sentence-level recognition for unseen dysarthric speakers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-coverage strategy for text-matching data synthesis

Enables zero-shot and one-shot dysarthric data augmentation

Improves speech recognition for unseen dysarthric speakers
