Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech

📅 2026-03-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses scarce annotations, costly subjective evaluation, and poor generalization in dysarthric speech severity assessment by proposing a three-stage framework that scales training with unlabeled dysarthric speech and large-scale typical speech. A teacher model first generates pseudo-labels for the unlabeled samples; the model is then pretrained under weak supervision with a label-aware contrastive learning strategy, and finally fine-tuned for the downstream assessment task. Experiments on five unseen test sets spanning multiple etiologies and languages indicate robust generalization without additional human annotations. The Whisper-based baseline already significantly outperforms state-of-the-art predictors such as SpICE, and the full framework achieves an average Spearman’s rank correlation coefficient (SRCC) of 0.761 across the unseen test sets.
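The headline number is an SRCC averaged over several test sets. As a minimal sketch of how such a figure can be computed, the snippet below ranks predictions and reference severity scores (ties receive their mean rank), takes the Pearson correlation of the ranks, and averages the result across datasets. The function names and the dictionary layout are illustrative assumptions, not the authors' evaluation code.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def srcc(predicted, reference):
    """Spearman's rank correlation: Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant (undefined correlation).
    """
    rank_p, rank_r = average_ranks(predicted), average_ranks(reference)
    mean_p, mean_r = mean(rank_p), mean(rank_r)
    cov = sum((a - mean_p) * (b - mean_r) for a, b in zip(rank_p, rank_r))
    var_p = sum((a - mean_p) ** 2 for a in rank_p)
    var_r = sum((b - mean_r) ** 2 for b in rank_r)
    return cov / (var_p * var_r) ** 0.5


def average_srcc(datasets):
    """Average per-dataset SRCC; `datasets` maps a name to (predicted, reference)."""
    return mean(srcc(pred, ref) for pred, ref in datasets.values())
```

Averaging per-dataset correlations, rather than pooling all utterances into one correlation, keeps a large test set from dominating the score; whether the paper averages this way is an assumption based on the phrase "average SRCC".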

📝 Abstract

Dysarthric speech quality assessment (DSQA) is critical for clinical diagnostics and inclusive speech technologies. However, subjective evaluation is costly and difficult to scale, and the scarcity of labeled data limits robust objective modeling. To address this, we propose a three-stage framework that leverages unlabeled dysarthric speech and large-scale typical speech datasets to scale training. A teacher model first generates pseudo-labels for unlabeled samples, followed by weakly supervised pretraining using a label-aware contrastive learning strategy that exposes the model to diverse speakers and acoustic conditions. The pretrained model is then fine-tuned for the downstream DSQA task. Experiments on five unseen datasets spanning multiple etiologies and languages demonstrate the robustness of our approach. Our Whisper-based baseline significantly outperforms SOTA DSQA predictors such as SpICE, and the full framework achieves an average SRCC of 0.761 across unseen test datasets.
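The abstract does not spell out the label-aware contrastive objective, so the following is only a sketch of one plausible formulation: a supervised-contrastive-style loss in which two utterances count as a positive pair when their pseudo-labels are close. The `margin` and `temperature` parameters, the cosine similarity, and the function names are all assumptions made for illustration; the paper's actual loss may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def label_aware_contrastive_loss(embeddings, pseudo_labels,
                                 temperature=0.1, margin=0.5):
    """Hypothetical label-aware contrastive loss over one batch.

    Samples whose pseudo-labels differ by at most `margin` are positives;
    every other sample in the batch acts as a negative.
    """
    n = len(embeddings)
    per_anchor = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others
                     if abs(pseudo_labels[j] - pseudo_labels[i]) <= margin]
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = {j: cosine(embeddings[i], embeddings[j]) / temperature
                  for j in others}
        # Numerically stable log-sum-exp over all non-anchor samples.
        top = max(logits.values())
        log_denominator = top + math.log(
            sum(math.exp(value - top) for value in logits.values()))
        per_anchor.append(
            -sum(logits[j] - log_denominator for j in positives) / len(positives))
    return sum(per_anchor) / len(per_anchor) if per_anchor else 0.0
```

Under this formulation the loss is small when embeddings of utterances with similar pseudo-labelled severity sit close together, and large when the embedding geometry disagrees with the pseudo-labels.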

Problem

Research questions and friction points this paper is trying to address.

Dysarthric speech

Severity estimation

Data scarcity

Objective assessment

Speech quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

Data augmentation

Pseudo-labeling

Label-aware contrastive learning

Dysarthric speech severity estimation

Zero-shot generalization
