🤖 AI Summary
To address the performance bottleneck that data scarcity imposes on speech recognition for dysarthric speakers of low-resource languages, this paper proposes a cross-lingual voice conversion (VC)-driven data augmentation method. Leveraging English dysarthric speech data, the authors fine-tune a VC model that jointly encodes speaker identity and prosodic distortions, enabling synthesis of target-language speech with realistic dysarthric characteristics from healthy non-English utterances. The generated data is then used to fine-tune the Massively Multilingual Speech (MMS) ASR model. This work is the first to explicitly model and controllably transfer dysarthric speech features via cross-lingual VC. Experiments on Spanish (PC-GITA), Italian (EasyCall), and Tamil (SSNCE) datasets demonstrate substantial improvements over the off-the-shelf MMS baseline and conventional augmentation methods such as speed and tempo perturbation. Both objective metrics and subjective evaluations confirm that the synthesized speech preserves pathological speech characteristics, effectively alleviating the data scarcity challenge for low-resource dysarthric ASR.
📝 Abstract
Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice conversion model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions, then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech. The generated data is then used to fine-tune a multilingual ASR model, Massively Multilingual Speech (MMS), for improved dysarthric speech recognition. Evaluation on PC-GITA (Spanish), EasyCall (Italian), and SSNCE (Tamil) demonstrates that VC with both speaker and prosody conversion significantly outperforms off-the-shelf MMS and conventional augmentation techniques such as speed and tempo perturbation. Objective and subjective analyses further confirm that the synthesized speech simulates dysarthric characteristics.
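As a rough illustration of the speed-perturbation baseline the abstract compares against, the sketch below resamples a waveform by a rate factor using linear interpolation. This is a minimal, illustrative implementation, not the authors' pipeline; production setups typically use sox-based effects (e.g. via torchaudio), and the perturbation factors shown are the conventional ones from the ASR augmentation literature, not values taken from this paper.

```python
import numpy as np

def speed_perturb(waveform: np.ndarray, factor: float) -> np.ndarray:
    """Resample a 1-D waveform by `factor` via linear interpolation.

    factor > 1.0 shortens the signal (faster playback, pitch raised);
    factor < 1.0 lengthens it (slower playback, pitch lowered).
    """
    n_out = int(round(len(waveform) / factor))
    # Positions in the original signal at which to sample the output.
    positions = np.linspace(0.0, len(waveform) - 1, n_out)
    return np.interp(positions, np.arange(len(waveform)), waveform)

# Conventional perturbation factors from the ASR augmentation literature
# (illustrative; the paper's exact settings are not given in the abstract).
factors = [0.9, 1.0, 1.1]
wave = np.sin(np.linspace(0, 100, 16000))  # 1 s of a toy signal at 16 kHz
augmented = [speed_perturb(wave, f) for f in factors]
```

Because simple resampling changes pitch along with duration, it simulates only a crude form of the rate variation seen in dysarthric speech, which is one motivation for the VC-based approach the paper proposes instead.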