Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the data-scarcity bottleneck in dysarthric speech recognition for low-resource languages, this paper proposes a cross-lingual voice conversion (VC)-driven data augmentation method. Leveraging English dysarthric speech data, the authors develop a VC model that jointly models speaker identity and prosodic distortion, enabling synthesis of target-language speech with realistic dysarthric characteristics from healthy non-English utterances. The generated data is used to fine-tune the Massively Multilingual Speech (MMS) ASR model. This work is the first to explicitly model and controllably transfer dysarthric speech characteristics via cross-lingual VC. Experiments on Spanish (PC-GITA), Italian (EasyCall), and Tamil (SSNCE) demonstrate substantial improvements over the baseline MMS model and conventional augmentation methods such as speed and tempo perturbation. Both objective metrics and subjective evaluations confirm that the synthesized speech preserves pathological speech characteristics, effectively alleviating data scarcity for low-resource dysarthric ASR.

📝 Abstract
Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice conversion model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions, then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech. The generated data is then used to fine-tune a multilingual ASR model, Massively Multilingual Speech (MMS), for improved dysarthric speech recognition. Evaluation on PC-GITA (Spanish), EasyCall (Italian), and SSNCE (Tamil) demonstrates that VC with both speaker and prosody conversion significantly outperforms the off-the-shelf MMS performance and conventional augmentation techniques such as speed and tempo perturbation. Objective and subjective analyses of the generated data further confirm that the generated speech simulates dysarthric characteristics.
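For reference, the speed-perturbation baseline that the paper compares its VC-based augmentation against can be sketched minimally in NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: real pipelines typically use sox-style resampling, and `speed_perturb` is a hypothetical helper name.

```python
import numpy as np

def speed_perturb(wav: np.ndarray, factor: float) -> np.ndarray:
    """Resample a waveform by `factor` via linear interpolation.

    factor > 1 shortens the signal (faster speech, higher pitch);
    factor < 1 lengthens it. A toy stand-in for the classic
    speed-perturbation augmentation baseline.
    """
    n_out = int(round(len(wav) / factor))
    # Positions in the original signal to sample at.
    src = np.linspace(0, len(wav) - 1, n_out)
    return np.interp(src, np.arange(len(wav)), wav)

# 1 second of a 440 Hz tone at 16 kHz as a stand-in utterance.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)

slow = speed_perturb(tone, 0.9)  # ~1.11 s, lower perceived pitch
fast = speed_perturb(tone, 1.1)  # ~0.91 s, higher perceived pitch
```

Unlike such signal-level perturbations, which only stretch or compress the waveform, the paper's VC approach transfers learned speaker and prosody characteristics of dysarthric speech, which is why it is reported to outperform these baselines.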
Problem

Research questions and friction points this paper is trying to address.

Improving ASR for dysarthric speech in low-resource languages
Generating synthetic dysarthric-like speech using voice conversion
Enhancing multilingual ASR performance for dysarthric speakers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tune voice conversion model for dysarthric speech
Convert healthy speech to dysarthric-like speech
Use generated data to improve multilingual ASR
Chin-Jou Li
Carnegie Mellon University, USA
Eunjung Yeo
Carnegie Mellon University, USA
Kwanghee Choi
University of Texas at Austin
Speech, Machine Learning, Computational Linguistics
Paula Andrea Pérez-Toro
FAU Erlangen-Nürnberg, Germany; Universidad de Antioquia, Colombia
Masao Someki
Carnegie Mellon University
Speech Processing
Rohan Kumar Das
Fortemedia Singapore
Speech Processing, Speaker Verification, Anti-spoofing, Deep Learning, Human-Computer Interaction
Zhengjun Yue
Assistant Professor, TU Delft
Speech Technology for Healthcare, Pathological Speech Recognition
Juan Rafael Orozco-Arroyave
FAU Erlangen-Nürnberg, Germany; Universidad de Antioquia, Colombia
Elmar Nöth
FAU Erlangen-Nürnberg, Germany
David R. Mortensen
Carnegie Mellon University, USA