🤖 AI Summary
To address the performance bottleneck that data scarcity imposes on speech recognition for dysarthric speakers of low-resource languages, this paper proposes a cross-lingual voice conversion (VC)-driven data augmentation method. Leveraging English dysarthric speech data, the authors fine-tune a VC model that jointly encodes speaker identity and prosodic distortions, enabling synthesis of target-language speech with realistic dysarthric characteristics from healthy non-English utterances. The generated data is then used to fine-tune the Massively Multilingual Speech (MMS) ASR model. This work is the first to explicitly model and controllably transfer dysarthric speech features via cross-lingual VC. Experiments on Spanish (PC-GITA), Italian (EasyCall), and Tamil (SSNCE) datasets demonstrate substantial improvements over the off-the-shelf MMS baseline and conventional augmentation methods such as speed and tempo perturbation. Both objective metrics and subjective evaluations confirm that the synthesized speech preserves pathological speech characteristics, effectively alleviating the data scarcity challenge for low-resource dysarthric ASR.
📝 Abstract
Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice conversion model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions, then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech. The generated data is then used to fine-tune a multilingual ASR model, Massively Multilingual Speech (MMS), for improved dysarthric speech recognition. Evaluation on PC-GITA (Spanish), EasyCall (Italian), and SSNCE (Tamil) demonstrates that VC with both speaker and prosody conversion significantly outperforms off-the-shelf MMS and conventional augmentation techniques such as speed and tempo perturbation. Objective and subjective analyses further confirm that the synthesized speech simulates dysarthric characteristics.
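As a rough illustration of the speed-perturbation baseline the abstract compares against, the sketch below resamples a waveform by a rate factor using linear interpolation. This is a minimal, illustrative implementation, not the authors' pipeline; production setups typically use sox-based effects (e.g. via torchaudio), and the perturbation factors shown are the conventional ones from the ASR augmentation literature, not values taken from this paper.

```python
import numpy as np

def speed_perturb(waveform: np.ndarray, factor: float) -> np.ndarray:
    """Resample a 1-D waveform by `factor` via linear interpolation.

    factor > 1.0 shortens the signal (faster playback, pitch raised);
    factor < 1.0 lengthens it (slower playback, pitch lowered).
    """
    n_out = int(round(len(waveform) / factor))
    # Positions in the original signal at which to sample the output.
    positions = np.linspace(0.0, len(waveform) - 1, n_out)
    return np.interp(positions, np.arange(len(waveform)), waveform)

# Conventional perturbation factors from the ASR augmentation literature
# (illustrative; the paper's exact settings are not given in the abstract).
factors = [0.9, 1.0, 1.1]
wave = np.sin(np.linspace(0, 100, 16000))  # 1 s of a toy signal at 16 kHz
augmented = [speed_perturb(wave, f) for f in factors]
```

Because simple resampling changes pitch along with duration, it simulates only a crude form of the rate variation seen in dysarthric speech, which is one motivation for the VC-based approach the paper proposes instead.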