🤖 AI Summary
To address the poor robustness of speech recognition for dysarthric speech, the reliance of existing methods on text transcriptions and phoneme alignments, and their limited generalizability, this paper proposes a fully unsupervised dual-path framework jointly modeling prosody and acoustics. Leveraging wav2vec 2.0 self-supervised representations, it achieves alignment-free prosodic normalization; an adversarial acoustic conversion module further maps dysarthric speech to neurotypical acoustic characteristics, enabling compatibility with standard ASR systems. Crucially, the method requires no phoneme alignments, textual transcriptions, or speaker-specific priors, and generalizes effectively to unseen individuals and severely impaired speakers. Evaluated on the TORGO corpus, it significantly enhances performance of large pretrained ASR models—reducing word error rate by 23.6% for severely dysarthric speakers—without any ASR fine-tuning.
📝 Abstract
Automatic speech recognition (ASR) systems are well known to perform poorly on dysarthric speech. Previous works have addressed this by modifying the speaking rate to reduce the mismatch with typical speech. Unfortunately, these approaches rely on transcribed speech data to estimate speaking rates and phoneme durations, which may not be available for unseen speakers. Therefore, we combine unsupervised rhythm and voice conversion methods based on self-supervised speech representations to map dysarthric to typical speech. We evaluate the outputs with a large ASR model pre-trained on healthy speech, without further fine-tuning, and find that the proposed rhythm conversion especially improves performance for speakers of the Torgo corpus with more severe cases of dysarthria. Code and audio samples are available at https://idiap.github.io/RnV .
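The core idea of alignment-free rhythm conversion can be illustrated with a toy sketch: once speech frames are discretized into unit IDs (e.g., k-means clusters of wav2vec 2.0 features), speaker-specific timing shows up as uneven run lengths of repeated units, and normalizing those runs removes it without any phoneme alignment. The function name and the fixed target duration below are illustrative assumptions, not the paper's actual implementation.

```python
from itertools import groupby

def normalize_rhythm(units, target_dur=2):
    """Toy rhythm normalization: collapse consecutive repeated unit IDs
    (run-length encoding), then re-expand every unit to a fixed target
    duration. This discards speaker-specific durations while keeping the
    unit sequence (the phonetic content) intact. Illustrative only."""
    runs = [u for u, _ in groupby(units)]          # deduplicated unit sequence
    return [u for u in runs for _ in range(target_dur)]

# A slow, dysarthric-like rendition: units held for long, uneven stretches.
slow = [4, 4, 4, 4, 4, 7, 7, 7, 7, 7, 7, 2, 2]
print(normalize_rhythm(slow))  # [4, 4, 7, 7, 2, 2]
```

In the actual system, the normalized unit sequence would be resynthesized into audio (here via a vocoder, not shown) before being passed to the pre-trained ASR model.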