Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition

📅 2025-11-25
🤖 AI Summary
To address the fairness gap in automatic speech recognition (ASR), where systems perform markedly worse on low-resource languages than on high-resource ones, this paper proposes Latent Mixup, a data augmentation method based on latent-space interpolation. Latent Mixup mixes features across languages and accents within the intermediate hidden layers of self-supervised pretrained speech models (e.g., wav2vec 2.0), generating diverse and semantically coherent synthetic speech representations. Because it operates in the latent space rather than on raw waveforms or Mel-spectrograms, it avoids audio distortion while still allowing end-to-end optimization of ASR performance. Experiments across 12 low-resource languages, including indigenous languages from Africa and South America, show that Latent Mixup reduces word error rate (WER) by an average of 18.3% and outperforms conventional time-domain, frequency-domain, and label-shuffling augmentation techniques. The approach offers a scalable and robust route to more equitable ASR for under-resourced languages.
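The interpolation idea in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the standard mixup recipe (a mixing weight drawn from a Beta distribution and a convex combination of two feature sequences), uses random arrays in place of real wav2vec 2.0 hidden states, and the function name `latent_mixup`, the `alpha` value, and the truncation to a common length are all hypothetical choices made for illustration.

```python
import numpy as np


def latent_mixup(h_a, h_b, alpha=0.4, rng=None):
    """Interpolate two hidden-state sequences of shape (frames, dim).

    Hypothetical helper: the summary does not specify how the paper
    aligns sequences or samples the mixing weight.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: truncate to the shorter sequence so frames line up.
    t = min(h_a.shape[0], h_b.shape[0])
    # Standard mixup samples the weight from Beta(alpha, alpha).
    lam = float(rng.beta(alpha, alpha))
    # Convex combination of the two latent sequences.
    mixed = lam * h_a[:t] + (1.0 - lam) * h_b[:t]
    return mixed, lam


# Stand-ins for hidden states of two utterances (e.g., from different
# languages or accents); 768 is used only as an illustrative width.
rng = np.random.default_rng(0)
h_src = rng.standard_normal((50, 768))
h_tgt = rng.standard_normal((40, 768))

mixed, lam = latent_mixup(h_src, h_tgt, rng=rng)
print(mixed.shape, 0.0 <= lam <= 1.0)
```

How the transcripts of the two utterances are combined is not stated in the summary. One common choice in mixup-style training is to weight the two per-utterance losses by `lam` and `1 - lam`, but that is an assumption here rather than a detail taken from the paper.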

📝 Abstract
Modern machine learning models for audio tasks often exhibit superior performance on English and other well-resourced languages, primarily due to the abundance of available training data. This disparity leads to an unfair performance gap for low-resource languages, where data collection is both challenging and costly. In this work, we introduce a novel data augmentation technique for speech corpora designed to mitigate this gap. Through comprehensive experiments, we demonstrate that our method significantly improves the performance of automatic speech recognition systems on low-resource languages. Furthermore, we show that our approach outperforms existing augmentation strategies, offering a practical solution for enhancing speech technology in underrepresented linguistic communities.
Problem

Research questions and friction points this paper is trying to address.

Addressing performance disparity in speech recognition for low-resource languages
Developing data augmentation to reduce the language gap in speech technology
Improving automatic speech recognition for underrepresented linguistic communities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses latent mixup for synthetic voice diversity
Enhances speech recognition for low-resource languages
Outperforms existing data augmentation strategies
Wesley Bian
University of California Los Angeles, Department of Statistics, Los Angeles, United States of America
Xiaofeng Lin
PhD Candidate, Boston University
Sequential Decision Making, Robotics
Guang Cheng
University of California Los Angeles, Department of Statistics, Los Angeles, United States of America