An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR

📅 2025-03-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Data augmentation for ASR with synthetic speech often yields limited gains because synthetic speech is less diverse than real speech. To address this, the work systematically evaluates augmentation strategies built on flow-based TTS (Flowtron) and voice conversion (an AutoVC variant) that allow manipulation of several speech attributes, including pitch, speaker identity, and rhythm, and measures their impact on the word error rate of Conformer-Transducer ASR models. Pitch augmentation and VC-based speaker augmentation prove ineffective in isolation; jointly augmenting the remaining attributes, however, significantly improves generalization. On Common Voice and LibriSpeech, the approach achieves relative WER reductions of 11% and up to 35%, respectively, over training on real data only, demonstrating that high-diversity synthetic data can substantially benefit ASR training and establishing a paradigm for controllable, attribute-aware speech augmentation.

📝 Abstract
Augmenting the training data of automatic speech recognition (ASR) systems with synthetic data generated by text-to-speech (TTS) or voice conversion (VC) has gained popularity in recent years. Several works have demonstrated improvements in ASR performance using this augmentation approach. However, because of the lower diversity of synthetic speech, naively combining synthetic and real data often does not yield the best results. In this work, we leverage recently proposed flow-based TTS/VC models allowing greater speech diversity, and assess the respective impact of augmenting various speech attributes on the word error rate (WER) achieved by several ASR models. Pitch augmentation and VC-based speaker augmentation are found to be ineffective in our setup. Jointly augmenting all other attributes reduces the WER of a Conformer-Transducer model by 11% relative on Common Voice and by up to 35% relative on LibriSpeech compared to training on real data only.
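The relative WER reductions quoted above (11% and up to 35%) follow the usual definition of relative improvement over a baseline. A minimal sketch, with illustrative numbers that are not taken from the paper:

```python
def relative_wer_reduction(wer_baseline: float, wer_augmented: float) -> float:
    """Relative WER reduction (in %) of an augmented model over a baseline."""
    return 100.0 * (wer_baseline - wer_augmented) / wer_baseline

# Hypothetical example: a baseline WER of 10.0% dropping to 8.9%
# corresponds to an 11% relative reduction.
print(round(relative_wer_reduction(10.0, 8.9), 1))  # → 11.0
```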
Problem

Research questions and friction points this paper is trying to address.

Evaluates TTS and VC data augmentation for ASR systems.
Assesses impact of speech attribute augmentation on WER.
Explores joint augmentation to reduce WER in ASR models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow-based TTS/VC models enhance speech diversity.
Joint augmentation of multiple attributes reduces WER.
Pitch and speaker augmentation found ineffective.
Sewade Ogun
IEEE
Vincent Colotte
IEEE
Emmanuel Vincent
Senior Research Scientist, Inria
speech & audio