persoDA: Personalized Data Augmentation for Personalized ASR

📅 2025-01-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Generic data augmentation does not adapt to an individual user's acoustic characteristics, which limits its value for personalized automatic speech recognition (ASR) on mobile devices. This paper proposes persoDA, a personalized data augmentation method driven by the user's own speech. Using a Conformer-based ASR baseline, it models user-specific acoustic characteristics from a small amount of speaker-specific utterances and synthesizes matching reverberation and noise, so that augmentation is adapted to each user. Unlike conventional Multi-Condition Training (MCT), which applies random reverberation and noise from a fixed, shared policy, the approach customizes augmentation per user. With a baseline trained on LibriSpeech and personalized for VOICES, persoDA achieves a 13.9% relative word error rate (WER) reduction compared with standard augmentation and converges 16% to 20% faster than MCT.
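The headline figure is a relative reduction, not an absolute one, so a short worked example may help when reading it. The sketch below shows the arithmetic only; the function name `relative_wer_reduction` and the two WER values are hypothetical and chosen purely for illustration, since the summary does not report absolute WERs.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent), NOT taken from the paper.
baseline = 10.0      # assumed WER with standard augmentation
personalized = 8.61  # assumed WER with personalized augmentation

print(f"{relative_wer_reduction(baseline, personalized):.1%}")
```

With these assumed values, an absolute drop of 1.39 WER points corresponds to a relative reduction of about 13.9%.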

📝 Abstract

Data augmentation (DA) is ubiquitously used in the training of Automatic Speech Recognition (ASR) models. DA offers increased data variability, robustness, and generalization against different acoustic distortions. Recently, personalization of ASR models on mobile devices has been shown to improve Word Error Rate (WER). This paper evaluates data augmentation in this context and proposes persoDA, a DA method driven by the user's data that is used to personalize ASR. persoDA aims to augment training with data specifically tuned towards the acoustic characteristics of the end user, as opposed to standard augmentation based on Multi-Condition Training (MCT), which applies random reverberation and noise. Our evaluation with a Conformer-based ASR baseline trained on LibriSpeech and personalized for VOICES shows that persoDA achieves a 13.9% relative WER reduction over standard data augmentation (random noise and reverberation). Furthermore, persoDA converges 16% to 20% faster than MCT.
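For readers unfamiliar with the MCT baseline that the abstract contrasts against, the following is a minimal sketch of random reverberation-and-noise augmentation, assuming NumPy and in-memory waveforms. It is not the authors' implementation: the function names, the SNR range, and the uniform random selection are all illustrative assumptions. How persoDA picks user-matched conditions is not described in the abstract, so that step is not sketched here.

```python
import numpy as np


def add_reverb(speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve speech with a room impulse response, trimmed to the input length."""
    return np.convolve(speech, rir, mode="full")[: len(speech)]


def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech so the result has the requested SNR in dB."""
    noise = np.resize(noise, len(speech))  # repeat or trim noise to match length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that speech_power / scaled_noise_power == 10^(snr_db/10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


def mct_augment(speech, rirs, noises, rng, snr_range=(5.0, 20.0)):
    """MCT-style augmentation: random RIR, random noise clip, random SNR.

    The SNR range is an assumed default, not a value from the paper.
    """
    rir = rirs[rng.integers(len(rirs))]
    noise = noises[rng.integers(len(noises))]
    snr_db = rng.uniform(*snr_range)
    return add_noise(add_reverb(speech, rir), noise, snr_db)


# Usage with synthetic signals standing in for real recordings.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
rirs = [np.array([1.0, 0.0, 0.5, 0.25])]   # toy impulse response
noises = [rng.standard_normal(4000)]
augmented = mct_augment(speech, rirs, noises, rng)
```

A personalized variant would presumably replace the uniform random choices with conditions estimated from the user's own recordings, but the details of that selection are specific to the paper.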

Problem

Research questions and friction points this paper is trying to address.

Automatic Speech Recognition

Personalization

Accuracy and Efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Personalized Data Augmentation

Automatic Speech Recognition (ASR)

Training Efficiency

🔎 Similar Papers

Personalized Speech Recognition for Children with Test-Time Adaptation