persoDA: Personalized Data Augmentation for Personalized ASR

📅 2025-01-15
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Generic data augmentation techniques do not adapt to individual users' acoustic characteristics, which limits their value for personalized automatic speech recognition (ASR) on mobile devices. To address this, the paper proposes a personalized data augmentation method driven by the user's own speech. Evaluated with a Conformer-based ASR model, it models user-specific acoustic features from a small number of speaker-specific utterances and synthesizes personalized reverberation and noise accordingly, enabling end-to-end user-adaptive augmentation. Unlike conventional multi-condition training (MCT), which relies on fixed augmentation policies shared across users, the approach customizes data augmentation per user. On the VOICES dataset, the method achieves a 13.9% relative reduction in word error rate (WER) compared to standard augmentation and converges 16–20% faster than MCT, improving both accuracy and training efficiency for personalized ASR.

📝 Abstract
Data augmentation (DA) is ubiquitous in the training of Automatic Speech Recognition (ASR) models. DA offers increased data variability, robustness, and generalization against different acoustic distortions. Recently, personalization of ASR models on mobile devices has been shown to improve Word Error Rate (WER). This paper evaluates data augmentation in this context and proposes persoDA, a DA method driven by the user's data and used to personalize ASR. persoDA aims to augment training with data specifically tuned towards the acoustic characteristics of the end user, as opposed to standard augmentation based on Multi-Condition Training (MCT), which applies random reverberation and noise. Our evaluation with a Conformer-based ASR baseline trained on Librispeech and personalized for VOICES shows that persoDA achieves a 13.9% relative WER reduction over standard data augmentation (random noise and reverberation). Furthermore, persoDA converges 16% to 20% faster than MCT.
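To make the baseline concrete, the following is a minimal sketch of the standard MCT-style augmentation the abstract contrasts against: a randomly chosen room impulse response is convolved with the speech, then randomly chosen noise is mixed in at a random SNR. It is not the authors' implementation, and it does not show persoDA itself, whose procedure the abstract does not detail. All function names, the SNR range, and the use of NumPy arrays for waveforms are assumptions made for illustration. The last helper spells out the usual definition of relative WER reduction.

```python
import numpy as np


def add_reverb(speech, rir):
    """Convolve speech with a room impulse response, trimmed to the input length."""
    return np.convolve(speech, rir, mode="full")[: len(speech)]


def add_noise(speech, noise, snr_db):
    """Mix noise into speech at the requested signal-to-noise ratio in dB."""
    # Repeat or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = max(np.mean(noise ** 2), 1e-12)  # guard against silent noise
    # Choose scale so that speech_power / (scale^2 * noise_power) = 10^(snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


def mct_augment(speech, rirs, noises, rng, snr_range=(5.0, 20.0)):
    """Hypothetical MCT step: random RIR, random noise, random SNR (assumed range)."""
    rir = rirs[rng.integers(len(rirs))]
    noise = noises[rng.integers(len(noises))]
    snr_db = rng.uniform(*snr_range)
    return add_noise(add_reverb(speech, rir), noise, snr_db)


def relative_wer_reduction(wer_baseline, wer_new):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (wer_baseline - wer_new) / wer_baseline
```

Under this framing, a personalized variant would replace the random draws in `mct_augment` with reverberation and noise chosen to match the end user's acoustic conditions; how persoDA makes that choice is left to the paper.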
Problem

Research questions and friction points this paper is trying to address.

Automatic Speech Recognition
Personalization
Accuracy and Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Personalized Data Augmentation
Automatic Speech Recognition (ASR)
Training Efficiency
Pablo Peso Parada
AI Researcher - Samsung Research UK
signal processing, machine learning, open source hardware, audio, speech
S. Fontalis
Centre for Research and Technology Hellas, Greece
Md Asif Jalal
Machine Learning researcher
Machine Learning, ASR, Speech Processing, Affective Computing, Generative AI
Karthikeyan Saravanan
Samsung R&D Institute UK (SRUK), United Kingdom
Anastasios Drosou
CERTH-ITI
Mete Ozay
Samsung R&D Institute UK (SRUK), United Kingdom
Gil Ho Lee
Language AI R&D Group (MX), Samsung Electronics, South Korea
Jungin Lee
Language AI R&D Group (MX), Samsung Electronics, South Korea
Seokyeong Jung
Language AI R&D Group (MX), Samsung Electronics, South Korea