🤖 AI Summary
To address the challenges of overfitting to rare words and susceptibility to synthetic audio artifacts in contextually biased models trained on synthetic data, this paper proposes a keyword-aware multi-task loss function that jointly optimizes a masked cross-entropy term (focused on target keywords) and a binary classification term for keyword positions. The method builds upon the Whisper architecture and the TCPGen framework, integrating synthetic data training, contextual biasing, and explicit keyword position prediction. Evaluated on the NSC Part 2 test set, the approach reduces word error rate from 29.71% to 11.81%, significantly improving rare-word recognition accuracy and decoding robustness. The core contribution lies in explicitly incorporating keyword localization into the loss design, thereby mitigating the bias and overfitting induced by synthetic data. This enables more reliable contextual adaptation without compromising generalization.
📝 Abstract
Rare word recognition can be improved by adapting ASR models to synthetic data that includes these words. Further improvements can be achieved through contextual biasing, which adds a trainable biasing module to the model architecture to prioritize rare words. While training the module on synthetic rare-word data is more effective than using non-rare-word data, it can lead to overfitting due to artifacts in the synthetic audio. To address this, we enhance the TCPGen-based contextual biasing approach and propose a keyword-aware loss function that additionally focuses on biased words when training biasing modules. This loss includes a masked cross-entropy term for biased word prediction and a binary classification term for detecting biased word positions. These two terms complementarily support the decoding of biased words during inference. By adapting Whisper to 10 hours of synthetic data, our method reduced the word error rate on the NSC Part 2 test set from 29.71% to 11.81%.
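The two terms of the keyword-aware loss can be sketched as follows. This is an illustrative reconstruction from the description above, not the paper's exact formulation: the function names, the per-position keyword mask, the averaging choices, and the weighting factor `lam` are assumptions for the sake of a minimal example.

```python
import math

def keyword_aware_loss(log_probs, targets, keyword_mask, position_logits, lam=1.0):
    """Sketch of a keyword-aware multi-task loss (illustrative, not the
    paper's exact formulation).

    log_probs:       [T][V] log-probabilities over the vocabulary per step
    targets:         [T] gold token ids
    keyword_mask:    [T] 1 if the step belongs to a biased (rare) keyword
    position_logits: [T] raw logits predicting keyword positions
    lam:             assumed weighting between the two terms
    """
    # Term 1: masked cross-entropy, summed only over keyword positions,
    # so the model is pushed to predict the biased words themselves.
    kw_steps = [t for t in range(len(targets)) if keyword_mask[t] == 1]
    masked_ce = -sum(log_probs[t][targets[t]] for t in kw_steps)
    masked_ce /= max(len(kw_steps), 1)

    # Term 2: binary cross-entropy on per-step keyword-position
    # predictions, teaching the model *where* biased words occur.
    bce = 0.0
    for t in range(len(targets)):
        p = 1.0 / (1.0 + math.exp(-position_logits[t]))  # sigmoid
        y = keyword_mask[t]
        bce -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    bce /= len(targets)

    return masked_ce + lam * bce
```

In a real system both terms would be computed over Whisper's decoder outputs during adaptation; the sketch only shows how masking the cross-entropy to keyword positions and adding a position-detection objective combine into one scalar loss.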