🤖 AI Summary
To address the challenges of overfitting to rare words and susceptibility to synthetic audio artifacts in contextually biased models trained on synthetic data, this paper proposes a keyword-aware multi-task loss function that jointly optimizes a masked cross-entropy term (focused on target keywords) and a binary classification term for keyword positions. The method builds upon the Whisper architecture and the TCPGen framework, integrating synthetic data training, contextual biasing, and explicit keyword position prediction. Evaluated on the NSC Part 2 test set, the approach reduces word error rate from 29.71% to 11.81%, significantly improving rare-word recognition accuracy and decoding robustness. The core contribution lies in explicitly incorporating keyword localization into the loss design, thereby mitigating the bias and overfitting induced by synthetic data. This enables more reliable contextual adaptation without compromising generalization.
📝 Abstract
Rare word recognition can be improved by adapting ASR models to synthetic data that includes these words. Further improvements can be achieved through contextual biasing, which adds a trainable biasing module to the model architecture to prioritize rare words. While training the module on synthetic rare-word data is more effective than using non-rare-word data, it can lead to overfitting due to artifacts in the synthetic audio. To address this, we enhance the TCPGen-based contextual biasing approach and propose a keyword-aware loss function that additionally focuses on biased words when training biasing modules. This loss includes a masked cross-entropy term for biased word prediction and a binary classification term for detecting biased word positions. These two terms complementarily support the decoding of biased words during inference. By adapting Whisper to 10 hours of synthetic data, our method reduced the word error rate on the NSC Part 2 test set from 29.71% to 11.81%.
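The two terms of the keyword-aware loss can be sketched as follows. This is an illustrative reconstruction from the description above, not the paper's exact formulation: the function names, the per-position keyword mask, the averaging choices, and the weighting factor `lam` are assumptions for the sake of a minimal example.

```python
import math

def keyword_aware_loss(log_probs, targets, keyword_mask, position_logits, lam=1.0):
    """Sketch of a keyword-aware multi-task loss (illustrative, not the
    paper's exact formulation).

    log_probs:       [T][V] log-probabilities over the vocabulary per step
    targets:         [T] gold token ids
    keyword_mask:    [T] 1 if the step belongs to a biased (rare) keyword
    position_logits: [T] raw logits predicting keyword positions
    lam:             assumed weighting between the two terms
    """
    # Term 1: masked cross-entropy, summed only over keyword positions,
    # so the model is pushed to predict the biased words themselves.
    kw_steps = [t for t in range(len(targets)) if keyword_mask[t] == 1]
    masked_ce = -sum(log_probs[t][targets[t]] for t in kw_steps)
    masked_ce /= max(len(kw_steps), 1)

    # Term 2: binary cross-entropy on per-step keyword-position
    # predictions, teaching the model *where* biased words occur.
    bce = 0.0
    for t in range(len(targets)):
        p = 1.0 / (1.0 + math.exp(-position_logits[t]))  # sigmoid
        y = keyword_mask[t]
        bce -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    bce /= len(targets)

    return masked_ce + lam * bce
```

In a real system both terms would be computed over Whisper's decoder outputs during adaptation; the sketch only shows how masking the cross-entropy to keyword positions and adding a position-detection objective combine into one scalar loss.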