🤖 AI Summary
Existing trie-based biasing methods for ASR face two key bottlenecks in recognizing rare words: reliance on beam search and computationally expensive score rollback mechanisms—e.g., pre-scoring partial matches like "Bon" only to revoke the score later if the full word "Bonham" is not generated. This work proposes a trie biasing method enhanced with K-step lookahead prediction: the model predicts the next K tokens before committing to a decoding step, which eliminates score rollbacks entirely and significantly reduces decoding complexity—especially beneficial for large-parameter models. Built upon the Whisper architecture, the method requires only 10 hours of synthetic data for fine-tuning and integrates trie-based context-aware biasing efficiently. On the NSC Part 2 test set, it reduces word error rate from 30.86% to 12.19%, demonstrating substantial improvements in both accuracy and inference efficiency.
📝 Abstract
Contextual biasing improves rare word recognition of ASR models by prioritizing the output of rare words during decoding. A common approach is trie-based biasing, which gives "bonus scores" to partial hypotheses (e.g., "Bon") that may lead to the generation of a rare word (e.g., "Bonham"). If the full word ("Bonham") is not ultimately recognized, the system revokes those earlier bonuses. This revocation only works with beam search and is computationally expensive, particularly for models with large decoders. To overcome these limitations, we propose adapting ASR models to look ahead and predict multiple steps at once. This avoids the revocation step entirely by better estimating whether a partial hypothesis will lead to the generation of the full rare word. By fine-tuning Whisper with only 10 hours of synthetic data, our method reduces the word error rate on the NSC Part 2 test set from 30.86% to 12.19%.
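The contrast between the rollback baseline and the lookahead alternative can be sketched as follows. This is a toy character-level illustration under assumed names (`build_trie`, `rollback_step`, `lookahead_bonus`, a fixed `BONUS`), not the paper's implementation:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word_end = False

def build_trie(words):
    """Build a character trie over the biasing word list."""
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word_end = True
    return root

BONUS = 2.0  # illustrative per-token bonus; the real score is model-dependent

def rollback_step(root, state, token):
    """Baseline biasing: grant a bonus for each token that extends a trie
    prefix, but revoke all accumulated bonuses if the match breaks before
    reaching a word boundary."""
    node, accumulated = state
    child = node.children.get(token)
    if child is None:
        return (root, 0.0), -accumulated      # revocation
    if child.is_word_end:
        return (root, 0.0), BONUS             # word completed, bonuses kept
    return (child, accumulated + BONUS), BONUS

def lookahead_bonus(root, predicted_tokens):
    """Lookahead alternative: grant the bonus only when the K predicted
    tokens already reach a word boundary in the trie, so no bonus ever
    needs to be revoked."""
    node = root
    for tok in predicted_tokens:
        node = node.children.get(tok)
        if node is None:
            return 0.0
        if node.is_word_end:
            return BONUS
    return 0.0

root = build_trie(["bonham"])

# Rollback baseline: "bon" earns three bonuses that "x" then revokes.
state, total = (root, 0.0), 0.0
for tok in "bonx":
    state, delta = rollback_step(root, state, tok)
    total += delta
print(total)  # 0.0 — the partial-match bonuses were rolled back

# Lookahead: "bonx" never triggers a bonus; "bonham" does.
print(lookahead_bonus(root, "bonx"))    # 0.0
print(lookahead_bonus(root, "bonham"))  # 2.0
```

The baseline must carry per-hypothesis rollback state through beam search, whereas the lookahead check is a single trie walk over the predicted tokens, which is why it also works with greedy decoding.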