🤖 AI Summary
Existing context-biasing methods often require additional training, incur decoding latency, or lack compatibility with diverse ASR architectures. This paper proposes a universal, training-free shallow fusion framework compatible with the mainstream ASR architectures: CTC, Transducer, and attention-based encoder-decoder models. Its core innovation is a GPU-accelerated word boosting tree, which enables efficient biasing over keyword lists of up to 20,000 entries while preserving near-native decoding speed and significantly improving recognition accuracy for key phrases. The method supports both greedy and beam search without modification and has been integrated into the NeMo toolkit. Experiments across multiple ASR systems show that the approach outperforms existing open-source biasing solutions in both accuracy and decoding efficiency.
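To make the boosting-tree idea concrete, here is a minimal sketch of a prefix tree (trie) over keyword token sequences that adds a per-token bonus to hypotheses during shallow fusion. All names and the scoring scheme (`BoostTree`, a flat per-token `boost`) are illustrative assumptions, not NeMo's actual implementation, which additionally runs the matching on the GPU in batch.

```python
# Illustrative sketch of trie-based shallow fusion word boosting.
# Names and the flat per-token bonus are assumptions, not NeMo's API.

class BoostTree:
    """Prefix tree over keyword token sequences with a per-token bonus."""

    def __init__(self, boost=2.0):
        self.root = {}
        self.boost = boost

    def add(self, tokens):
        """Insert one key phrase, given as a sequence of (sub)word tokens."""
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})
        node["$"] = True  # mark end of a key phrase

    def score(self, hypothesis):
        """Total bonus for a hypothesis: each token that extends a
        partial keyword match (or starts a new one) earns `boost`."""
        total, node = 0.0, self.root
        for t in hypothesis:
            if t in node:          # extend the current partial match
                node = node[t]
                total += self.boost
            elif t in self.root:   # restart a match at this token
                node = self.root[t]
                total += self.boost
            else:                  # no match: fall back to the root
                node = self.root
        return total

tree = BoostTree(boost=2.0)
tree.add(["ne", "mo"])     # e.g. subword tokens for "NeMo"
tree.add(["c", "t", "c"])

# During decoding, the biased score of a hypothesis would be:
#   combined = log_p_asr(hyp_tokens) + tree.score(hyp_tokens)
print(tree.score(["ne", "mo"]))        # 4.0 (two matched tokens)
print(tree.score(["hello", "world"]))  # 0.0 (no keyword prefix)
```

In a real decoder the tree state would be carried per hypothesis and updated incrementally at each step rather than rescored from scratch, which is what allows the paper's approach to keep decoding speed intact even with ~20K key phrases.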
📝 Abstract
Recognizing specific key phrases is an essential task for contextualized Automatic Speech Recognition (ASR). However, most existing context-biasing approaches either require additional model training, significantly slow down the decoding process, or constrain the choice of the ASR system type. This paper proposes a universal ASR context-biasing framework that supports all major model types: CTC, Transducer, and Attention Encoder-Decoder models. The framework is based on a GPU-accelerated word boosting tree, which enables its use in shallow fusion mode for greedy and beam search decoding without noticeable speed degradation, even with a vast number of key phrases (up to 20K items). The obtained results show the high efficiency of the proposed method, which surpasses the considered open-source context-biasing approaches in accuracy and decoding speed. Our context-biasing framework is open-sourced as a part of the NeMo toolkit.