🤖 AI Summary
To address the poor GPU parallelization and high industrial deployment cost of conventional n-gram models in ASR context biasing, this paper proposes the first general-purpose, low-overhead (<7%) greedy decoding framework for context biasing. Methodologically, it reformulates n-grams into a GPU-friendly compact index structure, introduces a lightweight bias-integration mechanism, and achieves cross-architecture decoder compatibility with mainstream ASR models, including transducer, attention-based encoder-decoder, and CTC architectures. Experiments demonstrate that the framework closes over 50% of the accuracy gap between greedy decoding and beam search in out-of-domain scenarios and avoids the latency penalty of beam search, while remaining model-agnostic and simple to deploy. The implementation is open-sourced.
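To make the "GPU-friendly compact index" idea concrete, here is a minimal sketch of one way such an index could look: bigram statistics packed into two flat, sorted tensors so that all LM lookups for a batch become a single batched binary search on the GPU. This is an illustration under stated assumptions (PyTorch backend, bigrams only, a fixed backoff score); `FlatBigramLM` and its fields are hypothetical names, not the actual NGPU-LM data structure.

```python
import torch

class FlatBigramLM:
    """Toy compact index: bigram log-probs packed into two flat, sorted
    1-D tensors so that LM queries become one batched binary search."""

    def __init__(self, bigram_logprobs, vocab_size, backoff_logprob=-10.0):
        # Fuse (context_token, next_token) into a single integer key;
        # sorting the keys lets torch.searchsorted locate any entry.
        keys = torch.tensor(
            [ctx * vocab_size + nxt for (ctx, nxt) in bigram_logprobs],
            dtype=torch.long,
        )
        vals = torch.tensor([bigram_logprobs[k] for k in bigram_logprobs])
        order = torch.argsort(keys)
        self.keys, self.vals = keys[order], vals[order]
        self.vocab_size = vocab_size
        self.backoff = backoff_logprob  # crude stand-in for real backoff

    def scores(self, last_tokens):
        """log P(next | last) for every hypothesis in the batch and every
        vocabulary entry, via one searchsorted call (no Python loop).
        Assumes self.keys/self.vals are on the same device as last_tokens."""
        nxt = torch.arange(self.vocab_size, device=last_tokens.device)
        query = (last_tokens[:, None] * self.vocab_size + nxt[None, :]).reshape(-1)
        pos = torch.searchsorted(self.keys, query).clamp(max=self.keys.numel() - 1)
        hit = self.keys[pos] == query
        found = self.vals[pos]
        out = torch.where(hit, found, torch.full_like(found, self.backoff))
        return out.view(-1, self.vocab_size)

# Usage: scores for a batch of two hypotheses whose last token is 5.
lm = FlatBigramLM({(5, 7): -0.2, (5, 9): -1.6}, vocab_size=32)
lm_scores = lm.scores(torch.tensor([5, 5]))  # shape (2, 32)
```

The design choice being illustrated: replacing pointer-chasing trie traversal with sorted flat tensors turns per-hypothesis lookups into one data-parallel `searchsorted`, which is what keeps the per-step overhead small on a GPU.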
📖 Abstract
Statistical n-gram language models are widely used for context-biasing tasks in Automatic Speech Recognition (ASR). However, existing implementations lack computational efficiency due to poor parallelization, making context-biasing less appealing for industrial use. This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types - including transducers, attention encoder-decoder models, and CTC - with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search. The implementation of the proposed NGPU-LM is open-sourced.
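As a hedged illustration of how such LM scores could plug into greedy decoding, the sketch below fuses per-step ASR log-probabilities with weighted LM scores before the argmax, in the style of shallow fusion. `biased_greedy_step` and `lm_weight` are hypothetical names, and the actual NGPU-LM integration (e.g., blank handling for CTC and transducer decoding) is more involved than this single step.

```python
import torch

def biased_greedy_step(asr_logprobs, lm_scores, lm_weight=0.5):
    """One greedy decoding step with shallow-fusion-style biasing.

    asr_logprobs: (B, V) log-probabilities from the ASR decoder step
    lm_scores:    (B, V) n-gram LM scores, e.g. from FlatBigramLM.scores
    Returns:      (B,) next-token id per hypothesis
    """
    # Fused scores stay on the GPU; one add and one argmax per step is
    # why the overhead stays close to plain greedy decoding.
    fused = asr_logprobs + lm_weight * lm_scores
    return fused.argmax(dim=-1)
```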