NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding

📅 2025-05-28
🤖 AI Summary
To address the low GPU parallelization efficiency and high industrial deployment cost of conventional n-gram models in ASR context biasing, this paper proposes the first general-purpose, low-overhead (<7%) greedy decoding framework for context biasing. Methodologically, it reformulates n-grams into a GPU-friendly compact index structure, introduces a lightweight bias integration mechanism, and achieves cross-architecture decoder compatibility with mainstream ASR models, including transducer, attention-based encoder-decoder, and CTC architectures. Experiments demonstrate that the framework bridges over 50% of the accuracy gap between greedy decoding and beam search in out-of-domain scenarios, significantly reduces latency, and maintains high model agnosticism and deployment simplicity. The implementation is open-sourced.
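The "GPU-friendly compact index structure" mentioned above can be illustrated with a toy flat-array bigram store: sorted keys plus a binary search per query is exactly the kind of branch-light, data-parallel lookup that batches well on GPU. This is a minimal CPU sketch under assumed names (`FlatBigramLM`, stupid-backoff-style fallback), not the paper's actual data structure.

```python
import bisect

class FlatBigramLM:
    """Toy bigram LM stored as flat sorted arrays.

    One packed integer key per (context, token) pair; lookup is a
    binary search, which maps naturally to one GPU thread per query.
    Illustrative sketch only, not the NGPU-LM implementation.
    """

    def __init__(self, bigram_logprobs, unigram_logprobs, backoff):
        # Vocabulary size inferred from the largest token id seen.
        self.vocab = max(max(t for pair in bigram_logprobs for t in pair),
                         max(unigram_logprobs)) + 1
        # Pack (ctx, tok) into a single sortable integer key.
        items = sorted((ctx * self.vocab + tok, lp)
                       for (ctx, tok), lp in bigram_logprobs.items())
        self.keys = [k for k, _ in items]
        self.vals = [v for _, v in items]
        self.unigram = unigram_logprobs  # fallback scores per token
        self.backoff = backoff           # backoff weight per context

    def score(self, ctx, tok):
        """log P(tok | ctx), backing off to the unigram score on a miss."""
        key = ctx * self.vocab + tok
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.vals[i]
        return self.backoff.get(ctx, 0.0) + self.unigram.get(tok, -10.0)
```

On GPU, the same layout would be queried for a whole batch of hypotheses at once; the flat arrays avoid the pointer chasing of a conventional trie.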

📝 Abstract
Statistical n-gram language models are widely used for context-biasing tasks in Automatic Speech Recognition (ASR). However, existing implementations lack computational efficiency due to poor parallelization, making context-biasing less appealing for industrial use. This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types - including transducers, attention encoder-decoder models, and CTC - with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search. The implementation of the proposed NGPU-LM is open-sourced.
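One way to read the "customizable greedy decoding" above is shallow fusion: at each step the LM's log-probabilities, scaled by a weight, are added to the ASR model's log-probabilities before the argmax. A minimal per-step sketch with a hypothetical `biased_greedy_step` helper (NGPU-LM performs this batched on GPU across all major decoder types):

```python
import math

def biased_greedy_step(asr_log_probs, lm_scores, lm_weight=0.5):
    """One greedy decoding step with n-gram shallow fusion.

    Picks argmax of (ASR log-prob + lm_weight * LM log-prob) over the
    vocabulary. Hypothetical single-utterance helper for illustration;
    the real system vectorizes this over a batch on GPU.
    """
    best_tok, best_score = None, -math.inf
    for tok, (ap, lp) in enumerate(zip(asr_log_probs, lm_scores)):
        s = ap + lm_weight * lp
        if s > best_score:
            best_tok, best_score = tok, s
    return best_tok
```

Setting `lm_weight=0` recovers plain greedy decoding, which is why the mechanism stays within a few percent of its overhead: the extra work is one lookup and one fused add per vocabulary entry.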
Problem

Research questions and friction points this paper is trying to address.

Brings GPU parallelization to n-gram language models for ASR context-biasing
Reduces the computational overhead of context-biasing in greedy decoding
Narrows the accuracy gap between greedy decoding and beam search
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPU-accelerated n-gram language model
Customizable greedy decoding for ASR
Open-source implementation with under 7% computational overhead