🤖 AI Summary
Transducer models achieve state-of-the-art performance in end-to-end automatic speech recognition (ASR), but standard beam search decoding substantially increases inference latency. This work proposes the first general-purpose beam search acceleration framework for Transducers, unifying two efficient algorithms: ALSD++ and AES++. Key contributions include: (1) a tree-structured hypothesis representation enabling compact encoder-decoder state management; (2) an improved blank token scoring mechanism to enhance shallow fusion effectiveness; and (3) end-to-end GPU optimization via CUDA Graph integration and batched tensor operations. Experiments demonstrate that the accelerated beam search attains 80–90% of greedy decoding speed while reducing word error rate (WER) by 14–30% relative to greedy decoding. In low-resource settings, shallow fusion yields up to 11% WER improvement. The complete implementation is open-sourced.
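The tree-structured hypothesis representation mentioned above can be illustrated with a minimal sketch. This is not the paper's actual implementation; all names (`HypNode`, `extend`, `backtrace`) are illustrative. The idea is that each beam hypothesis stores only its last token plus a parent pointer, so shared prefixes are stored once and a full transcript is recovered by walking back to the root:

```python
# Hedged sketch of a prefix-tree hypothesis store for Transducer beam search.
# Each node holds one token and a parent pointer; extending a hypothesis
# creates a child node instead of copying the whole token sequence.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HypNode:
    token: Optional[int]          # last emitted token (None for the root)
    parent: Optional["HypNode"]   # previous hypothesis node
    score: float                  # cumulative log-probability


def extend(node: HypNode, token: int, logp: float) -> HypNode:
    """Create a child hypothesis; the shared prefix is referenced, not copied."""
    return HypNode(token=token, parent=node, score=node.score + logp)


def backtrace(node: HypNode) -> List[int]:
    """Recover the full token sequence by walking parent pointers to the root."""
    tokens: List[int] = []
    while node.parent is not None:
        tokens.append(node.token)
        node = node.parent
    return tokens[::-1]


# usage: two hypotheses sharing the prefix [7]
root = HypNode(token=None, parent=None, score=0.0)
h1 = extend(root, 7, -0.1)
h2a = extend(h1, 3, -0.5)
h2b = extend(h1, 4, -0.7)
print(backtrace(h2a))  # [7, 3]
print(backtrace(h2b))  # [7, 4]
```

This layout keeps per-hypothesis memory constant as hypotheses grow, which also makes it easier to keep decoder states in compact batched tensors indexed by node.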
📝 Abstract
Transducer models have emerged as a promising choice for end-to-end ASR systems, offering a balanced trade-off between recognition accuracy, streaming capability, and inference speed in greedy decoding. However, beam search significantly slows down Transducers due to repeated evaluations of key network components, limiting practical applications. This paper introduces a universal method to accelerate beam search for Transducers, enabling the implementation of two optimized algorithms: ALSD++ and AES++. The proposed method utilizes batched operations, a tree-based hypothesis structure, a novel blank scoring mechanism for enhanced shallow fusion, and CUDA Graph execution for efficient GPU inference. This narrows the speed gap between beam and greedy modes to only 10–20% for the whole system, achieves a 14–30% relative improvement in WER compared to greedy decoding, and improves shallow fusion for low-resource languages by up to 11% compared to existing implementations. All the algorithms are open-sourced.
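To make the shallow-fusion setting concrete, here is a minimal sketch of per-step score fusion for a Transducer, showing why the blank token needs special scoring. This is the standard fusion formulation, not necessarily the paper's improved blank scoring; the constant `BLANK`, the function `fuse_scores`, and the toy probabilities are illustrative assumptions. An external language model assigns scores only to real tokens, so its contribution is added to non-blank log-probabilities only, which can bias the fused distribution toward blank:

```python
# Hedged sketch of shallow fusion at one Transducer decoding step.
# The LM has no notion of a blank symbol, so only non-blank tokens
# receive the weighted LM log-probability.

import math

BLANK = 0  # assumption: index 0 is the blank token


def fuse_scores(am_logps, lm_logps, lm_weight=0.3):
    """Combine acoustic (Transducer) and LM log-probs for one step."""
    fused = []
    for tok, am in enumerate(am_logps):
        if tok == BLANK:
            fused.append(am)  # blank: acoustic score only, no LM term
        else:
            fused.append(am + lm_weight * lm_logps[tok])
    return fused


# toy example: 3-symbol vocabulary (blank, "a", "b")
am = [math.log(0.5), math.log(0.3), math.log(0.2)]
lm = [0.0, math.log(0.6), math.log(0.4)]
print(fuse_scores(am, lm))
```

Because the LM term is a negative log-probability added only to non-blank entries, blank's relative score rises after fusion; a revised blank scoring mechanism, as the abstract describes, aims to correct this imbalance.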