🤖 AI Summary
Traditional CTC beam search relies on sequential CPU execution, suffering from low hardware utilization and substantial CPU–GPU synchronization overhead. This paper presents FlexCTC, a fully GPU-accelerated CTC beam search decoder implemented natively in PyTorch/CUDA, supporting batched parallel decoding and end-to-end language model (LM) integration. The decoder uses CUDA Graphs to minimize kernel launch overhead and runs N-gram LMs, phrase-level boosting, and context-aware dynamic decoding natively on the GPU. The approach preserves recognition accuracy while significantly improving decoding throughput, eliminating cross-device data-movement bottlenecks, and enabling industrial-scale real-time ASR deployment. The implementation is open-sourced, offering high efficiency, scalability, and usability.
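To make the parallelization target concrete, the following is a minimal pure-Python sketch of the classic CTC prefix beam search that such a decoder batches and runs on GPU. It is an illustration of the algorithm only, not the toolkit's implementation; the function name and interface are hypothetical.

```python
from collections import defaultdict

def ctc_prefix_beam_search(probs, beam_width=4, blank=0):
    """Decode one utterance from per-frame symbol probabilities.

    probs: T x V list of lists; each row sums to 1 over the vocabulary.
    Each beam entry maps prefix -> (p_blank, p_non_blank): the total
    probability of all alignments collapsing to that prefix, split by
    whether the last frame was a blank.
    """
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        next_beams = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beams.items():
            for s, p in enumerate(frame):
                if s == blank:
                    # Blank extends the same prefix from either state.
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (nb_b + (p_b + p_nb) * p, nb_nb)
                elif prefix and prefix[-1] == s:
                    # Repeated symbol: only a blank gap yields a new token...
                    nb_b, nb_nb = next_beams[prefix + (s,)]
                    next_beams[prefix + (s,)] = (nb_b, nb_nb + p_b * p)
                    # ...otherwise the repeat collapses into the old prefix.
                    ob_b, ob_nb = next_beams[prefix]
                    next_beams[prefix] = (ob_b, ob_nb + p_nb * p)
                else:
                    nb_b, nb_nb = next_beams[prefix + (s,)]
                    next_beams[prefix + (s,)] = (nb_b, nb_nb + (p_b + p_nb) * p)
        # Prune to the top beam_width prefixes by total probability.
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: -(kv[1][0] + kv[1][1]))[:beam_width])
    best_prefix, (p_b, p_nb) = max(beams.items(),
                                   key=lambda kv: kv[1][0] + kv[1][1])
    return list(best_prefix), p_b + p_nb
```

The nested loops over beams and symbols at each timestep are what a GPU implementation flattens into batched tensor operations, and the per-timestep kernel sequence is what CUDA Graph capture amortizes.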
📝 Abstract
While beam search improves speech recognition quality over greedy decoding, standard implementations are slow, often sequential, and CPU-bound. To fully leverage modern hardware capabilities, we present FlexCTC, a novel open-source toolkit for fully GPU-based beam decoding, designed for Connectionist Temporal Classification (CTC) models. Developed entirely in Python and PyTorch, it offers a fast, user-friendly, and extensible alternative to traditional C++, CUDA, or WFST-based decoders. The toolkit features a high-performance, fully batched GPU implementation that eliminates CPU–GPU synchronization and minimizes kernel launch overhead via CUDA Graphs. It also supports advanced contextualization techniques, including GPU-powered N-gram language model fusion and phrase-level boosting. These features enable accurate and efficient decoding, making the toolkit suitable for both research and production use.
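The LM fusion and phrase boosting mentioned above typically amount to shallow fusion: each hypothesis's acoustic score is combined with a weighted LM score, a length bonus, and an additive reward for matched boost phrases. The sketch below shows that scoring rule in pure Python; all names, default weights, and the exact boosting rule are illustrative assumptions, not the toolkit's API.

```python
def fused_score(ctc_logprob, tokens, lm_logprob_fn, boost_phrases,
                lm_weight=0.5, word_bonus=1.0, boost_bonus=3.0):
    """Shallow-fusion score for one hypothesis (illustrative sketch).

    ctc_logprob:   acoustic log-probability of the token sequence.
    tokens:        hypothesized words.
    lm_logprob_fn: callable returning the LM log-probability of the words
                   (e.g. backed by an N-gram model).
    boost_phrases: phrases to reward when present (phrase-level boosting).
    """
    # Acoustic score + weighted LM score + per-word insertion bonus.
    score = (ctc_logprob
             + lm_weight * lm_logprob_fn(tokens)
             + word_bonus * len(tokens))
    # Phrase-level boosting: flat additive reward per matched phrase.
    text = " ".join(tokens)
    for phrase in boost_phrases:
        if phrase in text:
            score += boost_bonus
    return score
```

In a GPU decoder these terms are evaluated as batched tensor lookups over all active beams at once, which is what makes keeping the N-gram LM resident on the GPU worthwhile.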