🤖 AI Summary
This work addresses the performance gap between offline and low-latency streaming modes in unified automatic speech recognition (ASR) systems. The authors propose a unified RNN-Transducer (RNNT) framework that combines chunk-limited attention with right context and dynamic chunked convolutions, supporting both decoding modes in a single model. To further narrow the gap between modes, they introduce mode-consistency regularization for RNNT (MCR-RNNT), which encourages agreement across training modes and is implemented efficiently in Triton. The approach preserves offline accuracy while improving streaming accuracy at low latency, and it scales to larger models and training datasets. The unified ASR framework and the English model checkpoint are open-sourced.
📝 Abstract
Unification of automatic speech recognition (ASR) systems reduces development and maintenance costs, but training a single model to perform well in both offline and low-latency streaming settings remains challenging. We present a Unified ASR framework for Transducer (RNNT) training that supports both offline and streaming decoding within a single model, using chunk-limited attention with right context and dynamic chunked convolutions. To further close the gap between offline and streaming performance, we introduce an efficient Triton implementation of mode-consistency regularization for RNNT (MCR-RNNT), which encourages agreement across training modes. Experiments show that the proposed approach improves streaming accuracy at low latency while preserving offline performance and scaling to larger model sizes and training datasets. The proposed Unified ASR framework and the English model checkpoint are open-sourced.
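To make the idea of chunk-limited attention with right context concrete, the following is a minimal sketch of how such an attention mask could be built. It is not the authors' implementation: the function name `chunk_attention_mask` and the parameters `chunk_size`, `left_chunks`, and `right_context` are hypothetical, and the abstract does not specify how the framework parameterizes its chunks. The sketch assumes each frame may attend to its own chunk, a fixed number of preceding chunks, and a few look-ahead frames past the end of its chunk.

```python
def chunk_attention_mask(num_frames, chunk_size, left_chunks, right_context):
    """Build a boolean attention mask for chunk-limited attention.

    mask[i][j] is True when query frame i may attend to key frame j.
    All parameter names are illustrative assumptions, not the paper's API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    mask = []
    for i in range(num_frames):
        chunk = i // chunk_size  # index of the chunk containing frame i
        # Earliest visible frame: start of the oldest allowed left chunk.
        start = max(0, (chunk - left_chunks) * chunk_size)
        # One past the last visible frame: end of own chunk plus look-ahead.
        end = min(num_frames, (chunk + 1) * chunk_size + right_context)
        mask.append([start <= j < end for j in range(num_frames)])
    return mask
```

Under these assumptions, a small `chunk_size` with a short `right_context` yields a low-latency streaming mask, while setting `chunk_size` to at least `num_frames` makes every frame visible to every other frame, which corresponds to offline decoding. Switching the mask between such settings during training is one plausible way for a single model to serve both modes.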