AI Summary
In non-autoregressive (NAR) speech recognition, contextual biasing with a dynamic vocabulary is hindered by the assumption of conditional independence among tokens, which limits modeling capacity, while autoregressive alternatives avoid this limitation at the cost of inference efficiency. To address this, we propose the first self-conditioned NAR-CTC framework that injects dynamic vocabulary embeddings into intermediate layers of the encoder. Unlike conventional NAR models, our approach relaxes the token-level conditional independence assumption, enabling joint modeling of dynamic vocabularies and contextual information through intermediate-layer encoder conditioning. While preserving the inherent parallelism and efficiency of CTC-based NAR architectures, our method significantly improves adaptation to rare phrases. Experiments on LibriSpeech test-clean show an 81% reduction in real-time factor with only a 0.1 percentage-point increase in word error rate, marking the first demonstration of high-accuracy, high-efficiency contextual biasing within an NAR framework.
Abstract
Contextual biasing (CB) improves automatic speech recognition for rare and unseen phrases. Recent studies have introduced dynamic vocabulary, which represents context phrases as expandable tokens in autoregressive (AR) models. This method improves CB accuracy but slows inference. While dynamic vocabulary can be applied to non-autoregressive (NAR) models, such as connectionist temporal classification (CTC), the conditional independence assumption fails to capture dependencies between static and dynamic tokens. This paper proposes DYNAC (Dynamic Vocabulary-based NAR Contextualization), a self-conditioned CTC method that integrates dynamic vocabulary into intermediate layers. By conditioning the encoder on the dynamic vocabulary, DYNAC effectively captures dependencies between static and dynamic tokens while reducing the real-time factor (RTF). Experimental results show that DYNAC reduces RTF by 81% with a 0.1-point degradation in word error rate on the LibriSpeech 960 test-clean set.
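The core mechanism described above, conditioning intermediate encoder layers on CTC posteriors over a vocabulary extended with dynamic (bias-phrase) tokens, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; all dimensions, weights, and layer shapes are placeholder assumptions, and a simple feed-forward block stands in for the actual encoder layers.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 20, 16             # speech frames, hidden dimension (assumed)
V_static, V_dyn = 30, 5   # static vocab size, dynamic bias-phrase tokens (assumed)
V = V_static + V_dyn      # joint static + dynamic vocabulary

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encoder_layer(h, W):
    # stand-in for a real encoder block (e.g. Conformer); placeholder math
    return np.tanh(h @ W)

# random placeholder weights (a trained model would learn these)
layers = [rng.standard_normal((D, D)) * 0.1 for _ in range(4)]
W_out = rng.standard_normal((D, V)) * 0.1   # CTC head over the joint vocabulary
E_back = rng.standard_normal((V, D)) * 0.1  # maps posteriors back to hidden space

h = rng.standard_normal((T, D))             # acoustic features for T frames
for i, W in enumerate(layers):
    h = encoder_layer(h, W)
    if i == 1:
        # self-conditioning: intermediate CTC posterior over static + dynamic
        # tokens is projected back and added, so later layers see which
        # dynamic (bias) tokens the model currently predicts
        post = softmax(h @ W_out)
        h = h + post @ E_back

logits = h @ W_out                          # final CTC logits, decoded in parallel
print(logits.shape)                         # (20, 35)
```

Because the conditioning happens inside the encoder rather than through step-by-step decoding, the whole utterance is still processed in parallel, which is where the NAR efficiency advantage over AR dynamic-vocabulary models comes from.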