AI Summary
In non-autoregressive (NAR) speech recognition, contextual biasing with a dynamic vocabulary is hindered by the assumption of conditional independence among tokens, which limits modeling capacity, while autoregressive alternatives avoid this limitation at the cost of inference efficiency. To address this, we propose the first self-conditioned NAR-CTC framework that injects dynamic vocabulary embeddings into intermediate layers of the encoder. Unlike conventional NAR models, our approach relaxes the token-level conditional independence assumption, enabling joint modeling of dynamic vocabularies and contextual information through intermediate-layer encoder conditioning. While preserving the inherent parallelism and efficiency of CTC-based NAR architectures, our method significantly improves adaptation to rare phrases. Experiments on LibriSpeech test-clean show an 81% reduction in real-time factor with only a 0.1 percentage-point increase in word error rate, marking the first demonstration of high-accuracy, high-efficiency contextual biasing within an NAR framework.
Abstract
Contextual biasing (CB) improves automatic speech recognition for rare and unseen phrases. Recent studies have introduced dynamic vocabulary, which represents context phrases as expandable tokens in autoregressive (AR) models. This method improves CB accuracy but slows inference. While dynamic vocabulary can be applied to non-autoregressive (NAR) models, such as connectionist temporal classification (CTC), the conditional independence assumption fails to capture dependencies between static and dynamic tokens. This paper proposes DYNAC (Dynamic Vocabulary-based NAR Contextualization), a self-conditioned CTC method that integrates dynamic vocabulary into intermediate layers. By conditioning the encoder on the dynamic vocabulary, DYNAC effectively captures dependencies between static and dynamic tokens while reducing the real-time factor (RTF). Experimental results show that DYNAC reduces RTF by 81% with a 0.1-point degradation in word error rate on the LibriSpeech 960 test-clean set.
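The core mechanism described above, conditioning intermediate encoder layers on CTC posteriors over a vocabulary extended with dynamic (bias-phrase) tokens, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; all dimensions, weights, and layer shapes are placeholder assumptions, and a simple feed-forward block stands in for the actual encoder layers.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 20, 16             # speech frames, hidden dimension (assumed)
V_static, V_dyn = 30, 5   # static vocab size, dynamic bias-phrase tokens (assumed)
V = V_static + V_dyn      # joint static + dynamic vocabulary

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encoder_layer(h, W):
    # stand-in for a real encoder block (e.g. Conformer); placeholder math
    return np.tanh(h @ W)

# random placeholder weights (a trained model would learn these)
layers = [rng.standard_normal((D, D)) * 0.1 for _ in range(4)]
W_out = rng.standard_normal((D, V)) * 0.1   # CTC head over the joint vocabulary
E_back = rng.standard_normal((V, D)) * 0.1  # maps posteriors back to hidden space

h = rng.standard_normal((T, D))             # acoustic features for T frames
for i, W in enumerate(layers):
    h = encoder_layer(h, W)
    if i == 1:
        # self-conditioning: intermediate CTC posterior over static + dynamic
        # tokens is projected back and added, so later layers see which
        # dynamic (bias) tokens the model currently predicts
        post = softmax(h @ W_out)
        h = h + post @ E_back

logits = h @ W_out                          # final CTC logits, decoded in parallel
print(logits.shape)                         # (20, 35)
```

Because the conditioning happens inside the encoder rather than through step-by-step decoding, the whole utterance is still processed in parallel, which is where the NAR efficiency advantage over AR dynamic-vocabulary models comes from.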