Peeking Into The Future For Contextual Biasing

📅 2025-12-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
End-to-end automatic speech recognition (ASR) models achieve strong performance on general speech transcription but exhibit poor recognition accuracy for rare or out-of-vocabulary named entities (e.g., person and place names), limiting their applicability in virtual assistant scenarios. To address this, we propose a lightweight context biasing method tailored to attention-based encoder-decoder (AED) architectures. Our approach introduces a “look-ahead” multi-step future token joint prediction mechanism that performs end-to-end candidate entity matching directly in the decoder’s logit space—without requiring auxiliary entity encoders or cross-attention modules. This design significantly reduces architectural complexity while enhancing named entity awareness. Evaluated on LibriSpeech, our method reduces named entity word error rate by 50.34% relative to the baseline AED model, demonstrating both effectiveness and practicality.

📝 Abstract
While end-to-end (E2E) automatic speech recognition (ASR) models excel at general transcription, they struggle to recognize rare or unseen named entities (e.g., contact names, locations), which are critical for downstream applications like virtual assistants. In this paper, we propose a contextual biasing method for attention-based encoder-decoder (AED) models using a list of candidate named entities. Instead of predicting only the next token, we simultaneously predict multiple future tokens, enabling the model to "peek into the future" and score potential candidate entities in the entity list. Moreover, our approach leverages the multi-token prediction logits directly, without requiring additional entity encoders or cross-attention layers, significantly reducing architectural complexity. Experiments on LibriSpeech demonstrate that our approach achieves up to 50.34% relative improvement in named entity word error rate compared to the baseline AED model.
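To illustrate the core idea, the following is a minimal sketch of how candidate entities might be scored directly from multi-token prediction logits, with no entity encoder or cross-attention. All names here (`score_entities`, the head/entity shapes) are hypothetical illustrations, not the paper's actual implementation.

```python
import numpy as np

def score_entities(future_logits, entity_token_ids):
    """Score each candidate entity by summing its tokens' log-probabilities
    across the decoder's multi-step future-token prediction heads.

    future_logits: array of shape (num_heads, vocab_size) -- logits emitted
        by the k "look-ahead" prediction heads at the current decoding step.
    entity_token_ids: list of token-id sequences, one per candidate entity.
    Returns a length-normalized log-probability score per entity.
    """
    # Convert each head's logits to log-probabilities (log-softmax),
    # subtracting the max per head for numerical stability.
    shifted = future_logits - future_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    scores = []
    for tokens in entity_token_ids:
        # Align the i-th entity token with the i-th prediction head;
        # truncate entities longer than the look-ahead window.
        steps = min(len(tokens), log_probs.shape[0])
        score = sum(log_probs[i, tokens[i]] for i in range(steps)) / steps
        scores.append(score)
    return scores
```

In this toy setup, an entity whose tokens the look-ahead heads assign high probability receives a higher score, which could then be used to bias decoding toward that entity.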
Problem

Research questions and friction points this paper is trying to address.

Improves recognition of rare named entities in ASR
Uses multi-token future prediction for contextual biasing
Reduces architectural complexity without extra encoders or layers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Peek into future tokens for contextual biasing
Score candidate entities without extra encoders
Reduce complexity with multi-token prediction logits
Ramaneswaran Selvakumar
University Of Maryland, College Park
Deep Learning
Cindy Tseng
Samsung Research America
Automatic Speech Recognition, Natural Language Processing
Eesung Kim
Samsung Research America, USA
Vijendra Raj Apsingekar
Samsung Research America, USA
Yun Tang
Samsung Research America, USA