🤖 AI Summary
This work addresses the performance degradation that bimodal self-supervised speech models suffer in streaming processing due to the absence of future contextual information. To mitigate this limitation without introducing additional latency, the authors propose an "online register" mechanism that appends learnable virtual placeholders after each audio chunk. These registers are trained with a future prediction loss to capture forward-looking semantic cues, compensating for the context that is missing in real-time processing. The approach substantially narrows the gap between offline and online inference. Empirical evaluations on LibriSpeech and out-of-domain benchmarks demonstrate its efficacy, achieving a 3.4% relative word error rate reduction under a 160 ms chunking configuration.
📝 Abstract
Dual-mode self-supervised speech models (S3Ms), which are jointly pre-trained in offline and online modes, suffer from attention mismatch in streaming scenarios due to missing future context. To address this challenge, we propose online registers: learnable tokens appended to each chunk in online mode. These tokens act as virtual placeholders for unseen future frames, enabling the model to compensate for missing context without introducing additional latency. Furthermore, we introduce a future prediction loss that explicitly guides the registers to capture predictive cues, thereby enriching their ability to retain future information. Experiments on LibriSpeech and out-of-domain benchmarks demonstrate that online registers consistently reduce the performance gap between offline and online modes, especially in low-latency settings, achieving a 3.4% relative improvement on LibriSpeech with 160 ms chunks.
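The two ingredients of the method can be illustrated with a minimal NumPy sketch. This is an assumed toy rendering, not the paper's implementation: `append_registers` and `future_prediction_loss` are hypothetical names, frame features are plain arrays, and the future prediction loss is shown here as a simple MSE against the next chunk's frames rather than the paper's exact objective.

```python
import numpy as np

def append_registers(chunk, registers):
    """Append learnable register vectors after an audio chunk.

    chunk:     (T, D) frame features for one chunk
    registers: (R, D) learnable register embeddings, shared across chunks;
               they stand in for the unseen future frames so chunked
               attention sees no hard right edge.
    returns:   (T + R, D) sequence the online encoder attends over
    """
    return np.concatenate([chunk, registers], axis=0)

def future_prediction_loss(register_outputs, future_frames):
    """Toy future prediction loss (an assumption, not the paper's exact
    form): mean squared error between the encoder's outputs at the
    register positions and the first frames of the next chunk, pushing
    the registers to encode forward-looking cues."""
    n = min(len(register_outputs), len(future_frames))
    diff = register_outputs[:n] - future_frames[:n]
    return float(np.mean(diff ** 2))

# Usage sketch: a 10-frame chunk with 2 registers of dimension 4.
chunk = np.zeros((10, 4))
registers = np.ones((2, 4))          # would be trainable in practice
extended = append_registers(chunk, registers)
```

Because the registers are appended after the chunk rather than read from the input stream, no extra audio has to be buffered, which is why the mechanism adds no latency.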