Chunk-wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text

📅 2026-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of conventional RNN-Transducer (RNN-T) models in streaming speech recognition and translation: restricted local modeling capacity caused by strict monotonic alignment constraints, and high computational and memory costs. To overcome these issues, the authors propose the Chunk-wise Attention Transducer (CHAT), the first approach to integrate intra-chunk cross-attention into the RNN-T architecture. CHAT preserves RNN-T's streaming behavior while strengthening local alignment modeling, and it proves especially effective for speech translation, where strict monotonicity hurts performance. Empirically, CHAT reduces peak training memory by up to 46.2%, accelerates training and inference by up to 1.36× and 1.69× respectively, lowers word error rate (WER) in speech recognition by up to 6.3% relative, and improves BLEU scores in speech translation by up to 18.0%.

📝 Abstract
We propose Chunk-wise Attention Transducer (CHAT), a novel extension to RNN-T models that processes audio in fixed-size chunks while employing cross-attention within each chunk. This hybrid approach maintains RNN-T's streaming capability while introducing controlled flexibility for local alignment modeling. CHAT significantly reduces the temporal dimension that RNN-T must handle, yielding substantial efficiency improvements: up to 46.2% reduction in peak training memory, up to 1.36× faster training, and up to 1.69× faster inference. Alongside these efficiency gains, CHAT achieves consistent accuracy improvements over RNN-T across multiple languages and tasks -- up to 6.3% relative WER reduction for speech recognition and up to 18.0% BLEU improvement for speech translation. The method proves particularly effective for speech translation, where RNN-T's strict monotonic alignment hurts performance. Our results demonstrate that the CHAT model offers a practical solution for deploying more capable streaming speech models without sacrificing real-time constraints.
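The core idea in the abstract -- attend within each fixed-size chunk so the transducer aligns over chunks instead of individual frames -- can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the single-query-per-chunk shape, and the plain dot-product softmax are all illustrative assumptions.

```python
import numpy as np

def chunkwise_cross_attention(encoder_frames, chunk_queries, chunk_size):
    """Illustrative intra-chunk cross-attention (not the paper's exact model).

    Each chunk's query attends only to encoder frames inside its own chunk,
    so streaming is preserved: no chunk looks at future audio.
    """
    T, d = encoder_frames.shape
    n_chunks = (T + chunk_size - 1) // chunk_size  # ceil(T / chunk_size)
    outputs = []
    for c in range(n_chunks):
        keys = encoder_frames[c * chunk_size:(c + 1) * chunk_size]  # frames of chunk c
        q = chunk_queries[c]                                        # (d,) query, assumed one per chunk
        scores = keys @ q / np.sqrt(d)       # scaled dot-product logits over chunk frames
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()             # softmax restricted to this chunk
        outputs.append(weights @ keys)       # one summary vector per chunk
    return np.stack(outputs)                 # (n_chunks, d): reduced temporal dimension

# 10 input frames collapse to ceil(10/4) = 3 chunk vectors for the transducer.
enc = np.random.default_rng(0).normal(size=(10, 4))
queries = np.random.default_rng(1).normal(size=(3, 4))
print(chunkwise_cross_attention(enc, queries, chunk_size=4).shape)  # (3, 4)
```

The efficiency claims follow from the output shape: the transducer's alignment lattice now spans `n_chunks` positions rather than `T` frames, which is what shrinks memory and speeds up training and decoding.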
Problem

Research questions and friction points this paper is trying to address.

streaming speech recognition
RNN-T
monotonic alignment
speech translation
efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chunk-wise Attention
Streaming Speech Recognition
RNN-T
Cross-Attention
Speech Translation