Learning Monotonic Attention in Transducer for Streaming Generation

📅 2024-11-26
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Transducer architectures struggle to model non-monotonic alignments—such as those arising in simultaneous translation—due to their inherent monotonic alignment constraint. To address this, we propose MonoAttn-Transducer, the first end-to-end trainable Transducer variant that integrates a monotonic attention mechanism into the decoder. Leveraging the forward-backward algorithm, it explicitly models the posterior distribution over alignments between prediction states and input timestamps, enabling context-aware, dynamic adjustment of the attention span. By avoiding exhaustive enumeration of the exponential alignment space, our approach balances modeling flexibility with computational efficiency. Evaluated on simultaneous translation—a canonical non-monotonic streaming generation task—MonoAttn-Transducer achieves state-of-the-art performance. Moreover, it significantly enhances the generalization capability of industrial-scale Transducer frameworks to complex, non-monotonic alignment patterns, without compromising inference speed or training stability.

📝 Abstract
Streaming generation models are increasingly used across many fields, with the Transducer architecture being particularly popular in industrial applications. However, its input-synchronous decoding mechanism struggles in tasks requiring non-monotonic alignments, such as simultaneous translation, leading to suboptimal performance in these settings. In this research, we address this issue by tightly integrating the Transducer's decoding with the history of the input stream via a learnable monotonic attention mechanism. Our approach leverages the forward-backward algorithm to infer the posterior probability of alignments between predictor states and input timestamps, which is then used to estimate the context representations of monotonic attention during training. This allows Transducer models to adaptively adjust the scope of attention based on their predictions, avoiding the need to enumerate the exponentially large alignment space. Extensive experiments demonstrate that our MonoAttn-Transducer significantly improves the handling of non-monotonic alignments in streaming generation, offering a robust way for Transducer-based frameworks to tackle more complex streaming generation tasks.
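The core idea in the abstract can be sketched numerically. The following is a minimal toy illustration, not the paper's implementation: it runs the forward-backward algorithm over a Transducer-style monotonic alignment lattice to obtain the posterior probability that each prediction state is emitted at each input frame, then uses that posterior to form an expected monotonic-attention context. All probabilities, dimensions, and encoder/decoder states below are made-up placeholders.

```python
import numpy as np

T, U = 6, 4                              # source frames, target prediction states
rng = np.random.default_rng(0)
# Hypothetical emission probabilities at lattice node (t, u):
# p(emit token u while reading frame t); blank advances to frame t+1.
emit = rng.uniform(0.2, 0.8, (T, U))

def blank(t, u):
    # Probability of advancing a frame; forced once all tokens are emitted.
    return 1.0 - emit[t, u] if u < U else 1.0

# Forward pass: alpha[t, u] = prob of having consumed t frames and emitted u tokens.
alpha = np.zeros((T + 1, U + 1))
alpha[0, 0] = 1.0
for t in range(T + 1):
    for u in range(U + 1):
        if t > 0:                        # arrived via a blank from frame t-1
            alpha[t, u] += alpha[t - 1, u] * blank(t - 1, u)
        if u > 0 and t < T:              # arrived by emitting token u-1 at frame t
            alpha[t, u] += alpha[t, u - 1] * emit[t, u - 1]

# Backward pass: beta[t, u] = prob of completing the alignment from node (t, u).
beta = np.zeros((T + 1, U + 1))
beta[T, U] = 1.0
for t in range(T, -1, -1):
    for u in range(U, -1, -1):
        if t < T:
            beta[t, u] += blank(t, u) * beta[t + 1, u]
        if u < U and t < T:
            beta[t, u] += emit[t, u] * beta[t, u + 1]

Z = alpha[T, U]                          # total probability of complete alignments
# Posterior p_align[u, t]: token u is produced while reading frame t.
p_align = np.zeros((U, T))
for u in range(U):
    for t in range(T):
        p_align[u, t] = alpha[t, u] * emit[t, u] * beta[t, u + 1] / Z
# Each row sums to 1: every complete path emits token u at exactly one frame.

# Expected monotonic-attention context: under each alignment hypothesis the
# decoder may attend only to frames already read; weight by the posterior.
d = 8
enc = rng.standard_normal((T, d))        # hypothetical encoder states
q = rng.standard_normal((U, d))          # hypothetical decoder queries
ctx = np.zeros((U, d))
for u in range(U):
    for t in range(T):
        scores = enc[: t + 1] @ q[u]
        w = np.exp(scores - scores.max())
        w /= w.sum()                     # softmax over the readable prefix 0..t
        ctx[u] += p_align[u, t] * (w @ enc[: t + 1])
```

This mirrors the abstract's claim at toy scale: the forward-backward recursion covers the exponentially many alignments in O(T·U) time, and the resulting posterior lets each prediction state's attention span expand adaptively rather than being fixed in advance.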
Problem

Research questions and friction points this paper is trying to address.

Addresses non-monotonic alignment challenges in Transducer models
Proposes learnable monotonic attention for streaming generation tasks
Enhances Transducer performance in complex generation scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Transducer with learnable monotonic attention
Uses forward-backward algorithm for alignment probability
Estimates context representations without exponential alignment space
Zhengrui Ma
Institute of Computing Technology, Chinese Academy of Sciences
Yang Feng
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; Key Laboratory of AI Safety, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Min Zhang
School of Future Science and Engineering, Soochow University