Learning Monotonic Attention in Transducer for Streaming Generation

📅 2024-11-26
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Transducer architectures struggle to model non-monotonic alignments—such as those arising in simultaneous translation—due to their inherent monotonic alignment constraint. To address this, we propose MonoAttn-Transducer, the first end-to-end trainable Transducer variant that integrates a monotonic attention mechanism into the decoder. Leveraging the forward-backward algorithm, it explicitly models the posterior distribution over alignments between prediction states and input timestamps, enabling context-aware, dynamic adjustment of the attention span. By avoiding exhaustive enumeration of the exponential alignment space, our approach balances modeling flexibility with computational efficiency. Evaluated on simultaneous translation—a canonical non-monotonic streaming generation task—MonoAttn-Transducer achieves state-of-the-art performance. Moreover, it significantly enhances the generalization capability of industrial-scale Transducer frameworks to complex, non-monotonic alignment patterns, without compromising inference speed or training stability.

📝 Abstract
Streaming generation models are increasingly used across many fields, with the Transducer architecture being particularly popular in industrial applications. However, its input-synchronous decoding mechanism struggles in tasks requiring non-monotonic alignments, such as simultaneous translation, leading to suboptimal performance in these settings. In this research, we address this issue by tightly integrating the Transducer's decoding with the history of the input stream via a learnable monotonic attention mechanism. Our approach leverages the forward-backward algorithm to infer the posterior probability of alignments between predictor states and input timestamps, which is then used to estimate the context representations of monotonic attention during training. This allows Transducer models to adaptively adjust the scope of attention based on their predictions, avoiding the need to enumerate the exponentially large alignment space. Extensive experiments demonstrate that our MonoAttn-Transducer significantly improves the handling of non-monotonic alignments in streaming generation, offering a robust way for Transducer-based frameworks to tackle more complex streaming generation tasks.
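The core idea in the abstract can be sketched numerically. The following is a minimal toy illustration, not the paper's implementation: it runs the forward-backward algorithm over a Transducer-style monotonic alignment lattice to obtain the posterior probability that each prediction state is emitted at each input frame, then uses that posterior to form an expected monotonic-attention context. All probabilities, dimensions, and encoder/decoder states below are made-up placeholders.

```python
import numpy as np

T, U = 6, 4                              # source frames, target prediction states
rng = np.random.default_rng(0)
# Hypothetical emission probabilities at lattice node (t, u):
# p(emit token u while reading frame t); blank advances to frame t+1.
emit = rng.uniform(0.2, 0.8, (T, U))

def blank(t, u):
    # Probability of advancing a frame; forced once all tokens are emitted.
    return 1.0 - emit[t, u] if u < U else 1.0

# Forward pass: alpha[t, u] = prob of having consumed t frames and emitted u tokens.
alpha = np.zeros((T + 1, U + 1))
alpha[0, 0] = 1.0
for t in range(T + 1):
    for u in range(U + 1):
        if t > 0:                        # arrived via a blank from frame t-1
            alpha[t, u] += alpha[t - 1, u] * blank(t - 1, u)
        if u > 0 and t < T:              # arrived by emitting token u-1 at frame t
            alpha[t, u] += alpha[t, u - 1] * emit[t, u - 1]

# Backward pass: beta[t, u] = prob of completing the alignment from node (t, u).
beta = np.zeros((T + 1, U + 1))
beta[T, U] = 1.0
for t in range(T, -1, -1):
    for u in range(U, -1, -1):
        if t < T:
            beta[t, u] += blank(t, u) * beta[t + 1, u]
        if u < U and t < T:
            beta[t, u] += emit[t, u] * beta[t, u + 1]

Z = alpha[T, U]                          # total probability of complete alignments
# Posterior p_align[u, t]: token u is produced while reading frame t.
p_align = np.zeros((U, T))
for u in range(U):
    for t in range(T):
        p_align[u, t] = alpha[t, u] * emit[t, u] * beta[t, u + 1] / Z
# Each row sums to 1: every complete path emits token u at exactly one frame.

# Expected monotonic-attention context: under each alignment hypothesis the
# decoder may attend only to frames already read; weight by the posterior.
d = 8
enc = rng.standard_normal((T, d))        # hypothetical encoder states
q = rng.standard_normal((U, d))          # hypothetical decoder queries
ctx = np.zeros((U, d))
for u in range(U):
    for t in range(T):
        scores = enc[: t + 1] @ q[u]
        w = np.exp(scores - scores.max())
        w /= w.sum()                     # softmax over the readable prefix 0..t
        ctx[u] += p_align[u, t] * (w @ enc[: t + 1])
```

This mirrors the abstract's claim at toy scale: the forward-backward recursion covers the exponentially many alignments in O(T·U) time, and the resulting posterior lets each prediction state's attention span expand adaptively rather than being fixed in advance.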
Problem

Research questions and friction points this paper is trying to address.

Addresses non-monotonic alignment challenges in Transducer models
Proposes learnable monotonic attention for streaming generation tasks
Enhances Transducer performance in complex generation scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Transducer with learnable monotonic attention
Uses forward-backward algorithm for alignment probability
Estimates context representations without exponential alignment space
Zhengrui Ma
Institute of Computing Technology, Chinese Academy of Sciences
Yang Feng
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; Key Laboratory of AI Safety, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Min Zhang
School of Future Science and Engineering, Soochow University