Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers

📅 2025-02-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
In automatic speech recognition (ASR), handling alignment jointly across the encoder and decoder introduces structural complexity and hampers inference efficiency. Method: this paper proposes the Aligner-Encoder, a Transformer encoder with an intrinsic self-transduction capability that aligns acoustic frames to text units end-to-end during the forward pass, with no decoder involvement in the alignment computation. Crucially, it decouples alignment modeling from semantic representation: alignment is captured explicitly in the self-attention weights of designated layers, while semantics are modeled elsewhere; this design supports lightweight RNN-T-style decoding. Training uses a frame-level cross-entropy loss and eliminates cross-attention. Results: the Aligner-Encoder is competitive with state-of-the-art methods; inference is 2× as fast as RNN-T and 16× as fast as attention-based encoder-decoder (AED) models; it significantly outperforms baselines on long-form speech tasks; and its alignment process is both interpretable and directly observable.

๐Ÿ“ Abstract
Modern systems for automatic speech recognition, including the RNN-Transducer and Attention-based Encoder-Decoder (AED), are designed so that the encoder is not required to alter the time-position of information from the audio sequence into the embedding; alignment to the final text output is processed during decoding. We discover that the transformer-based encoder adopted in recent years is actually capable of performing the alignment internally during the forward pass, prior to decoding. This new phenomenon enables a simpler and more efficient model, the "Aligner-Encoder". To train it, we discard the dynamic programming of RNN-T in favor of the frame-wise cross-entropy loss of AED, while the decoder employs the lighter text-only recurrence of RNN-T without learned cross-attention -- it simply scans embedding frames in order from the beginning, producing one token each until predicting the end-of-message. We conduct experiments demonstrating performance remarkably close to the state of the art, including a special inference configuration enabling long-form recognition. In a representative comparison, we measure the total inference time for our model to be 2x faster than RNN-T and 16x faster than AED. Lastly, we find that the audio-text alignment is clearly visible in the self-attention weights of a certain layer, which could be said to perform "self-transduction".
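The decoding procedure the abstract describes (scan aligned frames in order, emit one token per frame, stop at end-of-message) can be sketched as below. This is a minimal illustration, not the authors' implementation: the `predict` callable stands in for the learned text-only recurrence of the RNN-T-style decoder, and all names and shapes here are assumptions.

```python
# Minimal sketch of Aligner-Encoder greedy decoding (hypothetical names).
# Because the encoder has already aligned audio to text internally, frame t
# of the embedding corresponds to output token t; the decoder simply scans
# frames left to right with no cross-attention.

EOS = "<eos>"

def aligner_decode(encoder_frames, predict):
    """Scan aligned encoder frames in order, emitting one token per frame.

    encoder_frames: sequence of per-frame embeddings (any objects).
    predict(frame, prev_token): returns the next token; a stand-in for the
        lightweight text-only recurrence of RNN-T.
    """
    tokens = []
    prev = "<sos>"
    for frame in encoder_frames:
        tok = predict(frame, prev)
        if tok == EOS:          # end-of-message predicted: stop scanning
            break
        tokens.append(tok)
        prev = tok
    return tokens

# Toy "predictor": each frame already stores its token, as if the encoder
# performed the alignment itself.
frames = ["h", "i", EOS, "pad"]
print(aligner_decode(frames, lambda f, p: f))  # -> ['h', 'i']
```

Note the contrast with RNN-T decoding, which must search over frame/token alignments at inference time; here the loop is a single linear pass.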
Problem

Research questions and friction points this paper is trying to address.

Transformer encoder performs internal audio-text alignment.
Aligner-Encoder model simplifies speech recognition architecture.
Self-attention weights visualize audio-text alignment process.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based encoder alignment
Frame-wise cross-entropy loss training
Light text-only recurrence decoder
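The frame-wise cross-entropy objective mentioned above can be sketched numerically as follows. This is an illustrative stand-in under assumed shapes and padding conventions, not the paper's exact recipe: since the encoder aligns audio to text internally, frame t is trained to predict target token t directly, and no RNN-T lattice or dynamic programming is needed.

```python
import math

# Sketch of a frame-wise cross-entropy loss (assumed shapes: one target
# token id per frame, with <eos>/padding targets included in the sequence).

def frame_wise_ce(logits, targets):
    """Mean cross-entropy over frames.

    logits: list of per-frame score lists over the vocabulary.
    targets: list of target token ids, one per frame.
    """
    total = 0.0
    for scores, t in zip(logits, targets):
        z = math.log(sum(math.exp(s) for s in scores))  # log-partition
        total += z - scores[t]                          # -log softmax[t]
    return total / len(targets)

# Two frames, vocab size 3: frame 0 should predict token 2, frame 1 token 0.
logits = [[0.1, 0.2, 2.0], [1.5, 0.0, 0.0]]
print(frame_wise_ce(logits, [2, 0]))
```

In a real system this would be the standard per-position cross-entropy used by AED training (e.g. a softmax over a subword vocabulary at each frame), applied here without any cross-attention between decoder and encoder.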