🤖 AI Summary
AED models suffer from decoding-order errors in long-form speech recognition due to the absence of explicit positional awareness in their acoustic encodings: segment-wise training forces the model to rely on local acoustic cues for frame positioning, and these cues vanish for long inputs; moreover, the permutation invariance of keys and values in cross-attention impairs the model's ability to order the sequence. This paper proposes a segmented attention mechanism addressing these issues via four synergistic components: (1) explicit absolute position injection, (2) training with long-context acoustic modeling, (3) segment-level semantic alignment, and (4) concatenative autoregressive decoding. It is the first work to bridge the performance gap between continuous and segmented speech encoding within the Transformer architecture. The approach significantly improves end-to-end autoregressive transcription accuracy for long-form speech and eliminates the boundary errors inherent in conventional segmentation-based processing.
📝 Abstract
We address the fundamental incompatibility of attention-based encoder-decoder (AED) models with long-form acoustic encodings. AED models trained on segmented utterances learn to encode absolute frame positions by exploiting the limited acoustic context beyond segment boundaries, but fail to generalize when decoding long-form segments where these cues vanish. The model loses the ability to order acoustic encodings due to the permutation invariance of keys and values in cross-attention. We propose four modifications: (1) injecting explicit absolute positional encodings into cross-attention for each decoded segment, (2) long-form training with extended acoustic context to eliminate implicit absolute position encoding, (3) segment concatenation to cover the diverse segmentations needed during training, and (4) semantic segmentation to align AED-decoded segments with training segments. We show these modifications close the accuracy gap between continuous and segmented acoustic encodings, enabling autoregressive use of the attention decoder.
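The first modification can be illustrated concretely. Below is a minimal NumPy sketch of injecting absolute positional encodings into cross-attention: sinusoidal encodings with a per-segment frame offset are added to the keys, so the decoder sees each segment's true absolute position rather than relying on implicit acoustic cues. The function names, the single-head formulation, and the choice to inject into keys only are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def sinusoidal_pe(num_frames, d_model, offset=0):
    """Sinusoidal absolute positional encodings; `offset` shifts the
    positions so a decoded segment keeps its absolute frame index
    within the long-form input (hypothetical helper)."""
    pos = np.arange(offset, offset + num_frames)[:, None].astype(float)
    dim = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (dim // 2)) / d_model)
    # even dimensions use sin, odd dimensions use cos
    return np.where(dim % 2 == 0, np.sin(angle), np.cos(angle))

def cross_attention_with_abs_pos(queries, enc, seg_offset):
    """Single-head cross-attention where absolute positional encodings
    are injected into the keys, breaking the permutation invariance of
    the key/value arrangement (illustrative sketch of modification 1)."""
    d_model = enc.shape[-1]
    keys = enc + sinusoidal_pe(enc.shape[0], d_model, offset=seg_offset)
    values = enc  # values are left position-free in this sketch
    scores = queries @ keys.T / np.sqrt(d_model)
    # numerically stable softmax over encoder frames
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values
```

Because the offset enters only through the keys, permuting the encoder frames now changes the attention scores, which is exactly the ordering signal that plain cross-attention lacks.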