Segmental Attention Decoding With Long Form Acoustic Encodings

📅 2025-12-16
🤖 AI Summary
AED models suffer from disordered decoding in long-form speech recognition due to the absence of explicit positional awareness in acoustic encodings: segment-wise training forces the model to rely on local acoustic cues for frame positioning, and these cues vanish under long inputs; moreover, the permutation invariance of key-value arrangements in cross-attention impairs the model's ability to order the sequence. This paper proposes a segmental attention mechanism addressing these issues via four synergistic components: (1) explicit absolute position injection, (2) long-context acoustic modeling during training, (3) segment-level semantic alignment, and (4) concatenative autoregressive decoding. It is the first work to bridge the performance gap between continuous and segmented speech encoding within the Transformer architecture. The approach significantly improves end-to-end autoregressive transcription accuracy for long-form speech and eliminates the boundary errors inherent in conventional segmentation-based processing.

📝 Abstract
We address the fundamental incompatibility of attention-based encoder-decoder (AED) models with long-form acoustic encodings. AED models trained on segmented utterances learn to encode absolute frame positions by exploiting limited acoustic context beyond segment boundaries, but fail to generalize when decoding long-form segments where these cues vanish. The model loses the ability to order acoustic encodings due to the permutation invariance of keys and values in cross-attention. We propose four modifications: (1) injecting explicit absolute positional encodings into cross-attention for each decoded segment, (2) long-form training with extended acoustic context to eliminate implicit absolute position encoding, (3) segment concatenation to cover the diverse segmentations needed during training, and (4) semantic segmentation to align AED-decoded segments with training segments. We show these modifications close the accuracy gap between continuous and segmented acoustic encodings, enabling auto-regressive use of the attention decoder.
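Modification (1) can be illustrated with a minimal sketch. The snippet below is not the paper's implementation; it assumes standard sinusoidal encodings and single-head dot-product attention, and shows the core idea: keys receive the absolute frame position (segment offset plus local index), so attention weights depend on where a frame sits in the full recording, not just its position within the current segment.

```python
import numpy as np

def sinusoidal_pe(positions, d_model):
    """Standard sinusoidal positional encodings for the given absolute positions."""
    pe = np.zeros((len(positions), d_model))
    div = np.exp(-np.log(10000.0) * np.arange(0, d_model, 2) / d_model)
    pe[:, 0::2] = np.sin(positions[:, None] * div)
    pe[:, 1::2] = np.cos(positions[:, None] * div)
    return pe

def cross_attention_with_abs_pos(queries, enc, seg_start):
    """Single-head cross-attention whose keys carry absolute frame positions.

    `enc` is the acoustic encoding of one decoded segment (T x d), and
    `seg_start` is that segment's absolute frame offset in the long-form
    input. Injecting seg_start + local index into the keys restores the
    ordering information that permutation-invariant cross-attention loses.
    """
    T, d = enc.shape
    abs_pos = np.arange(seg_start, seg_start + T)
    keys = enc + sinusoidal_pe(abs_pos, d)            # inject absolute positions
    scores = queries @ keys.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ enc                              # values stay position-free
```

Shifting `seg_start` changes the attention pattern even for identical acoustic content, which is exactly the position sensitivity the decoder needs to order long-form encodings.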
Problem

Research questions and friction points this paper is trying to address.

Addresses incompatibility of AED models with long acoustic encodings
Solves loss of ordering in cross-attention for long-form decoding
Enables auto-regressive decoding with continuous acoustic encodings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inject absolute positional encodings into cross-attention
Train with extended acoustic context to remove implicit cues
Use semantic segmentation to align decoded and training segments
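The decoding loop implied by these ideas can be sketched as follows. This is a hypothetical outline, not the paper's code: `decode_segment` stands in for an AED beam-search step, and the key property shown is that the long-form encoding stays continuous while the decoder walks over it window by window, conditioning each segment on the transcript concatenated so far.

```python
def decode_long_form(encodings, seg_len, decode_segment):
    """Concatenative autoregressive decoding over a continuous encoding.

    `encodings` is the full long-form acoustic encoding (never re-encoded
    per segment), `seg_len` the decoding window in frames, and
    `decode_segment(segment, seg_start, prefix)` a hypothetical callable
    returning the token sequence for one window given the absolute frame
    offset and the previously decoded transcript.
    """
    transcript = []
    for seg_start in range(0, len(encodings), seg_len):
        segment = encodings[seg_start:seg_start + seg_len]
        tokens = decode_segment(segment, seg_start, tuple(transcript))
        transcript.extend(tokens)   # concatenate segment outputs
    return transcript
```

Because every window is cut from one continuous encoding, no acoustic context is discarded at segment boundaries, which is what lets this scheme avoid the boundary errors of conventional pre-segmented processing.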
Authors
Pawel Swietojanski (Apple, USA)
Xinwei Li (Apple, USA)
Mingbin Xu (Apple, USA)
Takaaki Hori (Apple)
Dogan Can (Apple, USA)
Xiaodan Zhuang (Apple, USA)

Topics: Speech Recognition · Spoken Language Processing · Machine Learning