AdaST: Dynamically Adapting Encoder States in the Decoder for End-to-End Speech-to-Text Translation

πŸ“… 2025-03-18
πŸ›οΈ Findings
πŸ“ˆ Citations: 4
✨ Influential: 0
πŸ€– AI Summary
In end-to-end speech translation, static acoustic representations generated by the encoder struggle to satisfy the decoder’s dual demands of cross-modal and cross-lingual modeling. To address this, we propose a dynamic acoustic representation mechanism: acoustic states and target word embeddings are jointly modeled as a unified sequence input, and a speech-text hybrid attention sublayer replaces conventional cross-attention to enable real-time, adaptive modulation of acoustic representations during decoding. Our approach integrates sequence concatenation, dynamic state modulation, and an end-to-end Transformer architecture. Evaluated on MuST-C and CoVoST2 benchmarks, the model substantially outperforms state-of-the-art methods, achieving BLEU gains of 2.1–3.4 points. Notably, it is the first to realize collaborative, dynamic optimization between acoustic representations and textual decoding states.

πŸ“ Abstract
In end-to-end speech translation, acoustic representations learned by the encoder are usually fixed and static from the perspective of the decoder, which is undesirable for dealing with the cross-modal and cross-lingual challenges in speech translation. In this paper, we show the benefits of varying acoustic states according to decoder hidden states and propose an adaptive speech-to-text translation model that dynamically adapts acoustic states in the decoder. We concatenate the acoustic state sequence and the target word embedding sequence and feed the concatenated sequence into the subsequent blocks of the decoder. To model the deep interaction between acoustic states and target hidden states, a speech-text mixed attention sublayer is introduced to replace the conventional cross-attention network. Experimental results on two widely used datasets show that the proposed method significantly outperforms state-of-the-art neural speech translation models.
Problem

Research questions and friction points this paper is trying to address.

Dynamic adaptation of acoustic states in speech translation.
Addressing the cross-modal and cross-lingual challenges in speech translation.
Enhancing decoder performance with adaptive speech-to-text models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic adaptation of encoder states in decoder.
Concatenation of acoustic state and target embeddings.
Speech-text mixed attention replaces cross-attention.
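The mechanism above can be sketched in a few lines: instead of a cross-attention sublayer that reads a fixed encoder memory, each target position attends over the concatenation of acoustic states and target embeddings, so the acoustic representations are re-weighted at every decoding step. This is a minimal single-head, pure-Python illustration under our own assumptions (no learned projections, no causal mask, toy dimensions); the names `mixed_attention`, `acoustic`, and `target` are illustrative, not taken from the paper's code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mixed_attention(acoustic, target, d):
    """Toy speech-text mixed attention.

    acoustic: list of d-dim acoustic state vectors (from the speech encoder)
    target:   list of d-dim target word-embedding vectors (decoder input)

    Queries come from the target positions; keys and values come from the
    unified sequence [acoustic; target], so each decoding step computes its
    own weighting over the acoustic states instead of reading them through
    a static cross-attention memory.
    """
    seq = acoustic + target  # unified speech-text sequence
    out = []
    for q in target:
        scores = [dot(q, k) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        # Weighted sum of values: one adapted state per target position.
        out.append([sum(w * v[j] for w, v in zip(weights, seq))
                    for j in range(d)])
    return out
```

A full implementation would add learned query/key/value projections, multiple heads, and a mask that keeps target positions causal while leaving all acoustic positions visible; the sketch only shows the joint speech-text attention pattern.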