π€ AI Summary
In end-to-end speech translation, static acoustic representations generated by the encoder struggle to satisfy the decoderβs dual demands of cross-modal and cross-lingual modeling. To address this, we propose a dynamic acoustic representation mechanism: acoustic states and target word embeddings are jointly modeled as a unified sequence input, and a speech-text hybrid attention sublayer replaces conventional cross-attention to enable real-time, adaptive modulation of acoustic representations during decoding. Our approach integrates sequence concatenation, dynamic state modulation, and an end-to-end Transformer architecture. Evaluated on MuST-C and CoVoST2 benchmarks, the model substantially outperforms state-of-the-art methods, achieving BLEU gains of 2.1β3.4 points. Notably, it is the first to realize collaborative, dynamic optimization between acoustic representations and textual decoding states.
π Abstract
In end-to-end speech translation, acoustic representations learned by the encoder are usually fixed and static, from the perspective of the decoder, which is not desirable for dealing with the cross-modal and cross-lingual challenge in speech translation. In this paper, we show the benefits of varying acoustic states according to decoder hidden states and propose an adaptive speech-to-text translation model that is able to dynamically adapt acoustic states in the decoder. We concatenate the acoustic state and target word embedding sequence and feed the concatenated sequence into subsequent blocks in the decoder. In order to model the deep interaction between acoustic states and target hidden states, a speech-text mixed attention sublayer is introduced to replace the conventional cross-attention network. Experiment results on two widely-used datasets show that the proposed method significantly outperforms state-of-the-art neural speech translation models.