🤖 AI Summary
This work addresses the challenge of jointly modeling end-to-end automatic speech recognition (ASR) and speech translation (ST). We propose a unified architecture integrating a pretrained speech encoder with a large language model (LLM) as a shared text decoder. Methodologically, we employ multimodal alignment and end-to-end joint training to directly map speech signals to target-language text, while explicitly modeling both task-shared representations and task-specific characteristics for ASR and ST. On English-to-German translation, our model achieves up to an 8% improvement in $\text{COMET}^{\text{DA}}_{22}$ score, outperforming mainstream end-to-end baselines such as SeamlessM4T and matching the performance of cascaded ASR+MT systems. The key contribution is the first deep integration of an LLM as a general-purpose text decoder downstream of a speech encoder, enabling multi-task joint optimization. This design preserves the simplicity of end-to-end modeling while significantly enhancing translation quality and robustness.
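The architecture described above can be sketched at a high level: a pretrained speech encoder produces acoustic features, a projection layer aligns them with the LLM's embedding space, and a task prompt steers the shared decoder toward ASR or ST. This is a minimal illustrative sketch; all class names, dimensions, and prompt tokens below are assumptions for illustration, not the paper's actual implementation.

```python
# Toy sketch of a speech-encoder + LLM pipeline with a shared decoder.
# ENC_DIM/LLM_DIM, class names, and prompt tokens are hypothetical.
import random

ENC_DIM, LLM_DIM = 4, 6  # toy feature/embedding dimensions

class SpeechEncoder:
    """Stand-in for a pretrained speech encoder (assumption, not the paper's)."""
    def encode(self, audio):
        # One ENC_DIM-dim feature vector per audio frame.
        return [[random.random() for _ in range(ENC_DIM)] for _ in audio]

class Projector:
    """Linear map aligning speech features with the LLM embedding space."""
    def __init__(self):
        self.w = [[0.1] * LLM_DIM for _ in range(ENC_DIM)]
    def __call__(self, feats):
        return [[sum(f[i] * self.w[i][j] for i in range(ENC_DIM))
                 for j in range(LLM_DIM)] for f in feats]

def build_llm_input(speech_embeds, task):
    # A task token selects transcription (ASR) vs. translation (ST),
    # so a single shared LLM decoder serves both tasks jointly.
    prompt = {"asr": "<transcribe>", "st": "<translate_de>"}[task]
    return {"prompt": prompt, "embeds": speech_embeds}

audio = [0.0] * 8  # dummy 8-frame utterance
feats = SpeechEncoder().encode(audio)
llm_input = build_llm_input(Projector()(feats), task="st")
print(llm_input["prompt"], len(llm_input["embeds"]))
```

The key design point this illustrates is that both tasks share one decoder; only the conditioning prompt differs, which is what enables multi-task joint optimization over ASR and ST.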
📝 Abstract
Speech Translation (ST) is a machine translation task that involves converting speech signals in one language into the corresponding text in another language. Two main approaches exist for this task: the traditional cascaded pipeline and the more recent end-to-end modeling. This paper explores a combined end-to-end architecture of pre-trained speech encoders and Large Language Models (LLMs) for performing both Automatic Speech Recognition (ASR) and ST simultaneously. Experiments with the English-to-German language pair show that our best model not only achieves better translation results than SeamlessM4T, a large foundational end-to-end, multi-modal translation model, but also matches the performance of a cascaded system with Whisper and NLLB, with a gain of up to 8% in the $\text{COMET}^{\text{DA}}_{22}$ metric.