🤖 AI Summary
This work addresses the challenge of jointly modeling end-to-end automatic speech recognition (ASR) and speech translation (ST). We propose a unified architecture integrating a pretrained speech encoder with a large language model (LLM) as a shared text decoder. Methodologically, we employ multimodal alignment and end-to-end joint training to directly map speech signals to target-language text, while explicitly modeling both task-shared representations and task-specific characteristics for ASR and ST. On English-to-German translation, our model achieves up to an 8% improvement in $\text{COMET}^{\text{DA}}_{22}$ score, outperforming mainstream end-to-end baselines such as SeamlessM4T and matching the performance of cascaded ASR+MT systems. The key contribution is the first deep integration of an LLM as a general-purpose text decoder downstream of a speech encoder, enabling multi-task joint optimization. This design preserves the simplicity of end-to-end modeling while significantly enhancing translation quality and robustness.
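The architecture described above can be sketched at a high level: a pretrained speech encoder produces acoustic features, a projection layer aligns them with the LLM's embedding space, and a task prompt steers the shared decoder toward ASR or ST. This is a minimal illustrative sketch; all class names, dimensions, and prompt tokens below are assumptions for illustration, not the paper's actual implementation.

```python
# Toy sketch of a speech-encoder + LLM pipeline with a shared decoder.
# ENC_DIM/LLM_DIM, class names, and prompt tokens are hypothetical.
import random

ENC_DIM, LLM_DIM = 4, 6  # toy feature/embedding dimensions

class SpeechEncoder:
    """Stand-in for a pretrained speech encoder (assumption, not the paper's)."""
    def encode(self, audio):
        # One ENC_DIM-dim feature vector per audio frame.
        return [[random.random() for _ in range(ENC_DIM)] for _ in audio]

class Projector:
    """Linear map aligning speech features with the LLM embedding space."""
    def __init__(self):
        self.w = [[0.1] * LLM_DIM for _ in range(ENC_DIM)]
    def __call__(self, feats):
        return [[sum(f[i] * self.w[i][j] for i in range(ENC_DIM))
                 for j in range(LLM_DIM)] for f in feats]

def build_llm_input(speech_embeds, task):
    # A task token selects transcription (ASR) vs. translation (ST),
    # so a single shared LLM decoder serves both tasks jointly.
    prompt = {"asr": "<transcribe>", "st": "<translate_de>"}[task]
    return {"prompt": prompt, "embeds": speech_embeds}

audio = [0.0] * 8  # dummy 8-frame utterance
feats = SpeechEncoder().encode(audio)
llm_input = build_llm_input(Projector()(feats), task="st")
print(llm_input["prompt"], len(llm_input["embeds"]))
```

The key design point this illustrates is that both tasks share one decoder; only the conditioning prompt differs, which is what enables multi-task joint optimization over ASR and ST.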
📝 Abstract
Speech Translation (ST) is a machine translation task that involves converting speech signals in one language into the corresponding text in another language. Two main approaches exist for this task: the traditional cascaded pipeline and the more recent end-to-end modeling. This paper explores a combined end-to-end architecture of pre-trained speech encoders and Large Language Models (LLMs) for performing both Automatic Speech Recognition (ASR) and ST simultaneously. Experiments with the English-to-German language pair show that our best model not only achieves better translation results than SeamlessM4T, a large foundational end-to-end, multi-modal translation model, but also matches the performance of a cascaded system with Whisper and NLLB, with a gain of up to 8% in the $\text{COMET}^{\text{DA}}_{22}$ metric.