Streaming Speech-to-Text Translation with a SpeechLLM

📅 2026-05-14

🤖 AI Summary

This work targets two limitations of current speech-to-text translation. Cascaded systems, which chain speech recognition and text-to-text translation, pass recognition errors on to the translation step. Existing SpeechLLMs avoid the cascade and can in principle exploit paralinguistic information in the speech, but they do not truly stream: they either wait for a complete utterance or emit tokens at fixed intervals, which adds latency. The paper proposes an LLM-based architecture for real streaming translation, in which the model maps speech directly to translated text and also decides, token by token, whether it has heard enough audio to emit the next output. Training uses automatic alignments between the input speech and the output text. Across several language pairs, translation quality is close to that of the non-streaming baseline at a latency of only 1–2 seconds.
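The dynamic output timing described in this summary can be pictured as a read/write loop. The Python sketch below is only an illustration under stated assumptions: the `<wait>` and `<eos>` symbols, the `step` callback, and the toy policy are hypothetical, since the summary does not say how the model signals that it needs more audio.

```python
from typing import Callable, List, Sequence, Tuple

WAIT = "<wait>"  # hypothetical symbol: "not enough audio yet"
EOS = "<eos>"    # hypothetical symbol: "translation finished"

# step(audio_heard, output_so_far, is_final_chunk) -> next token, WAIT, or EOS
StepFn = Callable[[Sequence[str], Sequence[str], bool], str]


def stream_translate(
    chunks: Sequence[str], step: StepFn, max_tokens_per_chunk: int = 8
) -> List[Tuple[int, str]]:
    """Feed audio chunk by chunk; after each chunk, let the model emit
    tokens until it asks for more audio. Returns (chunks_heard, token)."""
    heard: List[str] = []
    output: List[str] = []
    events: List[Tuple[int, str]] = []
    for i, chunk in enumerate(chunks):
        heard.append(chunk)                      # READ one more chunk
        is_final = i == len(chunks) - 1
        for _ in range(max_tokens_per_chunk):    # cap guards against runaway output
            token = step(heard, output, is_final)
            if token == EOS:
                return events
            if token == WAIT:
                break                            # model wants more audio
            output.append(token)                 # WRITE a translation token
            events.append((len(heard), token))
    return events


def toy_step(heard: Sequence[str], output: Sequence[str], is_final: bool) -> str:
    """Stand-in for the model: each token needs a fixed number of chunks."""
    plan = [("hello", 2), ("world", 3)]          # (token, chunks required)
    if len(output) == len(plan):
        return EOS
    token, needed = plan[len(output)]
    return token if (len(heard) >= needed or is_final) else WAIT


events = stream_translate(["a0", "a1", "a2", "a3"], toy_step)
print(events)  # [(2, 'hello'), (3, 'world')]
```

In this toy run the first token appears after two chunks and the second after three, rather than at a fixed interval or at the end of the utterance. In a real system the `step` callback would be a forward pass of the SpeechLLM over the audio and text seen so far.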

📝 Abstract

Normally, a system that translates speech into text consists of separate modules for speech recognition and text-to-text translation. Combining those tasks into a SpeechLLM promises to exploit paralinguistic information in the speech and to reduce cascaded errors. But existing SpeechLLM systems are slow since they do not work in a real streaming fashion: they wait for a complete utterance of audio before outputting a translation, or output tokens at fixed intervals, which is not suitable for real applications. This work proposes an LLM-based architecture for real streaming speech-to-text translation. The LLM learns not just to emit output tokens, but also to decide whether it has seen enough audio to do so. The system is trained using automatic alignments of the input speech and the output text. In experiments on different language pairs, the system achieves a translation quality close to the non-streaming baseline, but with a latency of only 1-2 seconds.
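One way to picture training on automatic alignments is to turn each aligned utterance into per-chunk targets that tell the model when to emit and when to wait. The Python sketch below assumes a particular setup that the abstract does not confirm: fixed-length audio chunks, one alignment timestamp per target token, and hypothetical `<wait>` and `<eos>` symbols. The function name `build_targets` and the parameter `chunk_sec` are likewise illustrative.

```python
import math
from typing import List, Sequence, Tuple

WAIT = "<wait>"  # hypothetical symbol: "read more audio"
EOS = "<eos>"    # hypothetical symbol: "translation finished"


def build_targets(
    aligned: Sequence[Tuple[str, float]],  # (target token, time in s when its source audio ends)
    audio_sec: float,
    chunk_sec: float = 0.5,
) -> List[List[str]]:
    """Assign each target token to the first audio chunk that fully
    contains its aligned source speech, then close every chunk with WAIT
    (or EOS for the last chunk)."""
    n_chunks = max(1, math.ceil(audio_sec / chunk_sec))
    per_chunk: List[List[str]] = [[] for _ in range(n_chunks)]
    ready = 0.0
    for token, ready_sec in aligned:
        # Output order is fixed, so a token can never become ready
        # before the token that precedes it (handles word reordering).
        ready = max(ready, ready_sec)
        idx = min(n_chunks - 1, max(0, math.ceil(ready / chunk_sec) - 1))
        per_chunk[idx].append(token)
    return [toks + [WAIT] for toks in per_chunk[:-1]] + [per_chunk[-1] + [EOS]]


targets = build_targets(
    [("Guten", 0.4), ("Morgen", 1.2), ("allerseits", 1.9)], audio_sec=2.0
)
print(targets)
# [['Guten', '<wait>'], ['<wait>'], ['Morgen', '<wait>'], ['allerseits', '<eos>']]
```

The running maximum over alignment times matters for translation in particular: when the target language reorders words, a token aligned to early audio may still have to wait for the tokens that precede it in the output.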

Problem

Research questions and friction points this paper is trying to address.

Streaming Speech-to-Text Translation

Real-time Translation

Latency

SpeechLLM

Speech Translation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming Speech-to-Text Translation

SpeechLLM

Real-time Translation