🤖 AI Summary
This work addresses streaming speech-to-text translation (SST) from English to Chinese and German. We propose a unified end-to-end architecture that integrates a chunkwise causal Wav2Vec 2.0 encoder with a Qwen2.5-7B-Instruct large language model decoder. To balance latency and translation quality, we introduce a two-stage simultaneous training paradigm and a configurable latency multiplier that trades translation quality against both computation-aware and theoretical latency. Robust speech segments are curated from multiple datasets—LibriSpeech, CommonVoice, and VoxPopuli—to improve generalization. Evaluated on the IWSLT 2025 SST task using the ACL60/60 development set, our system achieves 44.3 BLEU (En→Zh) and 25.1 BLEU (En→De), with computation-aware latencies of 2.7 s and 2.3 s and theoretical latencies of 2.2 s and 1.7 s, respectively. To the best of our knowledge, this is the first SST system to combine Wav2Vec 2.0 and Qwen2.5-7B-Instruct for this task.
📝 Abstract
This paper presents CMU's submission to the IWSLT 2025 Simultaneous Speech Translation (SST) task: translating unsegmented English speech into Chinese and German text in a streaming manner. Our end-to-end speech-to-text system integrates a chunkwise causal Wav2Vec 2.0 speech encoder, an adapter, and Qwen2.5-7B-Instruct as the decoder. We apply a two-stage simultaneous training procedure with standard cross-entropy loss on robust speech segments curated from the LibriSpeech, CommonVoice, and VoxPopuli datasets. The model supports adjustable latency through a configurable latency multiplier. On the ACL60/60 development set, our system achieves 44.3 BLEU for English-to-Chinese and 25.1 BLEU for English-to-German, with computation-aware latencies of 2.7 and 2.3 seconds and theoretical latencies of 2.2 and 1.7 seconds, respectively.
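To make the latency-multiplier idea concrete, here is a minimal sketch of a chunk-based streaming read/write schedule. The chunk duration, the multiplier semantics, and the schedule itself are illustrative assumptions for exposition—they are not the system's actual policy, which the abstract does not specify.

```python
# Illustrative chunk-based streaming schedule with a latency multiplier.
# Assumption: the encoder consumes fixed-duration audio chunks, and a
# larger multiplier makes the decoder wait for more chunks before each
# emission, raising theoretical latency in exchange for more context.

CHUNK_SECONDS = 0.32  # assumed acoustic chunk duration (hypothetical)

def emission_schedule(num_chunks: int, latency_multiplier: int) -> list[int]:
    """For each target emission step, return how many source chunks must
    have been read first, capped at the total number of chunks."""
    steps = num_chunks // latency_multiplier + 1
    return [min((step + 1) * latency_multiplier, num_chunks)
            for step in range(steps)]

def theoretical_latency(chunks_read: int) -> float:
    """Seconds of audio consumed before an emission, ignoring compute time
    (computation-aware latency would add the model's processing delay)."""
    return chunks_read * CHUNK_SECONDS
```

For a 10-chunk utterance with multiplier 2, the decoder would emit after reading 2, 4, 6, 8, then all 10 chunks; doubling the multiplier roughly doubles the wait before each emission, which is how a single knob trades quality for latency.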