CMU's IWSLT 2025 Simultaneous Speech Translation System

📅 2025-06-16
🤖 AI Summary
This work addresses simultaneous speech translation (SST) of unsegmented English speech into Chinese and German text. The end-to-end system combines a chunkwise causal Wav2Vec 2.0 speech encoder, an adapter, and Qwen2.5-7B-Instruct as the decoder. It is trained with a two-stage simultaneous training procedure and standard cross-entropy loss on robust speech segments curated from LibriSpeech, CommonVoice, and VoxPopuli. Latency is adjustable through a configurable latency multiplier. On the ACL60/60 development set, the system achieves 44.3 BLEU (En→Zh) and 25.1 BLEU (En→De), with computation-aware latencies of 2.7 s and 2.3 s and theoretical latencies of 2.2 s and 1.7 s, respectively.

📝 Abstract
This paper presents CMU's submission to the IWSLT 2025 Simultaneous Speech Translation (SST) task, which requires translating unsegmented English speech into Chinese and German text in a streaming manner. Our end-to-end speech-to-text system integrates a chunkwise causal Wav2Vec 2.0 speech encoder, an adapter, and Qwen2.5-7B-Instruct as the decoder. We use a two-stage simultaneous training procedure with standard cross-entropy loss on robust speech segments curated from the LibriSpeech, CommonVoice, and VoxPopuli datasets. Our model supports adjustable latency through a configurable latency multiplier. Experimental results demonstrate that our system achieves 44.3 BLEU for English-to-Chinese and 25.1 BLEU for English-to-German translation on the ACL60/60 development set, with computation-aware latencies of 2.7 and 2.3 seconds and theoretical latencies of 2.2 and 1.7 seconds, respectively.
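
The abstract does not spell out how the latency multiplier is applied. The following is a minimal sketch under one plausible assumption: the multiplier scales the amount of audio the system reads before each translation (write) step. The function name `write_times_ms` and the base chunk duration used in the example are hypothetical and are not taken from the paper.

```python
def write_times_ms(total_audio_ms: int, base_chunk_ms: int, latency_multiplier: int) -> list[int]:
    """Return the audio timestamps (in ms) at which a write step is triggered.

    Assumption (not from the paper): the system reads
    base_chunk_ms * latency_multiplier of audio before each write step,
    and performs one final write when the audio ends.
    """
    if base_chunk_ms <= 0 or latency_multiplier <= 0:
        raise ValueError("base_chunk_ms and latency_multiplier must be positive")
    if total_audio_ms < 0:
        raise ValueError("total_audio_ms must be non-negative")

    # Amount of audio consumed between two consecutive write steps.
    segment_ms = base_chunk_ms * latency_multiplier

    # Write after every full segment that ends strictly before the audio does.
    times = list(range(segment_ms, total_audio_ms, segment_ms))

    # Final write once all audio has been read.
    times.append(total_audio_ms)
    return times


if __name__ == "__main__":
    # Hypothetical 5-second utterance with a hypothetical 1-second base chunk.
    for multiplier in (1, 2, 4):
        print(multiplier, write_times_ms(5000, 1000, multiplier))
```

Under this assumption, a larger multiplier means fewer and later write steps: the model sees more context before committing to output, at the cost of higher latency.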
Problem

Research questions and friction points this paper is trying to address.

Translating unsegmented English speech into Chinese and German text in a streaming manner
Supporting adjustable latency in an end-to-end speech-to-text system
Achieving high BLEU scores at low computation-aware and theoretical latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chunkwise causal Wav2Vec 2.0 speech encoder
Qwen2.5-7B-Instruct decoder integration
Configurable latency multiplier adjustment
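
To illustrate what "chunkwise causal" generally means for a speech encoder, here is a minimal sketch of a chunkwise causal attention mask: each frame may attend to every frame in its own chunk and in all earlier chunks, but never to later chunks. The function name and the chunk size in the example are hypothetical, and the encoder described in the paper may differ in its details.

```python
def chunkwise_causal_mask(num_frames: int, chunk_size: int) -> list[list[bool]]:
    """Build a mask where mask[i][j] is True if frame i may attend to frame j.

    Frames are grouped into consecutive chunks of `chunk_size` frames.
    A frame sees its whole chunk plus all earlier chunks (illustrative
    definition; not confirmed as the paper's exact scheme).
    """
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    mask = []
    for i in range(num_frames):
        # Exclusive end index of the chunk containing frame i,
        # clipped to the sequence length for a partial last chunk.
        chunk_end = min((i // chunk_size + 1) * chunk_size, num_frames)
        mask.append([j < chunk_end for j in range(num_frames)])
    return mask


if __name__ == "__main__":
    # Hypothetical example: 5 frames, chunks of 2 frames.
    for row in chunkwise_causal_mask(5, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

Because no frame depends on future chunks, encoder outputs for earlier chunks stay valid as new audio arrives, which is what makes this kind of masking suitable for streaming input.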
Siqi Ouyang
PhD Student, Language Technologies Institute, Carnegie Mellon University
Speech Translation, Large Language Model
Xi Xu
Language Technologies Institute, Carnegie Mellon University, USA
Lei Li
Language Technologies Institute, Carnegie Mellon University, USA