🤖 AI Summary
This work addresses streaming speech-to-text translation (SST) from English to Chinese and German. We propose a unified end-to-end architecture that integrates a chunkwise causal Wav2Vec 2.0 encoder with a Qwen2.5-7B-Instruct large language model decoder. To balance latency and translation quality, we introduce a two-stage simultaneous training paradigm and a configurable latency multiplier that trades translation quality against both computation-aware and theoretical latency. Robust speech segments are curated from multiple datasets—LibriSpeech, CommonVoice, and VoxPopuli—to improve generalization. Evaluated on the IWSLT 2025 SST task using the ACL60/60 development set, our system achieves 44.3 BLEU (En→Zh) and 25.1 BLEU (En→De), with computation-aware latencies of 2.7 s and 2.3 s and theoretical latencies of 2.2 s and 1.7 s, respectively. To the best of our knowledge, this is the first SST system to combine Wav2Vec 2.0 and Qwen2.5-7B-Instruct for this task.
📝 Abstract
This paper presents CMU's submission to the IWSLT 2025 Simultaneous Speech Translation (SST) task: translating unsegmented English speech into Chinese and German text in a streaming manner. Our end-to-end speech-to-text system integrates a chunkwise causal Wav2Vec 2.0 speech encoder, an adapter, and Qwen2.5-7B-Instruct as the decoder. We apply a two-stage simultaneous training procedure with standard cross-entropy loss on robust speech segments curated from the LibriSpeech, CommonVoice, and VoxPopuli datasets. The model supports adjustable latency through a configurable latency multiplier. On the ACL60/60 development set, our system achieves 44.3 BLEU for English-to-Chinese and 25.1 BLEU for English-to-German, with computation-aware latencies of 2.7 and 2.3 seconds and theoretical latencies of 2.2 and 1.7 seconds, respectively.
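To make the latency-multiplier idea concrete, here is a minimal sketch of a chunk-based streaming read/write schedule. The chunk duration, the multiplier semantics, and the schedule itself are illustrative assumptions for exposition—they are not the system's actual policy, which the abstract does not specify.

```python
# Illustrative chunk-based streaming schedule with a latency multiplier.
# Assumption: the encoder consumes fixed-duration audio chunks, and a
# larger multiplier makes the decoder wait for more chunks before each
# emission, raising theoretical latency in exchange for more context.

CHUNK_SECONDS = 0.32  # assumed acoustic chunk duration (hypothetical)

def emission_schedule(num_chunks: int, latency_multiplier: int) -> list[int]:
    """For each target emission step, return how many source chunks must
    have been read first, capped at the total number of chunks."""
    steps = num_chunks // latency_multiplier + 1
    return [min((step + 1) * latency_multiplier, num_chunks)
            for step in range(steps)]

def theoretical_latency(chunks_read: int) -> float:
    """Seconds of audio consumed before an emission, ignoring compute time
    (computation-aware latency would add the model's processing delay)."""
    return chunks_read * CHUNK_SECONDS
```

For a 10-chunk utterance with multiplier 2, the decoder would emit after reading 2, 4, 6, 8, then all 10 chunks; doubling the multiplier roughly doubles the wait before each emission, which is how a single knob trades quality for latency.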