🤖 AI Summary
This paper addresses the practical need for one-to-many end-to-end simultaneous speech translation (SimulST) in real-world multilingual scenarios. To this end, we propose the first unified modeling framework supporting joint multilingual training and real-time decoding. Methodologically, we introduce a novel hybrid synchronous/asynchronous training paradigm: asynchronous multilingual pretraining enhances cross-lingual knowledge transfer, while synchronous fine-tuning preserves low-latency constraints. We additionally design a hybrid unified/separate decoder to balance decoding efficiency and translation quality. We further construct TED-MMST, the first publicly available, multi-way aligned, multilingual end-to-end SimulST benchmark dataset. Experiments demonstrate that our approach achieves a superior trade-off between translation quality (BLEU) and latency (Average Lagging, AL) on TED-MMST. Both the codebase and the TED-MMST dataset are open-sourced.
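The latency side of the quality/latency trade-off is conventionally measured with Average Lagging (AL), introduced with the STACL wait-k policy. Below is a minimal sketch of that standard definition, not code from this paper; `delays[t]` is assumed to be the number of source segments read before emitting target token `t+1`.

```python
def average_lagging(delays, src_len, tgt_len):
    """Average Lagging (AL), per the standard STACL definition (sketch).

    delays: list where delays[t-1] = g(t), the number of source segments
            consumed before the t-th target token is emitted. Assumed to
            reach src_len by the last token.
    """
    gamma = tgt_len / src_len  # target-to-source length ratio
    # tau: first target step at which the full source has been read
    tau = next(t for t, g in enumerate(delays, 1) if g >= src_len)
    # Average gap between actual delay and the ideal diagonal policy
    return sum(delays[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau
```

For a wait-k policy on equal-length source and target, g(t) = min(t + k - 1, |x|), and AL evaluates to exactly k, which matches the intuition that the decoder lags k segments behind the speaker.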
📝 Abstract
Recent studies on end-to-end speech translation (ST) have facilitated the exploration of multilingual end-to-end ST and end-to-end simultaneous ST. In this paper, we investigate end-to-end simultaneous speech translation in a one-to-many multilingual setting, which is closer to real-world applications. We explore a separate decoder architecture and a unified architecture for joint synchronous training in this scenario. To further exploit knowledge transfer across languages, we propose an asynchronous training strategy on the proposed unified decoder architecture. A multi-way aligned multilingual end-to-end ST dataset was curated as a benchmark testbed to evaluate our methods. Experimental results demonstrate the effectiveness of our models on the collected dataset. Our code and data are available at: https://github.com/XiaoMi/TED-MMST.
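To make the real-time decoding setting concrete, the following is a minimal sketch of a generic wait-k simultaneous decoding loop, a standard SimulST policy; it is illustrative only, not the paper's method. `translate_step` is a hypothetical callback standing in for the model, and in a one-to-many system the target language would typically be selected by a language tag conditioning the decoder.

```python
def wait_k_decode(speech_chunks, k, translate_step, eos="</s>"):
    """Generic wait-k read/write loop (illustrative sketch).

    Reads k source chunks before emitting the first target token, then
    alternates one READ per WRITE. `translate_step(source, target)` is a
    hypothetical model call returning the next target token given the
    source read so far and the target prefix.
    """
    source, target = [], []
    i = 0
    while True:
        # READ: consume a chunk while the policy still lags the source
        if len(target) + k > len(source) and i < len(speech_chunks):
            source.append(speech_chunks[i])
            i += 1
            continue
        # WRITE: otherwise emit the next target token
        token = translate_step(source, target)
        if token == eos:
            break
        target.append(token)
    return target
```

The same loop runs for each target language in a one-to-many setup; the separate-decoder and unified-decoder architectures discussed in the paper differ in whether each language has its own WRITE module or all languages share one.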