🤖 AI Summary
Simultaneous interpretation (SI) faces critical bottlenecks, including inaccurate speech transcription, high latency, speaker diarization errors, target-language expansion ("translation inflation"), and a lack of real-time speech generation, especially in extended dialogues. This paper proposes an end-to-end duplex speech understanding and generation framework that jointly integrates automatic speech recognition (ASR), machine translation (MT), text-to-speech synthesis (TTS), and voice cloning. Leveraging large-scale pretraining and reinforcement learning for joint optimization, the framework preserves the source speaker's vocal characteristics while achieving ultra-low-latency responses. Key contributions include: (i) the first integration of controllable voice cloning into end-to-end SI, effectively resolving multi-speaker confusion and translation inflation; (ii) a roughly 70% reduction in average latency, with cloned-speech latency dropping from about 10 s to 3 s; and (iii) human evaluation showing over 70% correctness, with significant improvements in both translation quality and real-time performance over leading commercial systems.
📄 Abstract
Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry, with product-level automatic systems long plagued by intractable challenges: subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated-speech inflation, especially in long-form discourse. In this study, we introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities. As a fully operational product-level solution, Seed-LiveInterpret 2.0 tackles these challenges head-on through our novel duplex speech-to-speech understanding-generating framework. Experimental results demonstrate that, through large-scale pretraining and reinforcement learning, the model achieves a significantly better balance between translation accuracy and latency, validated by human interpreters to exceed 70% correctness in complex scenarios. Notably, Seed-LiveInterpret 2.0 outperforms commercial SI solutions by significant margins in translation quality, while slashing the average latency of cloned speech from nearly 10 seconds to a near-real-time 3 seconds, a roughly 70% reduction that drastically enhances practical usability.