UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses expressive speech-to-speech translation (S2ST), which aims to jointly achieve semantic fidelity, speaker-identity preservation, and emotional-style transfer. The authors propose UniSS, a unified single-stage framework whose speech semantic and style modeling integrates seamlessly with text-based LLMs, yielding a unified text-speech language model. A cross-modal chain-of-thought prompting process transfers the translation capabilities of LLMs to the speech modality by progressively aligning audio semantics with text while preserving style in the decoded output. The authors also construct and release UniST, a large-scale (44.8k-hour), high-quality expressive S2ST dataset. Experiments show that UniSS significantly outperforms prior methods in translation fidelity and speech quality while maintaining consistency in voice, emotion, and duration.

📝 Abstract
The ultimate goal of expressive speech-to-speech translation (S2ST) is to accurately translate spoken content while preserving the speaker identity and emotional style. However, progress in this field is largely hindered by three key challenges: the scarcity of paired speech data that retains expressive styles, the complexity of multi-stage processing pipelines, and the limited transfer of translation capabilities from large language models (LLMs). In this work, we address these challenges by introducing UniSS, a novel single-stage framework for expressive S2ST. Our approach features carefully designed speech semantic and style modeling, enabling seamless integration with existing text-based LLM frameworks to develop a unified text-speech language model. To transfer translation capabilities from text to speech, we propose a cross-modal chain-of-thought prompting process that progressively aligns audio semantics with text and ensures style preservation in the decoded results. Furthermore, we construct and release a large-scale, high-quality expressive S2ST dataset, UniST, comprising 44.8k hours of data. Experimental results show that UniSS significantly outperforms previous methods in translation fidelity and speech quality while preserving voice, emotion, and duration consistency. Our work establishes a simpler and more effective paradigm for building the next generation of expressive S2ST systems. Audio samples are available at https://cmots.github.io/uniss-demo.
Problem

Research questions and friction points this paper is trying to address.

Translating speech content while preserving speaker identity and emotional style
Addressing scarcity of paired expressive speech data and complex multi-stage pipelines
Transferring the translation capabilities of text-based LLMs to the speech modality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-stage framework for expressive speech translation
Cross-modal chain-of-thought prompting for text-speech alignment
Large-scale expressive dataset construction with style preservation
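As a rough illustration only (the paper does not specify its prompt format here), the cross-modal chain-of-thought described above can be pictured as a staged decoding order: input speech tokens are first grounded in source text, then translated, and only then rendered as style-preserving target speech tokens. All tag names and the `build_cot_prompt` helper below are hypothetical.

```python
# Hypothetical sketch of a cross-modal chain-of-thought prompt for S2ST.
# None of these tag names come from the paper; they only illustrate the
# "speech -> source text -> target text -> styled speech" staging.

def build_cot_prompt(src_speech_tokens, style_tokens):
    """Assemble a staged prompt: transcribe first, then translate,
    then synthesize target speech tokens conditioned on the style."""
    stages = [
        "<speech_in>" + " ".join(src_speech_tokens) + "</speech_in>",
        "<style>" + " ".join(style_tokens) + "</style>",
        "<transcribe>",   # stage 1: align audio semantics with source text
        "<translate>",    # stage 2: text-to-text translation (LLM strength)
        "<synthesize>",   # stage 3: emit style-preserving target speech tokens
    ]
    return "\n".join(stages)

prompt = build_cot_prompt(["s1", "s2"], ["e7"])
print(prompt)
```

The point of the staging is that the middle step is an ordinary text-to-text translation, which is exactly where a pretrained LLM is strongest; the speech-token stages bracket it on either side.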
Sitong Cheng
Hong Kong University of Science and Technology
Weizhen Bian
Hong Kong University of Science and Technology
Xinsheng Wang
Hong Kong University of Science and Technology (HKUST)
speech synthesis, singing voice synthesis, voice conversion
Ruibin Yuan
HKUST
Artificial Intelligence, Music Generation, Music Information Retrieval, Computer Music
Jianyi Chen
Hong Kong University of Science and Technology
Shunshun Yin
Soul AI Lab
Yike Guo
Hong Kong University of Science and Technology
Wei Xue
Hong Kong University of Science and Technology