MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

📅 2026-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of current speech-to-speech translation (S2ST) systems: despite high semantic accuracy, they typically drop non-verbal vocalizations (NVs) such as laughter and crying, which weakens expressiveness and practical utility. The study makes three contributions: a synthesis pipeline for building expressive training data at scale; MoVE, a Mixture-of-LoRA-Experts architecture whose soft-weighting router blends expressive-specialized adapters to capture hybrid expressive states; and evidence that pretrained AudioLLMs make the approach data-efficient, with 30 minutes of curated data sufficing for strong performance. On English-Chinese S2ST, MoVE reproduces the target NVs in 76% of cases, whereas existing S2ST systems preserve at most 14%. In human evaluation, it also receives the highest naturalness and emotional-fidelity ratings among the compared systems.

📝 Abstract

Recent Speech-to-Speech Translation (S2ST) systems achieve strong semantic accuracy yet consistently strip away non-verbal vocalizations (NVs), such as laughter and crying, that convey pragmatic intent, which severely limits real-world utility. We address this via three contributions. First, we propose a synthesis pipeline for building scalable expressive datasets to overcome data scarcity. Second, we propose MoVE, a Mixture-of-LoRA-Experts architecture with expressive-specialized adapters and a soft-weighting router that blends experts to capture hybrid expressive states. Third, we show that pretrained AudioLLMs enable striking data efficiency: 30 minutes of curated data is enough for strong performance. On English-Chinese S2ST, compared with strong baselines, MoVE reproduces target NVs in 76% of cases and achieves the highest human-rated naturalness and emotional fidelity among all compared systems, whereas existing S2ST systems preserve at most 14% of NVs.
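
To make the idea of a soft-weighting router over LoRA experts concrete, here is a minimal sketch of one linear layer, written in plain Python. It is not the paper's implementation: the abstract does not specify the router's form, where the adapters attach, or any hyperparameters, so the class name `SoftMoLoRALinear`, the linear-plus-softmax router, the rank, and the `alpha / rank` scaling (a common LoRA convention) are all assumptions made for illustration. The point it shows is that every expert contributes in proportion to its gate weight, rather than a single expert being selected.

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    mx = max(z)
    e = [math.exp(x - mx) for x in z]
    s = sum(e)
    return [x / s for x in e]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class SoftMoLoRALinear:
    """Hypothetical layer: frozen base weight plus softly blended LoRA experts."""

    def __init__(self, d_in, d_out, n_experts, rank, alpha=8.0, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(d_out, d_in, rng)           # frozen base weight
        self.router = rand_matrix(n_experts, d_in, rng)  # assumed linear router
        self.A = [rand_matrix(rank, d_in, rng) for _ in range(n_experts)]
        # Usual LoRA convention: B starts at zero, so the adapters are a no-op
        # until they are trained.
        self.B = [[[0.0] * rank for _ in range(d_out)] for _ in range(n_experts)]
        self.scale = alpha / rank

    def gate(self, x):
        """Soft weights over all experts (no top-k selection)."""
        return softmax(matvec(self.router, x))

    def forward(self, x):
        y = matvec(self.W, x)
        for g_k, A_k, B_k in zip(self.gate(x), self.A, self.B):
            delta = matvec(B_k, matvec(A_k, x))          # low-rank update B_k A_k x
            y = [yi + g_k * self.scale * di for yi, di in zip(y, delta)]
        return y


if __name__ == "__main__":
    layer = SoftMoLoRALinear(d_in=4, d_out=3, n_experts=3, rank=2)
    x = [0.5, -1.0, 0.25, 2.0]
    print("gate weights:", layer.gate(x))
    print("output:", layer.forward(x))
```

Because the gate is a softmax, an input that falls between two expressive categories would draw on several adapters at once, which is one plausible reading of "blends experts for capturing hybrid expressive states". A hard top-1 router would instead commit to a single expert per input.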

Problem

Research questions and friction points this paper is trying to address.

Speech-to-Speech Translation

Non-verbal Vocalizations

Expressive Speech

Emotional Fidelity

Pragmatic Intent

Innovation

Methods, ideas, or system contributions that make the work stand out.

Speech-to-Speech Translation

Non-verbal Vocalizations

Mixture-of-Experts