Marco-Voice Technical Report

📅 2025-08-04

🤖 AI Summary

This work addresses the challenge of jointly preserving speaker identity and enabling fine-grained emotional control in multilingual, multi-emotion text-to-speech synthesis. We propose a speaker-emotion disentangled TTS framework that explicitly separates speaker and emotion representations via in-batch contrastive learning and introduces Rotational Emotion Embedding (REE) to support continuous, fine-grained control of emotional style. The neural acoustic model combines emotion embedding modulation with disentanglement constraints. To support training and evaluation, we construct CSEMOTIONS, a high-quality Mandarin multi-emotion speech dataset. Experiments demonstrate significant improvements over baselines across key metrics: MOS (4.21), speaker similarity (SIM: 0.93), and emotion classification accuracy (87.6%). The method models speaker consistency and expressive emotional variation in a single framework without compromising naturalness.

📝 Abstract

This paper presents a multifunctional speech synthesis system that integrates voice cloning and emotion-controllable speech synthesis within a unified framework. The goal of this work is to address longstanding challenges in achieving highly expressive, controllable, and natural speech generation that faithfully preserves speaker identity across diverse linguistic and emotional contexts. Our approach introduces an effective speaker-emotion disentanglement mechanism with in-batch contrastive learning, enabling independent manipulation of speaker identity and emotional style, as well as a rotational emotion embedding integration method for smooth emotion control. To support comprehensive training and evaluation, we construct CSEMOTIONS, a high-quality emotional speech dataset containing 10 hours of Mandarin speech from six professional speakers across seven emotional categories. Extensive experiments demonstrate that our system, Marco-Voice, achieves substantial improvements in both objective and subjective metrics. Comprehensive evaluations and analyses show that Marco-Voice delivers competitive performance in terms of speech clarity and emotional richness, representing a substantial advance in the field of expressive neural speech synthesis.
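
The abstract names in-batch contrastive learning but does not give the loss. As a rough illustration only, the sketch below shows a generic supervised contrastive loss computed within one batch: embeddings that share a label (for example, the same speaker) are pulled together, and all other items in the batch act as negatives. The function name, the temperature value, and the use of NumPy are assumptions for this example, not details taken from the paper.

```python
import numpy as np

def in_batch_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over a single batch.

    Hypothetical helper for illustration; not the paper's exact objective.
    embeddings: array of shape (batch, dim)
    labels:     array of shape (batch,), e.g. speaker IDs or emotion IDs
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    n = embeddings.shape[0]

    # Cosine similarity between all pairs, scaled by the temperature.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature

    # An item is never its own positive or negative.
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)

    # Row-wise log-softmax over the other items in the batch.
    sim_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - sim_max - np.log(
        np.exp(sim - sim_max).sum(axis=1, keepdims=True)
    )

    # Positives: other batch items with the same label.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_counts = pos_mask.sum(axis=1)
    valid = pos_counts > 0
    if not valid.any():
        return 0.0  # no anchor has a positive in this batch

    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_counts[valid]).mean())
```

Under this reading, the same function could be applied twice per batch, once to speaker embeddings with speaker labels and once to emotion embeddings with emotion labels, so that each embedding space clusters by its own attribute. How Marco-Voice actually combines the terms is not stated in this summary.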

Problem

Research questions and friction points this paper is trying to address.

Achieving expressive, controllable, natural speech synthesis

Preserving speaker identity across diverse contexts

Disentangling speaker identity and emotional style

Innovation

Methods, ideas, or system contributions that make the work stand out.

Speaker-emotion disentanglement with contrastive learning

Rotational emotional embedding for smooth control

High-quality emotional speech dataset CSEMOTIONS
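
This summary does not define how the rotational emotion embedding is computed, so the following is not the paper's formulation. It is a minimal sketch of one generic way to get smooth, continuous control over an emotion direction: rotating a unit-norm neutral embedding toward a unit-norm emotion embedding by spherical interpolation. The function name `rotate_toward` and the `strength` parameter are hypothetical and exist only for this example.

```python
import numpy as np

def rotate_toward(neutral, emotion, strength):
    """Rotate a neutral embedding toward an emotion embedding.

    Illustrative spherical interpolation (slerp); an assumption made for
    this example, not the REE method from the paper.
    strength = 0.0 returns the neutral direction,
    strength = 1.0 returns the emotion direction.
    """
    a = np.asarray(neutral, dtype=float)
    b = np.asarray(emotion, dtype=float)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)

    # Angle between the two unit vectors.
    cos_theta = np.clip(a @ b, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    sin_theta = np.sin(theta)

    # Parallel or antiparallel vectors: the rotation plane is undefined,
    # so fall back to the neutral direction.
    if sin_theta < 1e-8:
        return a.copy()

    w_neutral = np.sin((1.0 - strength) * theta) / sin_theta
    w_emotion = np.sin(strength * theta) / sin_theta
    return w_neutral * a + w_emotion * b
```

Because the result stays on the unit sphere for every value of `strength`, intermediate settings change the direction of the embedding without changing its norm, which is one plausible reason to prefer a rotation over a plain linear blend.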
