🤖 AI Summary
Current long-form speech generation models are constrained by binary turn-taking, limiting flexible, unbounded, multi-speaker (≤8 speakers) human-like dialogue synthesis. To address this, we propose E2E-Transformer-DiT, the first end-to-end unified architecture integrating Transformer and diffusion modeling for joint semantic, prosodic, and acoustic representation learning. We design a low-bitrate, 12.5 Hz multi-task MM-Tokenizer with MMSE-based quantization for efficient latent encoding, and incorporate large-scale text perturbation to enhance front-end robustness. Evaluated on Seed-TTS-Eval and multi-speaker long-dialogue cloning tasks, our method achieves state-of-the-art performance, significantly improving prosodic coherence, speaker-specific rhythmic diversity, paralinguistic naturalness, and speech intelligibility. It supports zero-shot cross-lingual voice cloning across Chinese, English, Japanese, and Korean.
📄 Abstract
Large speech generation models are evolving from single-speaker, short-sentence synthesis to multi-speaker, long-conversation generation. Current long-form speech generation models are predominantly constrained to dyadic, turn-based interactions. To address this, we introduce JoyVoice, a novel anthropomorphic foundation model designed for flexible, boundary-free synthesis of up to eight speakers. Unlike conventional cascaded systems, JoyVoice employs a unified E2E-Transformer-DiT architecture that feeds autoregressive hidden representations directly into the diffusion module, enabling holistic end-to-end optimization. We further propose an MM-Tokenizer operating at a low token rate of 12.5 Hz, which integrates multi-task semantic and MMSE losses to model both semantic and acoustic information effectively. Additionally, the model incorporates robust text front-end processing via large-scale data perturbation. Experiments show that JoyVoice achieves state-of-the-art results in multilingual generation (Chinese, English, Japanese, Korean) and zero-shot voice cloning. JoyVoice achieves top-tier results on both the Seed-TTS-Eval benchmark and multi-speaker long-form conversational voice-cloning tasks, demonstrating superior audio quality and generalization. It delivers significant improvements in prosodic continuity for long-form speech, rhythmic richness in multi-speaker conversations, and paralinguistic naturalness, in addition to superior intelligibility. We encourage readers to listen to the demo at https://jea-speech.github.io/JoyVoice
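To make the MM-Tokenizer's multi-task objective concrete, the sketch below combines a semantic cross-entropy term with an MMSE reconstruction term, as the abstract describes. All shapes, the equal default weighting `lam`, and the function names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def semantic_ce(logits, targets):
    """Cross-entropy between predicted semantic-token logits and target ids.

    logits: (T, V) unnormalized scores; targets: (T,) integer token ids.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def mmse_loss(reconstructed, reference):
    """MMSE term: mean squared error on acoustic feature reconstructions."""
    return np.mean((reconstructed - reference) ** 2)

def multitask_loss(logits, targets, recon, ref, lam=1.0):
    # lam is an assumed balancing weight between the semantic and MMSE terms
    return semantic_ce(logits, targets) + lam * mmse_loss(recon, ref)

# Toy example with random tensors standing in for model outputs.
rng = np.random.default_rng(0)
T, V, D = 8, 16, 4  # frames, semantic vocab size, acoustic feature dim
loss = multitask_loss(rng.normal(size=(T, V)),
                      rng.integers(0, V, size=T),
                      rng.normal(size=(T, D)),
                      rng.normal(size=(T, D)))
print(float(loss))
```

Jointly minimizing both terms is what lets a single tokenizer carry semantic content (via the token-prediction term) and acoustic detail (via the reconstruction term), rather than training two separate codecs.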