JoyVoice: Long-Context Conditioning for Anthropomorphic Multi-Speaker Conversational Synthesis

πŸ“… 2025-12-22
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Current long-form speech generation models are constrained by binary turn-taking, limiting flexible, unbounded, and multi-speaker (≀8) human-like dialogue synthesis. To address this, we propose E2E-Transformer-DiTβ€”the first end-to-end unified architecture integrating Transformer and diffusion modeling for joint semantic, prosodic, and acoustic representation learning. We design a 12.5 Hz low-bit multi-task MM-Tokenizer with MMSE-based quantization for efficient latent encoding, and incorporate large-scale text perturbation to enhance frontend robustness. Evaluated on Seed-TTS-Eval and multi-speaker long-dialogue cloning tasks, our method achieves state-of-the-art performance, significantly improving prosodic coherence, speaker-specific rhythmic diversity, paralinguistic naturalness, and speech intelligibility. It supports zero-shot cross-lingual voice cloning across Chinese, English, Japanese, and Korean.

πŸ“ Abstract
Large speech generation models are evolving from single-speaker, short-sentence synthesis to multi-speaker, long-conversation generation. Current long-form speech generation models are predominantly constrained to dyadic, turn-based interactions. To address this, we introduce JoyVoice, a novel anthropomorphic foundation model designed for flexible, boundary-free synthesis of up to eight speakers. Unlike conventional cascaded systems, JoyVoice employs a unified E2E-Transformer-DiT architecture that uses autoregressive hidden representations directly as diffusion inputs, enabling holistic end-to-end optimization. We further propose an MM-Tokenizer operating at a low bitrate of 12.5 Hz, which integrates multitask semantic and MMSE losses to model both semantic and acoustic information effectively. Additionally, the model incorporates robust text front-end processing via large-scale data perturbation. Experiments show that JoyVoice achieves state-of-the-art results in multilingual generation (Chinese, English, Japanese, Korean) and zero-shot voice cloning. JoyVoice attains top-tier results on both the Seed-TTS-Eval benchmark and multi-speaker long-form conversational voice cloning tasks, demonstrating superior audio quality and generalization. It delivers significant improvements in prosodic continuity for long-form speech, rhythmic richness in multi-speaker conversations, and paralinguistic naturalness, alongside superior intelligibility. We encourage readers to listen to the demo at https://jea-speech.github.io/JoyVoice
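The abstract's central design choice, feeding the autoregressive transformer's hidden states directly into the diffusion decoder rather than passing discrete tokens between separately trained stages, can be illustrated with a toy sketch. Everything below (dimensions, layer shapes, function names) is hypothetical and only illustrates the conditioning pattern, not the paper's actual implementation:

```python
import numpy as np

# Toy sketch of the E2E-Transformer-DiT idea: the AR stage's hidden state
# conditions the diffusion stage directly, with no discrete token bottleneck
# in between, so the whole stack could be optimized end to end.
rng = np.random.default_rng(0)
D_HID, D_ACOUSTIC, SEQ = 16, 8, 5

# 1) Autoregressive stage: one hidden vector per generated step.
W_ar = rng.normal(size=(D_HID, D_HID))
def ar_step(prev_hidden):
    return np.tanh(prev_hidden @ W_ar)

# 2) Diffusion stage: a toy denoiser; conditioning enters additively,
#    loosely in the style of DiT conditioning.
W_cond = rng.normal(size=(D_HID, D_ACOUSTIC))
W_noise = rng.normal(size=(D_ACOUSTIC, D_ACOUSTIC))
def denoise(noisy_acoustic, hidden):
    return np.tanh(noisy_acoustic @ W_noise + hidden @ W_cond)

# End-to-end rollout: AR hidden states flow straight into the denoiser.
hidden = rng.normal(size=D_HID)
frames = []
for _ in range(SEQ):
    hidden = ar_step(hidden)
    x = rng.normal(size=D_ACOUSTIC)  # start each frame from noise
    for _ in range(3):               # a few denoising iterations
        x = denoise(x, hidden)
    frames.append(x)

mel_like = np.stack(frames)
print(mel_like.shape)  # SEQ frames of D_ACOUSTIC acoustic features
```

The contrast with a cascaded system is that here the conditioning vector `hidden` is continuous: quantizing it into tokens before the diffusion stage would block gradient flow between the two stages.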
Problem

Research questions and friction points this paper is trying to address.

Enables multi-speaker conversational synthesis beyond dyadic turn-based interactions
Integrates semantic and acoustic modeling via low-bitrate multitask tokenizer
Achieves state-of-the-art multilingual generation and zero-shot voice cloning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified E2E-Transformer-DiT architecture for holistic optimization
Low bitrate MM-Tokenizer with multitask semantic and acoustic modeling
Robust text front-end via large-scale data perturbation
Fan Yu
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.
Tao Wang
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.
You Wu
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.
Lin Zhu
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.
Wei Deng
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.
Weisheng Han
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.
Wenchao Wang
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.
Lin Hu
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.
Xiangyu Liang
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.
Xiaodong He
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.
Yankun Huang
Arizona State University
Yu Gu
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.
Yuan Liu
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.
Yuxuan Wang
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.
Zhangyu Xiao
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.
Ziteng Wang
University of Texas at Austin
Programming languages, formal methods, verification, program synthesis
Boya Dong
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.
Feng Dang
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.
Jinming Chen
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.
Jingdong Li
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.
Jun Wang
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.
Yechen Jin
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.
Yuan Zhang
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.
Zhengyan Sheng
University of Science and Technology of China
Speech Synthesis, Multimodality-driven Speaker Generation
Xin Wang
SpeechTeam, JD Speech Lab, JD Explore Academy, JD.com Inc.