🤖 AI Summary
Current long-form speech generation models are constrained by binary turn-taking, limiting flexible, unbounded, multi-speaker (≤8 speakers) human-like dialogue synthesis. To address this, we propose E2E-Transformer-DiT, the first end-to-end unified architecture integrating Transformer and diffusion modeling for joint semantic, prosodic, and acoustic representation learning. We design a low-bitrate, 12.5 Hz multi-task MM-Tokenizer with MMSE-based quantization for efficient latent encoding, and incorporate large-scale text perturbation to enhance front-end robustness. Evaluated on Seed-TTS-Eval and multi-speaker long-dialogue cloning tasks, our method achieves state-of-the-art performance, significantly improving prosodic coherence, speaker-specific rhythmic diversity, paralinguistic naturalness, and speech intelligibility. It supports zero-shot cross-lingual voice cloning across Chinese, English, Japanese, and Korean.
📄 Abstract
Large speech generation models are evolving from single-speaker, short-sentence synthesis to multi-speaker, long-conversation generation. Current long-form speech generation models are predominantly constrained to dyadic, turn-based interactions. To address this, we introduce JoyVoice, a novel anthropomorphic foundation model designed for flexible, boundary-free synthesis of up to eight speakers. Unlike conventional cascaded systems, JoyVoice employs a unified E2E-Transformer-DiT architecture that feeds autoregressive hidden representations directly into the diffusion module, enabling holistic end-to-end optimization. We further propose an MM-Tokenizer operating at a low token rate of 12.5 Hz, which integrates multi-task semantic and MMSE losses to model both semantic and acoustic information effectively. Additionally, the model incorporates robust text front-end processing via large-scale data perturbation. Experiments show that JoyVoice achieves state-of-the-art results in multilingual generation (Chinese, English, Japanese, Korean) and zero-shot voice cloning. JoyVoice achieves top-tier results on both the Seed-TTS-Eval benchmark and multi-speaker long-form conversational voice-cloning tasks, demonstrating superior audio quality and generalization. It delivers significant improvements in prosodic continuity for long-form speech, rhythmic richness in multi-speaker conversations, and paralinguistic naturalness, in addition to superior intelligibility. We encourage readers to listen to the demo at https://jea-speech.github.io/JoyVoice
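To make the MM-Tokenizer's multi-task objective concrete, the sketch below combines a semantic cross-entropy term with an MMSE reconstruction term, as the abstract describes. All shapes, the equal default weighting `lam`, and the function names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def semantic_ce(logits, targets):
    """Cross-entropy between predicted semantic-token logits and target ids.

    logits: (T, V) unnormalized scores; targets: (T,) integer token ids.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def mmse_loss(reconstructed, reference):
    """MMSE term: mean squared error on acoustic feature reconstructions."""
    return np.mean((reconstructed - reference) ** 2)

def multitask_loss(logits, targets, recon, ref, lam=1.0):
    # lam is an assumed balancing weight between the semantic and MMSE terms
    return semantic_ce(logits, targets) + lam * mmse_loss(recon, ref)

# Toy example with random tensors standing in for model outputs.
rng = np.random.default_rng(0)
T, V, D = 8, 16, 4  # frames, semantic vocab size, acoustic feature dim
loss = multitask_loss(rng.normal(size=(T, V)),
                      rng.integers(0, V, size=T),
                      rng.normal(size=(T, D)),
                      rng.normal(size=(T, D)))
print(float(loss))
```

Jointly minimizing both terms is what lets a single tokenizer carry semantic content (via the token-prediction term) and acoustic detail (via the reconstruction term), rather than training two separate codecs.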