🤖 AI Summary
Existing TTS methods struggle to synthesize long-form, multi-speaker, spontaneous podcast speech, primarily due to limitations in modeling extended contextual dependencies and in capturing conversational naturalness. This paper introduces MoonCast, the first zero-shot podcast speech synthesis framework. The approach comprises three core components: (1) a script generator that explicitly models spontaneity (e.g., fillers, repetitions, and discourse markers) for authentic podcast narration; (2) a long-context speech-language model jointly optimized for cross-speaker prosodic consistency and acoustic speaker adaptation; and (3) an end-to-end pipeline that converts raw text (TXT/PDF/web) into coherent, multi-character audio exceeding 10 minutes, without requiring any target-speaker recordings. Experiments demonstrate significant improvements over state-of-the-art methods in spontaneity (+24.7%) and coherence (+19.3%), achieving a MOS of 4.21 and zero-shot speaker similarity of 92.3%.
📝 Abstract
Recent advances in text-to-speech synthesis have achieved notable success in generating high-quality short utterances for individual speakers. However, these systems still face challenges when extended to the long, multi-speaker, spontaneous dialogues typical of real-world scenarios such as podcasts. These limitations stem from two primary challenges: 1) long speech: podcasts typically span several minutes, exceeding the upper limit of most existing systems; 2) spontaneity: podcasts are marked by a spontaneous, oral style that contrasts sharply with formal, written contexts, and existing systems often fall short in capturing it. In this paper, we propose MoonCast, a solution for high-quality zero-shot podcast generation that synthesizes natural podcast-style speech from text-only sources (e.g., stories, technical reports, or news in TXT, PDF, or web-URL form) in the voices of unseen speakers. To generate long audio, we adopt a long-context language-model-based audio modeling approach trained on large-scale long-context speech data. To enhance spontaneity, we use a podcast generation module that produces scripts with spontaneous details, which we find empirically to be as crucial as the text-to-speech modeling itself. Experiments demonstrate that MoonCast outperforms baselines, with particularly notable improvements in spontaneity and coherence.
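The abstract describes a two-stage design: a script-generation stage that injects spontaneous details into a multi-speaker script, followed by a long-context speech-language model that renders the whole script in unseen speakers' voices. The sketch below is a purely illustrative mock-up of that data flow; every function name, signature, and data shape here is a hypothetical assumption, not MoonCast's actual API.

```python
# Illustrative sketch of a MoonCast-style two-stage pipeline.
# All names and behaviors here are assumptions for exposition only.

def generate_podcast_script(source_text: str, speakers: list[str]) -> list[dict]:
    """Stage 1: convert raw text into a multi-speaker script with
    spontaneous details (fillers, repetitions, discourse markers)."""
    turns = []
    for i, sentence in enumerate(source_text.split(". ")):
        speaker = speakers[i % len(speakers)]
        # A real script generator would use an LLM to add spontaneity;
        # a fixed filler word stands in for that here.
        turns.append({"speaker": speaker, "text": f"Um, {sentence.strip()}"})
    return turns

def synthesize_long_audio(script: list[dict], reference_clips: dict) -> bytes:
    """Stage 2: a long-context speech-language model would render the
    entire script in one pass, conditioning each turn on a short
    reference clip of the unseen target speaker (zero-shot cloning)."""
    # Placeholder: emit a per-turn marker instead of real audio samples.
    return b"".join(
        f"[{t['speaker']}:{len(t['text'])}]".encode() for t in script
    )

# End-to-end: text document in, multi-speaker "audio" out.
script = generate_podcast_script(
    "Welcome to the show. Today we discuss zero-shot TTS", ["host", "guest"]
)
audio = synthesize_long_audio(script, {"host": b"clip_a", "guest": b"clip_b"})
```

The key design point the paper emphasizes is that Stage 1 is not a formality: scripts carrying spontaneous detail are reported to matter as much for perceived naturalness as the speech model itself.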