MoonCast: High-Quality Zero-Shot Podcast Generation

📅 2025-03-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing TTS methods struggle to synthesize long-form, multi-speaker, spontaneous podcast speech, primarily due to limitations in modeling extended contextual dependencies and capturing conversational naturalness. This paper introduces the first zero-shot podcast speech synthesis framework. Our approach comprises three core components: (1) a novel script generator that explicitly models spontaneity—e.g., fillers, repetitions, and discourse markers—for authentic podcast narration; (2) a long-context speech-language model jointly optimized for cross-speaker prosodic consistency and acoustic speaker adaptation; and (3) an end-to-end pipeline that converts raw text (TXT/PDF/web) into coherent, multi-character audio exceeding 10 minutes—without requiring any target-speaker recordings. Experiments demonstrate significant improvements over state-of-the-art methods in spontaneity (+24.7%) and coherence (+19.3%), achieving a MOS of 4.21 and zero-shot speaker similarity of 92.3%.

Technology Category

Application Category

📝 Abstract
Recent advances in text-to-speech synthesis have achieved notable success in generating high-quality short utterances for individual speakers. However, these systems still face challenges when extending their capabilities to long, multi-speaker, and spontaneous dialogues, typical of real-world scenarios such as podcasts. These limitations arise from two primary challenges: 1) long speech: podcasts typically span several minutes, exceeding the upper limit of most existing work; 2) spontaneity: podcasts are marked by their spontaneous, oral nature, which sharply contrasts with formal, written contexts; existing works often fall short in capturing this spontaneity. In this paper, we propose MoonCast, a solution for high-quality zero-shot podcast generation, aiming to synthesize natural podcast-style speech from text-only sources (e.g., stories, technical reports, news in TXT, PDF, or Web URL formats) using the voices of unseen speakers. To generate long audio, we adopt a long-context language model-based audio modeling approach utilizing large-scale long-context speech data. To enhance spontaneity, we utilize a podcast generation module to generate scripts with spontaneous details, which have been empirically shown to be as crucial as the text-to-speech modeling itself. Experiments demonstrate that MoonCast outperforms baselines, with particularly notable improvements in spontaneity and coherence.
Problem

Research questions and friction points this paper is trying to address.

Generates long, multi-speaker podcast-style speech
Captures spontaneity in oral, informal dialogues
Uses text-only sources for zero-shot podcast generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-context language model for audio generation
Podcast generation module enhances spontaneity
Zero-shot synthesis using unseen speakers' voices
🔎 Similar Papers
No similar papers found.