🤖 AI Summary
Existing conversational TTS systems require the complete dialogue text before synthesis and produce a single, non-streaming audio stream mixing all voices, which limits real-time interactivity; they also suffer from unstable synthesis, inaccurate speaker switching, and discontinuous prosody. To address these issues, the authors propose FireRedTTS-2, a streaming multi-speaker TTS framework for long dialogues built on: (1) a 12.5 Hz low-latency speech tokenizer paired with interleaved text-speech modeling; (2) a dual-transformer architecture in which a large decoder-only model ensures semantic coherence while a compact transformer completes the remaining token layers, reducing latency and capturing fine-grained prosody; and (3) context-aware prosody integration. Experiments show significant improvements over MoonCast, Zipvoice-Dialogue, and MOSS-TTSD in intelligibility, speaker-switching accuracy, and naturalness. The method enables high-fidelity streaming synthesis, integrates seamlessly with mainstream dialogue systems, and suits interactive applications such as podcasts and chatbots.
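As a rough illustration of why the low 12.5 Hz token rate helps with long dialogues, the number of speech tokens per codebook layer grows linearly with audio duration, so halving the rate halves the sequence length the transformer must model. This is a minimal sketch; the 25 Hz comparison rate below is an assumed baseline for contrast, not a figure from the paper.

```python
# Speech-token count per codebook layer for a given audio duration.
# The 12.5 Hz rate is from the summary; 25 Hz is an assumed comparison point.
def tokens_per_layer(duration_s: float, rate_hz: float) -> int:
    return int(duration_s * rate_hz)

ten_minutes = 10 * 60  # seconds
print(tokens_per_layer(ten_minutes, 12.5))  # → 7500
print(tokens_per_layer(ten_minutes, 25.0))  # → 15000, twice the sequence length
```

Shorter token sequences both speed up training/inference and extend the maximum dialogue length that fits in a fixed context window.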
📝 Abstract
Current dialogue generation approaches typically require the complete dialogue text before synthesis and produce a single, inseparable audio stream containing all voices, making them unsuitable for interactive chat; moreover, they suffer from unstable synthesis, inaccurate speaker transitions, and incoherent prosody. In this work, we present FireRedTTS-2, a long-form streaming TTS system for multi-speaker dialogue generation that delivers stable, natural speech with reliable speaker switching and context-aware prosody. A new 12.5 Hz streaming speech tokenizer accelerates training and inference, extends the maximum dialogue length, encodes richer semantics to stabilize text-to-token modeling, and supports high-fidelity streaming generation for real-time applications. We adopt a text-speech interleaved format, concatenating speaker-labeled text with aligned speech tokens in chronological order, and model it with a dual transformer: a large decoder-only transformer predicts tokens at the first layer, and a smaller one completes the subsequent layers. Experimental results show that FireRedTTS-2 integrates seamlessly with chat frameworks and, with minimal fine-tuning, produces emotionally expressive speech guided by implicit contextual cues. In podcast generation, it surpasses existing systems, including MoonCast, Zipvoice-Dialogue, and MOSS-TTSD, in objective intelligibility, speaker-turn reliability, and perceived naturalness with context-consistent prosody. Our demos are available at https://fireredteam.github.io/demos/firered_tts_2.
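The text-speech interleaved format can be sketched as follows: each dialogue turn contributes a speaker label, that turn's text tokens, and the speech tokens aligned to the same turn, all concatenated in chronological order into one flat sequence. The tag syntax, token strings, and helper name here are illustrative assumptions, not the system's actual vocabulary or API.

```python
def build_interleaved_sequence(turns):
    """Flatten dialogue turns into one interleaved training sequence.

    turns: list of (speaker_label, text_tokens, speech_tokens), in
    chronological order. Result: [S1] text... speech... [S2] text... speech...
    """
    sequence = []
    for speaker, text_tokens, speech_tokens in turns:
        sequence.append(f"[{speaker}]")   # speaker label (assumed tag format)
        sequence.extend(text_tokens)      # this turn's text tokens
        sequence.extend(speech_tokens)    # speech tokens aligned to this turn
    return sequence

dialogue = [
    ("S1", ["hi", "there"], ["<a101>", "<a57>", "<a33>"]),
    ("S2", ["hello"], ["<a88>", "<a12>"]),
]
seq = build_interleaved_sequence(dialogue)
print(seq[:4])  # → ['[S1]', 'hi', 'there', '<a101>']
```

Keeping each turn's text immediately before its speech tokens lets a decoder-only model condition speech generation on the current turn's text and on all preceding turns, which is what enables context-aware prosody and streaming turn-by-turn synthesis.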