FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications

📅 2024-09-05

🏛️ arXiv.org

📈 Citations: 11

✨ Influential: 1

🤖 AI Summary

Problem: Existing TTS systems struggle with zero-shot voice cloning, emotion-controllable conversational speech generation, and contextual adaptability, all of which are critical for industrial-grade generative speech applications. Method: We propose a TTS framework that combines a semantic-aware speech tokenizer, a language model that generates discrete speech tokens from prompt text and audio, and a two-stage waveform generator that decodes those tokens into high-fidelity audio. Voice cloning works zero-shot from a prompt, few-shot fine-tuning adapts the system to a new voice from about one hour of recordings, and instruction tuning adds control over casual, human-like speech. Contribution/Results: In UGC (user-generated content) and PUGC (professional user-generated content) dubbing scenarios and in chatbot speech generation, the framework produces human-like conversational speech with controllable emotions and paralinguistic behaviors (e.g., laughter, pauses). The experiments show solid in-context learning: the system stably synthesizes high-quality speech consistent with the prompt text and audio.

📝 Abstract

This work proposes FireRedTTS, a foundation text-to-speech framework, to meet the growing demands for personalized and diverse generative speech applications. The framework comprises three parts: data processing, foundation system, and downstream applications. First, we comprehensively present our data processing pipeline, which transforms massive raw audio into a large-scale high-quality TTS dataset with rich annotations and a wide coverage of content, speaking style, and timbre. Next, we propose a language-model-based foundation TTS system. The speech signal is compressed into discrete semantic tokens via a semantic-aware speech tokenizer, and these tokens can be generated by a language model from the prompt text and audio. A two-stage waveform generator then decodes the tokens into a high-fidelity waveform. We present two applications of this system: voice cloning for dubbing and human-like speech generation for chatbots. The experimental results demonstrate the solid in-context learning capability of FireRedTTS, which can stably synthesize high-quality speech consistent with the prompt text and audio. For dubbing, FireRedTTS can clone target voices in a zero-shot way for the UGC scenario and adapt to studio-level expressive voice characters in the PUGC scenario via few-shot fine-tuning with one hour of recordings. Moreover, FireRedTTS achieves controllable human-like speech generation in a casual style with paralinguistic behaviors and emotions via instruction tuning, to better serve spoken chatbots.
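
The data flow described in the abstract (prompt audio to discrete tokens, language model to new tokens, two-stage decoder to waveform) can be sketched as below. This is a toy illustration only, not the FireRedTTS implementation: every function name, the frame size, the vocabulary size, and the arithmetic inside each stage are hypothetical stand-ins chosen so the example runs without any model weights.

```python
from typing import List

FRAME = 4   # samples per token (toy value, an assumption)
VOCAB = 8   # size of the toy speech-token vocabulary (an assumption)


def tokenize_speech(wave: List[float]) -> List[int]:
    """Stand-in for the speech tokenizer: one discrete token per frame."""
    tokens = []
    for start in range(0, len(wave) - FRAME + 1, FRAME):
        frame = wave[start:start + FRAME]
        mean = sum(frame) / FRAME  # assumes audio normalized to [-1, 1]
        bucket = int((mean + 1.0) / 2.0 * VOCAB)
        tokens.append(max(0, min(VOCAB - 1, bucket)))
    return tokens


def language_model(text: str, prompt_tokens: List[int]) -> List[int]:
    """Stand-in for the token language model.

    A real model would predict tokens autoregressively; here each character
    maps to one token, shifted by the last prompt token to mimic conditioning
    on the prompt audio.
    """
    offset = prompt_tokens[-1] if prompt_tokens else 0
    return [(ord(ch) + offset) % VOCAB for ch in text]


def stage_one(tokens: List[int]) -> List[float]:
    """First decoder stage (toy): token ids to one continuous value each."""
    return [(t + 0.5) / VOCAB * 2.0 - 1.0 for t in tokens]


def stage_two(features: List[float]) -> List[float]:
    """Second decoder stage (toy): upsample each feature to FRAME samples."""
    wave: List[float] = []
    for value in features:
        wave.extend([value] * FRAME)
    return wave


def synthesize(text: str, prompt_wave: List[float]) -> List[float]:
    """Chain the three components in the order the abstract describes."""
    prompt_tokens = tokenize_speech(prompt_wave)
    speech_tokens = language_model(text, prompt_tokens)
    return stage_two(stage_one(speech_tokens))


if __name__ == "__main__":
    audio = synthesize("hi", [0.0] * 8)
    print(len(audio), "samples")
```

The point of the sketch is the interface between stages: the language model only ever sees and emits token ids, so the tokenizer and the waveform generator can be trained or replaced independently of it.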

Problem

Research questions and friction points this paper is trying to address.

Develops a foundation TTS framework for personalized, diverse speech generation

Builds a large-scale, high-quality TTS dataset from raw audio, with rich annotations and wide coverage of content, speaking style, and timbre

Enables voice cloning for dubbing and human-like chatbot speech via zero-shot prompting, few-shot fine-tuning, and instruction tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-aware speech tokenizer that compresses speech into discrete semantic tokens

Two-stage waveform generator for high-fidelity output

Instruction tuning for controllable human-like speech
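
To make "discrete tokens" concrete, here is a minimal sketch of one common way to turn continuous speech features into token ids: nearest-neighbor lookup in a codebook (vector quantization). This summary does not say how the FireRedTTS tokenizer quantizes its features, so the technique, the function names, and the tiny hand-written codebook are all assumptions for illustration.

```python
from typing import List, Sequence


def nearest_code(frame: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook vector closest to `frame`
    (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(frame, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature frame to a discrete token id."""
    return [nearest_code(frame, codebook) for frame in frames]


if __name__ == "__main__":
    # Hypothetical 2-dimensional codebook with three entries.
    codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
    frames = [[0.9, 1.2], [0.1, -0.1], [-0.8, 0.7]]
    print(quantize(frames, codebook))  # [1, 0, 2]
```

In a trained tokenizer the codebook is learned rather than hand-written, and the features come from a speech encoder, but the lookup step that produces the token sequence has this shape.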
