Muyan-TTS: A Trainable Text-to-Speech Model Optimized for Podcast Scenarios with a $50K Budget

📅 2025-04-27

🤖 AI Summary

Current open-source TTS models lack podcast-specific design, complete training code, and efficient inference support, which hinders real-world deployment. This work introduces Muyan-TTS, presented as the first open-source, trainable TTS model optimized for podcast scenarios. It uses an LLM-based speech synthesis architecture pre-trained on over 100,000 hours of podcast audio, which enables zero-shot voice synthesis and speaker adaptation with dozens of minutes of target speech. The release covers a data collection and processing pipeline, a full training procedure, and an optimized inference framework that accelerates LLM-based synthesis, with reported strengths in naturalness, speaker similarity, and inference speed. All code, pretrained model weights, and data processing scripts are publicly released. The system supports real-time TTS inference on a single GPU and keeps training costs within a budget of about $50,000, making LLM-based TTS more accessible and practical for voice-interaction applications.

📝 Abstract

Recent advancements in text-to-speech (TTS) models have been driven by the integration of large language models (LLMs), enhancing semantic comprehension and improving speech naturalness. However, existing LLM-based TTS models often lack open-source training code and efficient inference acceleration frameworks, limiting their accessibility and adaptability. Additionally, there is no publicly available TTS model specifically optimized for podcast scenarios, which are in high demand for voice interaction applications. To address these limitations, we introduce Muyan-TTS, an open-source trainable TTS model designed for podcast applications within a $50,000 budget. Our model is pre-trained on over 100,000 hours of podcast audio data, enabling zero-shot TTS synthesis with high-quality voice generation. Furthermore, Muyan-TTS supports speaker adaptation with dozens of minutes of target speech, making it highly customizable for individual voices. In addition to open-sourcing the model, we provide a comprehensive data collection and processing pipeline, a full training procedure, and an optimized inference framework that accelerates LLM-based TTS synthesis. Our code and models are available at https://github.com/MYZY-AI/Muyan-TTS.

Problem

Research questions and friction points this paper is trying to address.

Lack of open-source LLM-based TTS training code

No podcast-optimized TTS model publicly available

High cost and limited accessibility of current TTS models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source trainable TTS model for podcasts

Pre-trained on over 100,000 hours of podcast audio

Optimized inference framework for LLM-based TTS
