TTS-1 Technical Report

📅 2025-07-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper introduces Inworld TTS-1, a pair of Transformer-based autoregressive text-to-speech (TTS) models built around a speech-language model (SpeechLM) component: TTS-1 (1.6B parameters, targeting low-latency, on-device 48 kHz synthesis) and TTS-1-Max (8.8B parameters, targeting maximum quality and expressiveness for demanding applications). Both models are trained by scaling train-time compute and applying a sequential pipeline of pre-training, supervised fine-tuning, and RL-alignment of the SpeechLM, and they clone a speaker's voice purely through in-context learning. They support 11 languages with fine-grained emotional control and non-verbal vocalizations (e.g., laughter, sighs) expressed through audio markups, and achieve state-of-the-art results across a variety of benchmarks. The training and modeling code is open-sourced under an MIT license.

📝 Abstract
We introduce Inworld TTS-1, a set of two Transformer-based autoregressive text-to-speech (TTS) models. Our largest model, TTS-1-Max, has 8.8B parameters and is designed for utmost quality and expressiveness in demanding applications. TTS-1 is our most efficient model, with 1.6B parameters, built for real-time speech synthesis and on-device use cases. By scaling train-time compute and applying a sequential process of pre-training, fine-tuning, and RL-alignment of the speech-language model (SpeechLM) component, both models achieve state-of-the-art performance on a variety of benchmarks, demonstrating exceptional quality relying purely on in-context learning of the speaker's voice. Inworld TTS-1 and TTS-1-Max can generate high-resolution 48 kHz speech with low latency, and support 11 languages with fine-grained emotional control and non-verbal vocalizations through audio markups. We additionally open-source our training and modeling code under an MIT license.
Problem

Research questions and friction points this paper is trying to address.

Develop high-quality expressive TTS models for demanding applications
Enable real-time speech synthesis for on-device use cases
Achieve multilingual emotional vocal control with non-verbal sounds
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based autoregressive TTS models
Scaled train-time compute and RL-alignment
High-resolution 48 kHz speech with low latency
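As an illustration of the markup-based control described in the abstract: inline tags annotate the text with emotion and non-verbal vocalizations, while the spoken words remain plain text. The bracket syntax and the helper below are assumptions for the sketch, not the paper's actual markup grammar or API.

```python
import re

def strip_markups(text: str) -> str:
    """Remove [bracketed] audio markups, leaving only the spoken words.

    The bracket syntax here is hypothetical; it stands in for whatever
    tag format the TTS-1 models consume.
    """
    return re.sub(r"\[[^\]]+\]\s?", "", text)

# A hypothetical markup-annotated prompt: emotion tag plus a non-verbal
# vocalization, interleaved with the text to be spoken.
prompt = "[happy] That's wonderful news! [laugh] I can't wait to see you."
print(strip_markups(prompt))
```

Separating markups from spoken content this way is also how one would compute word-level metrics (e.g., WER) on the synthesized audio, since the tags themselves are not spoken.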
👥 Authors
Oleg Atamanenko, Anna Chalova, Joseph Coombes, Nikki Cope, Phillip Dang, Zhifeng Deng, Jimmy Du, Michael Ermolenko, Feifan Fan, Yufei Feng, Cheryl Fichter, Pavel Filimonov, Louis Fischer, Kylan Gibbs, Valeria Gusarova, Pavel Karpik, Andreas Assad Kottner, Ian Lee, Oliver Louie, Jasmine Mai, Mikhail Mamontov, Suri Mao, Nurullah Morshed, Igor Poletaev, Florin Radu