MOSS-TTS Technical Report

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in multilingual, open-domain, high-quality speech synthesis, namely zero-shot voice cloning, long-form generation stability, and fine-grained control over pronunciation and prosody, by proposing a streamlined yet scalable foundation model for speech generation. Built on discrete audio tokens and an autoregressive Transformer architecture, the approach introduces a causal Transformer-based audio tokenizer (MOSS-Audio-Tokenizer) with variable-bitrate residual vector quantization (RVQ) to construct a unified semantic-acoustic representation. It further incorporates a frame-wise local autoregressive module and a dual-generator design, balancing modeling efficiency against deployment flexibility. The resulting models support zero-shot voice cloning, phoneme- and pinyin-level pronunciation control, token-level duration adjustment, seamless code-switching, and low-latency time to first audio, enabling stable, high-fidelity, speaker-consistent long-form speech synthesis across multiple languages.
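The summary attributes a variable-bitrate residual vector quantization (RVQ) scheme to MOSS-Audio-Tokenizer. A minimal sketch of the general RVQ idea: each quantizer stage encodes the residual left by the previous stage, so truncating the stack of codebooks lowers the bitrate at the cost of reconstruction quality. The codebook sizes, depth, and dimensions below are illustrative, not the model's actual configuration, and the random codebooks stand in for learned ones (a zero codeword is included so the residual never grows).

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Quantize vector x with a stack of codebooks; each stage encodes
    the residual left by the previous stage. Returns the per-stage code
    indices and the running reconstruction."""
    residual = x.copy()
    indices, recon = [], np.zeros_like(x)
    for cb in codebooks:
        # nearest codeword to the current residual
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        recon += cb[idx]
        residual = x - recon
    return indices, recon

dim, codebook_size, depth = 8, 16, 4
# Toy codebooks; row 0 is a zero codeword so each stage can "pass".
codebooks = [np.vstack([np.zeros((1, dim)),
                        rng.normal(size=(codebook_size, dim))])
             for _ in range(depth)]
x = rng.normal(size=dim)
codes, recon = rvq_encode(x, codebooks)
# Variable bitrate = decoding only a prefix of `codes`.
```

In a neural codec the codebooks are trained jointly with the encoder and decoder; the sketch only shows the encode-side residual loop.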

📝 Abstract
This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. Building on MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable-bitrate RVQ and a unified semantic-acoustic representation, we release two complementary generators: MOSS-TTS, which emphasizes structural simplicity, scalability, and long-context, control-oriented deployment, and MOSS-TTS-Local-Transformer, which introduces a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme- and pinyin-level pronunciation control, smooth code-switching, and stable long-form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.
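The abstract's "frame-local autoregressive module" suggests a two-level decoding pattern: a backbone steps over frames, while a small local model predicts the several RVQ codes within each frame one codebook at a time. The sketch below shows only that control flow; the `global_step` and `local_step` functions are random stand-ins, not the MOSS-TTS architecture, and all sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
NUM_FRAMES, NUM_CODEBOOKS, CODEBOOK_SIZE = 5, 4, 16

def global_step(history):
    """Stand-in for the backbone: maps the frame history to a context
    vector for the next frame."""
    return rng.normal(size=8)

def local_step(context, partial_codes):
    """Stand-in for the local AR module: logits over the next codebook's
    codes, conditioned on the frame context and codes emitted so far."""
    return rng.normal(size=CODEBOOK_SIZE)

frames = []
for t in range(NUM_FRAMES):
    ctx = global_step(frames)          # one backbone step per frame
    codes = []
    for q in range(NUM_CODEBOOKS):     # local AR loop inside the frame
        logits = local_step(ctx, codes)
        codes.append(int(np.argmax(logits)))
    frames.append(codes)               # a complete frame of RVQ codes
```

Because each backbone step yields a complete frame of codes, audio can be emitted frame by frame, which is consistent with the shorter time-to-first-audio the abstract claims for the local-transformer variant.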
Problem

Research questions and friction points this paper addresses:

- text-to-speech
- voice cloning
- speech generation
- multilingual TTS
- long-form synthesis
Innovation

Methods, ideas, and system contributions that make the work stand out:

- discrete audio tokens
- autoregressive modeling
- variable-bitrate RVQ
- zero-shot voice cloning
- long-form speech synthesis
👥 Authors
Yitian Gong, Botian Jiang, Yiwei Zhao, Yucheng Yuan, Kuangwei Chen, Yaozhou Jiang, Cheng Chang, Dong Hong, Mingshu Chen, Ruixiao Li, Yiyang Zhang, Yang Gao, Hanfu Chen, Ke Chen, Songlin Wang (JD.com), Xiaogui Yang, Yuqian Zhang, Kexin Huang, ZhengYuan Lin, Kang Yu, Ziqi Chen, Jin Wang, Zhaoye Fei (Fudan University), Qinyuan Cheng, Shimin Li (Fudan University)