MOSS-TTS Technical Report

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in multilingual, open-domain, high-quality speech synthesis, namely zero-shot voice cloning, long-form generation stability, and fine-grained control over pronunciation and prosody, by proposing a streamlined yet scalable foundation model for speech generation. Built on discrete audio tokens and an autoregressive Transformer architecture, the approach introduces a causal Transformer-based audio tokenizer (MOSS-Audio-Tokenizer) with variable-bitrate residual vector quantization (RVQ) to construct a unified semantic-acoustic representation. It further incorporates a frame-wise local autoregressive module and a dual-generator design, balancing modeling efficiency against deployment flexibility. The resulting models support zero-shot voice cloning, phoneme- and pinyin-level pronunciation control, token-level duration adjustment, seamless code-switching, and low-latency time to first audio, enabling stable, high-fidelity, speaker-consistent long-form speech synthesis across multiple languages.
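The summary attributes a variable-bitrate residual vector quantization (RVQ) scheme to MOSS-Audio-Tokenizer. A minimal sketch of the general RVQ idea: each quantizer stage encodes the residual left by the previous stage, so truncating the stack of codebooks lowers the bitrate at the cost of reconstruction quality. The codebook sizes, depth, and dimensions below are illustrative, not the model's actual configuration, and the random codebooks stand in for learned ones (a zero codeword is included so the residual never grows).

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Quantize vector x with a stack of codebooks; each stage encodes
    the residual left by the previous stage. Returns the per-stage code
    indices and the running reconstruction."""
    residual = x.copy()
    indices, recon = [], np.zeros_like(x)
    for cb in codebooks:
        # nearest codeword to the current residual
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        recon += cb[idx]
        residual = x - recon
    return indices, recon

dim, codebook_size, depth = 8, 16, 4
# Toy codebooks; row 0 is a zero codeword so each stage can "pass".
codebooks = [np.vstack([np.zeros((1, dim)),
                        rng.normal(size=(codebook_size, dim))])
             for _ in range(depth)]
x = rng.normal(size=dim)
codes, recon = rvq_encode(x, codebooks)
# Variable bitrate = decoding only a prefix of `codes`.
```

In a neural codec the codebooks are trained jointly with the encoder and decoder; the sketch only shows the encode-side residual loop.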

📝 Abstract
This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. Building on MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable-bitrate RVQ and a unified semantic-acoustic representation, we release two complementary generators: MOSS-TTS, which emphasizes structural simplicity, scalability, and long-context, control-oriented deployment, and MOSS-TTS-Local-Transformer, which introduces a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme- and pinyin-level pronunciation control, smooth code-switching, and stable long-form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.
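The abstract's "frame-local autoregressive module" suggests a two-level decoding pattern: a backbone steps over frames, while a small local model predicts the several RVQ codes within each frame one codebook at a time. The sketch below shows only that control flow; the `global_step` and `local_step` functions are random stand-ins, not the MOSS-TTS architecture, and all sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
NUM_FRAMES, NUM_CODEBOOKS, CODEBOOK_SIZE = 5, 4, 16

def global_step(history):
    """Stand-in for the backbone: maps the frame history to a context
    vector for the next frame."""
    return rng.normal(size=8)

def local_step(context, partial_codes):
    """Stand-in for the local AR module: logits over the next codebook's
    codes, conditioned on the frame context and codes emitted so far."""
    return rng.normal(size=CODEBOOK_SIZE)

frames = []
for t in range(NUM_FRAMES):
    ctx = global_step(frames)          # one backbone step per frame
    codes = []
    for q in range(NUM_CODEBOOKS):     # local AR loop inside the frame
        logits = local_step(ctx, codes)
        codes.append(int(np.argmax(logits)))
    frames.append(codes)               # a complete frame of RVQ codes
```

Because each backbone step yields a complete frame of codes, audio can be emitted frame by frame, which is consistent with the shorter time-to-first-audio the abstract claims for the local-transformer variant.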
Problem

Research questions and friction points this paper addresses:

- text-to-speech
- voice cloning
- speech generation
- multilingual TTS
- long-form synthesis
Innovation

Methods, ideas, and system contributions that make the work stand out:

- discrete audio tokens
- autoregressive modeling
- variable-bitrate RVQ
- zero-shot voice cloning
- long-form speech synthesis
👥 Authors
Yitian Gong, Botian Jiang, Yiwei Zhao, Yucheng Yuan, Kuangwei Chen, Yaozhou Jiang, Cheng Chang, Dong Hong, Mingshu Chen, Ruixiao Li, Yiyang Zhang, Yang Gao, Hanfu Chen, Ke Chen, Songlin Wang (JD.com), Xiaogui Yang, Yuqian Zhang, Kexin Huang, ZhengYuan Lin, Kang Yu, Ziqi Chen, Jin Wang, Zhaoye Fei (Fudan University), Qinyuan Cheng, Shimin Li (Fudan University)