LLaDA-TTS: Unifying Speech Synthesis and Zero-Shot Editing via Masked Diffusion Modeling

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high inference latency of conventional autoregressive speech synthesis systems and their inability to support zero-shot editing. The authors propose a parallel speech generation approach based on masked diffusion modeling, which unifies speech synthesis and zero-shot editing tasks with minimal fine-tuning data by leveraging bidirectional attention and efficiently transferring weights from an autoregressive pretrained model. Theoretical analysis reveals the locality property of speech tokens, informing the model architecture. Experimental results demonstrate that the method achieves a 0.98% character error rate (CER) on Mandarin and a 1.96% word error rate (WER) on English within just 64 diffusion steps—matching the performance of autoregressive baselines while offering a 2× speedup—and enables zero-shot word-level editing operations such as insertion, deletion, and substitution.
📝 Abstract
Large language model (LLM)-based text-to-speech (TTS) systems achieve remarkable naturalness via autoregressive (AR) decoding, but require N sequential steps to generate N speech tokens. We present LLaDA-TTS, which replaces the AR LLM with a masked diffusion model that completes generation in a fixed number of parallel steps, decoupling inference latency from sequence length. Remarkably, using only 50 hours of fine-tuning data, we successfully transfer a pretrained AR checkpoint to the masked diffusion paradigm via bidirectional attention. At 64 steps, LLaDA-TTS achieves 0.98% CER (zh) and 1.96% WER (en) on Seed-TTS-Eval, matching the original CosyVoice 3 baseline performance while delivering a 2× LLM-stage speedup, a notable acceleration achieved despite the absence of KV cache, an optimization the AR baseline heavily relies on. Beyond acceleration, the bidirectional architecture naturally enables zero-shot speech editing, including word-level insertion, deletion, and substitution, without any additional training. Theoretically, we prove that AR-pretrained weights are near-optimal for bidirectional masked prediction under the locality property of acoustic tokens, explaining this rapid convergence. This general method modifies only the attention mask and objective, applying seamlessly to any LLM-based AR TTS system. Code and audio samples will be available at https://deft-piroshki-b652b5.netlify.app/.
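The abstract's core mechanism can be sketched in a few lines: start from a fully masked token sequence, predict all positions in parallel at each step, and commit the most confident slots; editing then falls out by re-masking a span and re-filling it with bidirectional context. The sketch below is illustrative only, under stated assumptions: `dummy_model` is a hypothetical stand-in for the paper's bidirectional transformer, and the confidence-based reveal schedule is one common masked-diffusion sampler, not necessarily the exact one LLaDA-TTS uses.

```python
import numpy as np

MASK = -1  # sentinel id for a masked speech-token slot (assumption)

def dummy_model(tokens, vocab_size=32):
    """Hypothetical stand-in for the bidirectional transformer:
    returns per-position logits over the speech-token vocabulary."""
    rng = np.random.default_rng(0)  # deterministic placeholder logits
    return rng.standard_normal((len(tokens), vocab_size))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_diffusion_decode(model, length, steps=64):
    """Generate `length` tokens in at most `steps` parallel passes,
    decoupling latency from sequence length."""
    tokens = np.full(length, MASK, dtype=int)
    per_step = max(1, length // steps)  # slots revealed per pass
    for _ in range(steps):
        masked = np.where(tokens == MASK)[0]
        if masked.size == 0:
            break
        probs = softmax(model(tokens))
        conf = probs.max(axis=-1)     # per-position confidence
        pred = probs.argmax(axis=-1)  # per-position best token
        # commit the highest-confidence still-masked positions
        reveal = masked[np.argsort(-conf[masked])][:per_step]
        tokens[reveal] = pred[reveal]
    return tokens

def edit_span(model, tokens, start, end, steps=16):
    """Zero-shot editing: re-mask [start, end) and re-fill it while the
    surrounding tokens stay fixed, using context from both sides."""
    tokens = tokens.copy()
    tokens[start:end] = MASK
    return _refill(model, tokens, steps)

def _refill(model, tokens, steps):
    for _ in range(steps):
        masked = np.where(tokens == MASK)[0]
        if masked.size == 0:
            break
        probs = softmax(model(tokens))
        conf = probs.max(axis=-1)
        pred = probs.argmax(axis=-1)
        best = masked[np.argmax(conf[masked])]  # one slot per pass here
        tokens[best] = pred[best]
    return tokens
```

With a real model, insertion, deletion, and substitution all reduce to choosing which span to mask before re-filling; the fixed step budget (64 in the paper) bounds the number of forward passes regardless of output length.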
Problem

Research questions and friction points this paper is trying to address.

text-to-speech
autoregressive decoding
inference latency
zero-shot speech editing
masked diffusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

masked diffusion modeling
zero-shot speech editing
bidirectional attention
non-autoregressive TTS
LLM-based speech synthesis
Xiaoyu Fan
BRVoice Team, Bairong, Inc., China
Huizhi Xie
BRVoice Team, Bairong, Inc., China
Wei Zou
PKU, Samsung, Baidu, Didi, Ke
Speech, NLP, LLM, Multimodal
Yunzhang Chen
BRVoice Team, Bairong, Inc., China