DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation

📅 2025-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses autoregressive generation of continuous speech representations without discrete tokens. The authors propose DiTAR, a patch-based autoregressive framework that combines a language model with a diffusion transformer. Methodologically, they introduce an "aggregate-then-generate" patching strategy that significantly reduces sequence-modeling cost: the language model consumes aggregated patch embeddings, and the diffusion transformer generates the next patch conditioned on the language model's output. For inference, they define temperature as the time point at which noise is introduced along the reverse-diffusion ODE, enabling fine-grained control over the trade-off between generation diversity and determinism. Experiments demonstrate state-of-the-art zero-shot speech synthesis across three core metrics: robustness, speaker similarity, and naturalness. The model also exhibits strong scalability and computational efficiency, delivering high-fidelity speech representation generation at reduced computational cost.

📝 Abstract
Several recent studies have attempted to autoregressively generate continuous speech representations without discrete speech tokens by combining diffusion and autoregressive models, yet they often face challenges with excessive computational loads or suboptimal outcomes. In this work, we propose Diffusion Transformer Autoregressive Modeling (DiTAR), a patch-based autoregressive framework combining a language model with a diffusion transformer. This approach significantly enhances the efficacy of autoregressive models for continuous tokens and reduces computational demands. DiTAR utilizes a divide-and-conquer strategy for patch generation, where the language model processes aggregated patch embeddings and the diffusion transformer subsequently generates the next patch based on the output of the language model. For inference, we propose defining temperature as the time point of introducing noise during the reverse diffusion ODE to balance diversity and determinism. We also show in the extensive scaling analysis that DiTAR has superb scalability. In zero-shot speech generation, DiTAR achieves state-of-the-art performance in robustness, speaker similarity, and naturalness.
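The divide-and-conquer patch generation described in the abstract can be sketched as a toy loop. The pooling function, language model, and denoiser below are stand-ins with assumed names and dynamics, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

PATCH, DIM = 4, 8  # hypothetical patch length and feature dimension

def aggregate(patch):
    # Pool a patch of continuous frames into one embedding for the LM.
    return patch.mean(axis=0)

def language_model(history):
    # Stand-in causal LM: summarize aggregated patch embeddings into a
    # conditioning vector for the next patch (a toy mean, not a real LM).
    return history.mean(axis=0)

def diffusion_transformer(condition, steps=8):
    # Toy reverse diffusion: start from Gaussian noise and move the
    # patch toward the conditioning vector step by step.
    x = rng.normal(size=(PATCH, DIM))
    for s in range(steps):
        x = x + (condition - x) / (steps - s)
    return x

def generate(num_patches=3):
    patches = [rng.normal(size=(PATCH, DIM))]  # hypothetical prompt patch
    for _ in range(num_patches):
        history = np.stack([aggregate(p) for p in patches])
        cond = language_model(history)
        patches.append(diffusion_transformer(cond))
    return np.concatenate(patches)

speech = generate()  # 4 patches of 4 frames, each with 8 features
```

The key structural point this illustrates: the language model only ever sees one embedding per patch rather than every frame, which is where the reduction in sequence-modeling cost comes from.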
Problem

Research questions and friction points this paper is trying to address.

Autoregressive speech generation without discrete tokens
Reducing computational load in continuous token models
Enhancing speech generation robustness and scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines diffusion and autoregressive models
Utilizes patch-based autoregressive framework
Defines temperature as the noise-injection time in the reverse diffusion ODE, balancing diversity and determinism
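The temperature idea above can be illustrated with a toy reverse-time ODE in which tau picks the time at which Gaussian noise is injected, so tau = 0 yields a fully deterministic trajectory. The dynamics, score function, and noise scale here are illustrative assumptions, not the paper's formulation:

```python
import numpy as np

def sample_with_temperature(score, x_T, tau, n_steps=100):
    # Euler integration of a toy reverse-time probability-flow ODE from
    # t = 1 down to t = 0. Fresh noise is added once, when the trajectory
    # crosses the temperature time tau; its scale sqrt(tau) vanishes at
    # tau = 0, making that run deterministic.
    rng = np.random.default_rng(0)
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    x = x_T
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + score(x, t0) * (t1 - t0)  # dt is negative (backward in time)
        if t1 <= tau < t0:
            x = x + np.sqrt(tau) * rng.normal(size=np.shape(x))
    return x

toy_score = lambda x, t: x  # pulls samples toward 0 as t decreases
x_det = sample_with_temperature(toy_score, 1.0, tau=0.0)    # deterministic
x_noisy = sample_with_temperature(toy_score, 1.0, tau=0.5)  # noise at t = 0.5
```

Raising tau injects noise earlier in the reverse trajectory, leaving more integration time for it to shape the sample, which is one way to read "temperature" as a diversity knob.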
Authors

Dongya Jia (ByteDance Seed)
Zhuo Chen (ByteDance)
Jiawei Chen (ByteDance)
Chenpeng Du (ByteDance)
Jian Wu (ByteDance)
Jian Cong (ByteDance Seed)
Xiaobin Zhuang (ByteDance)
Chumin Li (ByteDance)
Zhen Wei (ByteDance)
Yuping Wang (ByteDance)
Yuxuan Wang (ByteDance)