Discrete-time diffusion-like models for speech synthesis

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Continuous-time diffusion models for text-to-speech synthesis suffer from a mismatch between continuous-time training and discretized sampling, and are typically restricted to additive Gaussian noise. To address these limitations, this paper proposes a discrete-time diffusion framework. By systematically constructing additive Gaussian, multiplicative Gaussian, blurring, and hybrid noise processes, it explicitly captures non-Markovian dependencies across discrete timesteps, removing the reliance on continuous-time assumptions. The framework requires significantly fewer inference steps (≤20) and keeps training and sampling conditions consistent, improving efficiency in both. Experiments on LJSpeech show competitive performance (MOS = 4.12, STOI = 0.958, ESTOI = 0.921), on par with state-of-the-art continuous-time models, while training 1.8× faster and running each inference step 2.3× faster.

📝 Abstract
Diffusion models have attracted a lot of attention in recent years. These models view speech generation as a continuous-time process. For efficient training, this process is typically restricted to additive Gaussian noising, which is limiting. For inference, time is typically discretized, leading to a mismatch between continuous training and discrete sampling conditions. Recently proposed discrete-time processes, on the other hand, usually do not have these limitations, may require substantially fewer inference steps, and are fully consistent between training and inference conditions. This paper explores some diffusion-like discrete-time processes and proposes some new variants. These include processes applying additive Gaussian noise, multiplicative Gaussian noise, blurring noise, and a mixture of blurring and Gaussian noises. The experimental results suggest that discrete-time processes offer comparable subjective and objective speech quality to their widely popular continuous counterpart, with more efficient and consistent training and inference schemas.
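As a rough illustration of the discrete-time additive-Gaussian process the abstract describes, the sketch below corrupts a toy spectrogram over a fixed number of discrete steps. This is not the paper's exact formulation: the DDPM-style transition, the schedule values, and the array shapes are all invented for the example.

```python
import numpy as np

def additive_gaussian_step(x, alpha_t, rng):
    """One discrete-time forward step with additive Gaussian noise
    (standard DDPM-style transition; not the paper's exact process)."""
    eps = rng.standard_normal(x.shape)
    return np.sqrt(alpha_t) * x + np.sqrt(1.0 - alpha_t) * eps

# Corrupt a toy "spectrogram" over T = 20 discrete steps, matching the
# small step count (≤20) the paper reports for inference.
rng = np.random.default_rng(0)
x = rng.standard_normal((80, 100))    # e.g. an 80-bin mel spectrogram
alphas = np.linspace(0.99, 0.90, 20)  # illustrative noise schedule
for alpha_t in alphas:
    x = additive_gaussian_step(x, alpha_t, rng)
```

Because the same discrete steps are used at training and sampling time, there is no continuous/discrete mismatch to begin with, which is the consistency argument the abstract makes.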
Problem

Research questions and friction points this paper is trying to address.

Addresses mismatch between continuous training and discrete sampling in diffusion models
Explores discrete-time diffusion processes with various noise types for speech synthesis
Aims to achieve efficient training and inference while maintaining speech quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete-time diffusion processes replace continuous-time modeling
Proposes Gaussian, multiplicative, blurring, and mixed noise variants
Achieves comparable speech quality with more efficient inference
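The non-Gaussian variants listed above can be sketched in the same discrete-time style. The forms below are only plausible stand-ins for the paper's processes: multiplicative noise as random per-element scaling, and blurring as convolution with a small normalized Gaussian kernel; all parameters are invented for the example.

```python
import numpy as np

def multiplicative_gaussian_step(x, sigma_t, rng):
    """Multiplicative Gaussian noise: each element is scaled by a random
    factor rather than shifted (illustrative form, not the paper's)."""
    return x * (1.0 + sigma_t * rng.standard_normal(x.shape))

def blur_step(x, kernel_width):
    """Blurring noise: low-pass each row with a normalized Gaussian kernel,
    a discrete analogue of heat-dissipation corruption (sizes illustrative)."""
    taps = np.arange(-kernel_width, kernel_width + 1)
    kernel = np.exp(-0.5 * (taps / kernel_width) ** 2)
    kernel /= kernel.sum()
    # Convolve each row (e.g. each mel band over time) with the kernel.
    return np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="same"), 1, x
    )

rng = np.random.default_rng(0)
x = rng.standard_normal((80, 100))
x = multiplicative_gaussian_step(x, 0.1, rng)  # hybrid: Gaussian, then blur
x = blur_step(x, kernel_width=2)
```

Chaining the two steps, as in the last lines, mirrors the mixed blurring-plus-Gaussian process the bullets mention.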
Xiaozhou Tan
Department of Computer Science, University of Sheffield, UK
Minghui Zhao
Department of Computer Science, University of Sheffield, UK
Mattias Cross
Department of Computer Science, University of Sheffield, UK
Anton Ragni
University of Sheffield
Speech and Language Technologies