🤖 AI Summary
Continuous-time diffusion models for text-to-speech synthesis suffer from a mismatch between continuous-time training and discrete-time sampling, and are typically restricted to additive Gaussian noise. To address these issues, this paper proposes a discrete-time diffusion framework. By systematically constructing additive Gaussian, multiplicative Gaussian, blurring, and hybrid noise processes, it explicitly captures dependencies across discrete timesteps, removing the reliance on continuous-time assumptions. The framework enables significantly fewer inference steps (≤20) and full consistency between training and sampling conditions, improving efficiency on both sides. Experiments on LJSpeech demonstrate competitive performance (MOS=4.12, STOI=0.958, ESTOI=0.921), on par with state-of-the-art continuous-time models, while achieving 1.8× faster training and 2.3× faster per-step inference.
📝 Abstract
Diffusion models have attracted considerable attention in recent years. These models view speech generation as a continuous-time process. For efficient training, this process is typically restricted to additive Gaussian noising, which is limiting. For inference, time is typically discretized, leading to a mismatch between continuous training and discrete sampling conditions. Recently proposed discrete-time processes, on the other hand, usually do not have these limitations, may require substantially fewer inference steps, and are fully consistent between training and inference conditions. This paper explores several diffusion-like discrete-time processes and proposes new variants, including processes applying additive Gaussian noise, multiplicative Gaussian noise, blurring noise, and a mixture of blurring and Gaussian noises. The experimental results suggest that discrete-time processes offer subjective and objective speech quality comparable to their widely popular continuous-time counterparts, with more efficient and consistent training and inference schemes.
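To make the contrast concrete, the forward noising side of such discrete-time processes can be sketched as a fixed, small number of explicit steps. The sketch below is illustrative only and not the paper's actual formulation: the step rules, the linear beta schedule, and the 20-step count (matching the "fewer inference steps" claim) are all assumptions for demonstration on a toy mel-spectrogram-shaped array.

```python
import numpy as np

def additive_gaussian_step(x, beta_t, rng):
    # Assumed DDPM-style update: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps
    eps = rng.standard_normal(x.shape)
    return np.sqrt(1.0 - beta_t) * x + np.sqrt(beta_t) * eps

def multiplicative_gaussian_step(x, beta_t, rng):
    # Assumed multiplicative variant: noise scales the signal instead of adding to it
    eps = rng.standard_normal(x.shape)
    return x * (1.0 + np.sqrt(beta_t) * eps)

def forward_process(x0, betas, step_fn, seed=0):
    # Apply T discrete noising steps; T = len(betas), e.g. 20
    rng = np.random.default_rng(seed)
    x = x0
    for beta_t in betas:
        x = step_fn(x, beta_t, rng)
    return x

# Toy "mel-spectrogram": 80 mel bins x 100 frames
x0 = np.random.default_rng(1).standard_normal((80, 100))
betas = np.linspace(1e-4, 0.05, 20)  # assumed linear schedule over 20 steps
xT = forward_process(x0, betas, additive_gaussian_step)
```

Because the same 20 steps are used at training and inference time, there is no continuous-to-discrete discretization gap of the kind the abstract describes; a learned reverse model would simply invert these steps one by one.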