Score-Based Training for Energy-Based TTS Models

📅 2025-05-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the instability and low fidelity of first-order sampling (e.g., Langevin dynamics) in energy-based text-to-speech (TTS) models, stemming from noise contrastive estimation (NCE) and sliced score matching (SSM) ignoring the structural properties of the log-likelihood gradient. We propose the first score learning objective that explicitly incorporates the log-likelihood gradient structure, achieved by unifying variants of score matching with first-order optimization theory to jointly model energy functions and diffusion processes. Evaluated on multiple TTS benchmarks, our method significantly outperforms NCE and SSM: it improves mean opinion scores (MOS) by over 0.3, reduces sampling steps by 40%, and maintains speech quality stability. The core contribution is a novel training paradigm for energy-based models specifically tailored to first-order inference, enabling robust and efficient sampling without sacrificing perceptual fidelity.

Technology Category

Application Category

📝 Abstract
Noise contrastive estimation (NCE) is a popular method for training energy-based models (EBM) with intractable normalisation terms. The key idea of NCE is to learn by comparing unnormalised log-likelihoods of the reference and noisy samples, thus avoiding explicitly computing normalisation terms. However, NCE critically relies on the quality of noisy samples. Recently, sliced score matching (SSM) has been popularised by closely related diffusion models (DM). Unlike NCE, SSM learns a gradient of log-likelihood, or score, by learning distribution of its projections on randomly chosen directions. However, both NCE and SSM disregard the form of log-likelihood function, which is problematic given that EBMs and DMs make use of first-order optimisation during inference. This paper proposes a new criterion that learns scores more suitable for first-order schemes. Experiments contrasts these approaches for training EBMs.
Problem

Research questions and friction points this paper is trying to address.

Improving noisy sample quality in NCE training
Enhancing score learning for first-order optimization
Comparing NCE and SSM for energy-based models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses noise contrastive estimation for training
Applies sliced score matching for gradient learning
Proposes new criterion for first-order optimization
🔎 Similar Papers
No similar papers found.
W
Wanli Sun
School of Computer Science, University of Sheffield, Sheffield, UK
Anton Ragni
Anton Ragni
University of Sheffield
Speech and Language Technologies