🤖 AI Summary
Diffusion vocoders exhibit insufficient robustness when conditioned on mel spectrograms that deviate from the training distribution, leading to degraded audio quality and phase misalignment. To address this, we propose a single-step Griffin-Lim algorithm (GLA) correction that embeds phase-aware spectral refinement directly into the diffusion denoising process. By performing only one GLA iteration within the reverse diffusion step, our method jointly refines magnitude and phase, significantly improving waveform alignment and generation stability under out-of-domain conditions. Crucially, it requires no auxiliary networks or iterative optimization, ensuring computational efficiency and seamless integration. Experiments demonstrate that our approach surpasses state-of-the-art diffusion vocoders in both subjective MOS and objective metrics (STOI, PESQ), with particularly notable gains in generalization to out-of-domain mel inputs. This work establishes a lightweight, effective paradigm for phase modeling in robust speech synthesis.
📝 Abstract
Recent advances in diffusion models have positioned them as powerful generative frameworks for speech synthesis, demonstrating substantial improvements in audio quality and stability. Nevertheless, their effectiveness as vocoders conditioned on mel spectrograms remains constrained, particularly when the conditioning diverges from the training distribution. The recently proposed GLA-Grad model introduced a phase-aware extension of the WaveGrad vocoder that integrates the Griffin-Lim algorithm (GLA) into the reverse process to reduce inconsistencies between the generated signal and the conditioning mel spectrogram. In this paper, we further improve GLA-Grad through a new strategy for applying the correction: we compute the correction term only once, with a single application of GLA, which accelerates the generation process. Experimental results demonstrate that our method consistently outperforms the baseline models, particularly in out-of-domain scenarios.
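At its core, a single GLA application is one magnitude-projection step: take the STFT of the current waveform estimate, keep its phase, swap in the target magnitude spectrogram, and invert. The sketch below illustrates this step in isolation, assuming the target linear-magnitude spectrogram has already been recovered from the conditioning mel spectrogram; the function name, STFT parameters, and `scipy`-based implementation are illustrative choices, not the paper's actual code.

```python
import numpy as np
from scipy.signal import stft, istft

def single_gla_step(x, target_mag, fs=22050, nperseg=1024, noverlap=768):
    """One Griffin-Lim iteration (illustrative sketch).

    Projects the current waveform estimate `x` onto the set of signals
    whose STFT magnitude equals `target_mag`, keeping the phase of `x`,
    then inverts back to the time domain.
    """
    # STFT of the current estimate; keep only its phase.
    _, _, X = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    phase = np.angle(X)
    # Enforce the target magnitude while retaining the current phase.
    X_proj = target_mag * np.exp(1j * phase)
    # Inverse STFT yields the phase-refined waveform.
    _, x_refined = istft(X_proj, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return x_refined
```

In the full method, this projection would be applied once to form the correction term inside the reverse diffusion step, rather than being iterated as in classical Griffin-Lim.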