🤖 AI Summary
Diffusion vocoders exhibit insufficient robustness when conditioned on mel spectrograms that deviate from the training distribution, leading to degraded audio quality and phase misalignment. To address this, we propose a single-step Griffin-Lim algorithm (GLA) correction that embeds phase-aware spectral refinement directly into the diffusion denoising process. By performing only one GLA iteration within the reverse diffusion step, our method jointly refines magnitude and phase, significantly improving waveform alignment and generation stability under out-of-domain conditions. Crucially, it requires no auxiliary networks or iterative optimization, ensuring computational efficiency and seamless integration. Experiments demonstrate that our approach surpasses state-of-the-art diffusion vocoders in both subjective MOS and objective metrics (STOI, PESQ), with particularly notable gains in generalization to out-of-domain mel inputs. This work establishes a lightweight, effective paradigm for phase modeling in robust speech synthesis.
📝 Abstract
Recent advances in diffusion models have positioned them as powerful generative frameworks for speech synthesis, demonstrating substantial improvements in audio quality and stability. Nevertheless, their effectiveness as vocoders conditioned on mel spectrograms remains constrained, particularly when the conditioning diverges from the training distribution. The recently proposed GLA-Grad model introduced a phase-aware extension of the WaveGrad vocoder that integrates the Griffin-Lim algorithm (GLA) into the reverse process to reduce inconsistencies between the generated signal and the conditioning mel spectrogram. In this paper, we further improve GLA-Grad through a new strategy for applying the correction: we compute the correction term only once, with a single application of GLA, which accelerates the generation process. Experimental results demonstrate that our method consistently outperforms the baseline models, particularly in out-of-domain scenarios.
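At its core, a single GLA application is one magnitude-projection step: take the STFT of the current waveform estimate, keep its phase, swap in the target magnitude spectrogram, and invert. The sketch below illustrates this step in isolation, assuming the target linear-magnitude spectrogram has already been recovered from the conditioning mel spectrogram; the function name, STFT parameters, and `scipy`-based implementation are illustrative choices, not the paper's actual code.

```python
import numpy as np
from scipy.signal import stft, istft

def single_gla_step(x, target_mag, fs=22050, nperseg=1024, noverlap=768):
    """One Griffin-Lim iteration (illustrative sketch).

    Projects the current waveform estimate `x` onto the set of signals
    whose STFT magnitude equals `target_mag`, keeping the phase of `x`,
    then inverts back to the time domain.
    """
    # STFT of the current estimate; keep only its phase.
    _, _, X = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    phase = np.angle(X)
    # Enforce the target magnitude while retaining the current phase.
    X_proj = target_mag * np.exp(1j * phase)
    # Inverse STFT yields the phase-refined waveform.
    _, x_refined = istft(X_proj, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return x_refined
```

In the full method, this projection would be applied once to form the correction term inside the reverse diffusion step, rather than being iterated as in classical Griffin-Lim.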