π€ AI Summary
To address high-fidelity time-domain audio reconstruction from Mel spectrograms, this paper proposes a full-band STFT magnitude and phase joint inversion framework based on the Alternating Direction Method of Multipliers (ADMM). Unlike cascaded approaches, our method formulates a non-convex optimization problem and exploits conditional independence among variables to design an efficient alternating update schemeβthe first application of ADMM to joint Mel spectrogram inversion. The framework jointly enforces inverse STFT (iSTFT) consistency and Mel filterbank constraints. Evaluated on speech and Foley sound datasets, it achieves state-of-the-art reconstruction quality: reducing iteration count by 40% compared to existing joint estimation methods, improving STOI by 1.2β1.8 percentage points and PESQ by 0.4β0.7 points. These gains significantly mitigate error accumulation and enhance signal reconstruction fidelity for audio post-production.
π Abstract
Signal reconstruction from its mel-spectrogram is known as mel-spectrogram inversion and has many applications, including speech and foley sound synthesis. In this paper, we propose a mel-spectrogram inversion method based on a rigorous optimization algorithm. To reconstruct a time-domain signal with inverse short-time Fourier transform (STFT), both full-band STFT magnitude and phase should be predicted from a given mel-spectrogram. Their joint estimation has outperformed the cascaded full-band magnitude prediction and phase reconstruction by preventing error accumulation. However, the existing joint estimation method requires many iterations, and there remains room for performance improvement. We present an alternating direction method of multipliers (ADMM)-based joint estimation method motivated by its success in various nonconvex optimization problems including phase reconstruction. An efficient update of each variable is derived by exploiting the conditional independence among the variables. Our experiments demonstrate the effectiveness of the proposed method on speech and foley sounds.