🤖 AI Summary
This paper addresses the problem of synthesizing an audio signal that interpolates between a source and a target sound. It introduces a global spectrogram interpolation framework based on optimal transport (OT): the transportation cost matrix is given a structure that prohibits remote displacements of energy along the time axis, and unbalanced OT is used to compute the Wasserstein barycenter of the source and target spectrograms. Phase reconstruction is then performed via the Griffin–Lim algorithm, followed by an inverse short-time Fourier transform to yield a time-domain waveform. Because the spectrograms are treated globally, the method does not operate frame by frame as previous approaches do. Results on synthetic musical notes and real environmental sounds illustrate the potential of the approach, and the structured cost matrix also reduces the computational load.
📝 Abstract
We present a novel approach for generating an artificial audio signal that interpolates between given source and target sounds. Our approach relies on the computation of Wasserstein barycenters of the source and target spectrograms, followed by phase reconstruction and inversion. In contrast with previous work, our method considers the spectrograms globally and does not operate on a temporal frame-to-frame basis. Another contribution is to endow the transportation cost matrix with a specific structure that prohibits remote displacements of energy along the time axis; optimal transport with such a cost is made possible by the unbalanced transport framework. The proposed cost matrix is meaningful from an audio perspective and also reduces the computational load. Results with synthetic musical notes and real environmental sounds illustrate the potential of our approach.
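The idea of a time-banded cost matrix combined with unbalanced transport can be illustrated with a small pure-Python sketch. It is not the authors' implementation. The squared-distance cost inside the band, the entropic Sinkhorn solver with KL-relaxed marginals, and the names `max_shift`, `eps`, and `rho` are all assumptions chosen for illustration. The barycenter computation and the phase reconstruction step are left out.

```python
import math

def band_limited_cost(n_frames, n_bins, max_shift):
    """Cost between time-frequency bins (t, f) and (t', f').

    Assumption: squared Euclidean distance inside the band, infinite cost
    when the two bins are more than `max_shift` frames apart, so that no
    energy can travel that far along the time axis.
    """
    points = [(t, f) for t in range(n_frames) for f in range(n_bins)]
    cost = [[math.inf if abs(t1 - t2) > max_shift
             else float((t1 - t2) ** 2 + (f1 - f2) ** 2)
             for (t2, f2) in points]
            for (t1, f1) in points]
    return points, cost

def unbalanced_sinkhorn(a, b, cost, eps=0.5, rho=1.0, n_iter=200):
    """Entropic unbalanced OT plan with KL-relaxed marginals (a sketch).

    `eps` is the entropic regularization and `rho` the marginal relaxation
    strength; both are hypothetical parameter names.
    """
    n, m = len(a), len(b)
    # exp(-inf) is 0.0, so forbidden pairs get a zero kernel entry.
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    power = rho / (rho + eps)
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        for i in range(n):
            kv = sum(kernel[i][j] * v[j] for j in range(m))
            u[i] = (a[i] / kv) ** power if kv > 0 else 0.0
        for j in range(m):
            ktu = sum(kernel[i][j] * u[i] for i in range(n))
            v[j] = (b[j] / ktu) ** power if ktu > 0 else 0.0
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy "spectrograms" with 4 frames and 2 frequency bins, flattened in the
# same order as `points`: (0,0), (0,1), (1,0), (1,1), (2,0), ...
points, cost = band_limited_cost(n_frames=4, n_bins=2, max_shift=1)
source = [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # energy at (0,0) and (1,0)
target = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]  # energy at (1,1) and (3,1)
plan = unbalanced_sinkhorn(source, target, cost)

transported = sum(map(sum, plan))
print("mass transported:", transported)
print("mass reaching bin (3,1):", sum(row[7] for row in plan))
```

In this toy case the target energy at frame 3 lies more than one frame away from every source bin, so a balanced transport problem would have no finite-cost solution. The relaxed marginals let the solver leave that energy unmatched, which is one way to read the abstract's remark that the unbalanced framework makes the structured cost usable. The zero kernel entries also hint at the reduced computation: only pairs inside the time band need to be stored and updated.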