🤖 AI Summary
This work addresses the challenge of achieving high-directivity spatial audio upmixing from low-channel spherical microphone arrays, where conventional approaches are hindered by the coupling bottleneck between sound source direction estimation and the limited spatial resolution of first-order Ambisonics (FOA). To overcome this limitation, the authors propose SIRUP, the first method to employ a conditional latent diffusion model for Ambisonics upmixing. Specifically, a variational autoencoder (VAE) compresses high-order Ambisonics (HOA) into a latent space, and a diffusion model is trained—conditioned on FOA inputs—to generate high-fidelity HOA embeddings. This approach effectively decouples direction estimation from spatial resolution constraints, significantly outperforming existing FOA-based systems in directional upmixing, sound source localization, and speech denoising, thereby enhancing both the directivity and reconstruction quality of spatial audio.
📝 Abstract
This paper presents virtual upmixing of steering vectors captured by a low-channel spherical microphone array. This challenge has conventionally been addressed by recovering the directions and signals of sound sources from first-order Ambisonics (FOA) data, and then rendering the higher-order Ambisonics (HOA) data using a physics-based acoustic simulator. This approach, however, struggles with the mutual dependency between the accuracy of source direction estimation and the limited spatial resolution of the FOA data. Our method, named SIRUP, employs a latent diffusion model architecture. Specifically, a variational autoencoder (VAE) is used to learn a compact encoding of the HOA data in a latent space, and a diffusion model is then trained to generate the HOA embeddings, conditioned on the FOA data. Experimental results showed that SIRUP achieved significant improvements over FOA-based systems in steering vector upmixing, source localization, and speech denoising.
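The two-stage pipeline described above (a VAE that compresses HOA channels into a latent space, followed by a diffusion model that samples HOA latents conditioned on FOA) can be sketched as follows. This is a minimal, illustrative NumPy sketch only: the channel counts (4-channel FOA, 16-channel third-order HOA), the 8-dimensional latent, the linear stand-in "VAE", and the `cond_proj` denoiser are all hypothetical placeholders for the trained neural networks in SIRUP.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): 4-ch FOA input, 16-ch 3rd-order HOA target,
# 8-dim latent bottleneck for the VAE.
N_FOA, N_HOA, N_LAT = 4, 16, 8
T = 256  # number of frames

# Stand-in linear "VAE": a random projection as encoder and its
# pseudo-inverse as decoder (a real VAE would be a trained network).
enc = rng.standard_normal((N_LAT, N_HOA)) / np.sqrt(N_HOA)
dec = np.linalg.pinv(enc)

def encode(hoa):
    return enc @ hoa

def decode(z):
    return dec @ z

# Toy conditional "denoiser": projects the FOA condition into latent space.
# In SIRUP this role is played by a trained diffusion model.
cond_proj = rng.standard_normal((N_LAT, N_FOA)) / np.sqrt(N_FOA)

def sample_hoa(foa, steps=10):
    """Ancestral-sampling sketch: start from noise, iteratively move the
    latent toward the denoiser's FOA-conditioned estimate, then decode."""
    z = rng.standard_normal((N_LAT, foa.shape[1]))   # start from pure noise
    for t in range(steps, 0, -1):
        z_hat = cond_proj @ foa                      # clean-latent estimate
        alpha = t / steps
        z = alpha * z + (1 - alpha) * z_hat          # blend toward estimate
    return decode(z)                                 # latent -> HOA channels

foa = rng.standard_normal((N_FOA, T))
hoa_up = sample_hoa(foa)
print(hoa_up.shape)  # (16, 256): upmixed 16-channel HOA frames
```

The key structural point this illustrates is the decoupling claimed in the paper: the generator never estimates explicit source directions; it maps FOA conditioning directly to HOA latents, and spatial resolution is determined by the decoder's output order rather than by a localization step.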