AudioGS: Spectrogram-Based Audio Gaussian Splatting for Sound Field Reconstruction

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Synthesizing high-fidelity binaural audio from sparse observations remains challenging: existing approaches rely on visual priors and struggle to model fine-grained acoustic structure. This work proposes AudioGS, which adapts 3D Gaussian Splatting to the audio domain and introduces an explicit, vision-free representation of the sound field. It decomposes spectrograms into a set of Audio Gaussians, one per time-frequency bin, each carrying dual spherical harmonic (SH) coefficients and a decay coefficient for distance attenuation. Binaural signals for a target head pose are rendered by evaluating the SH field, applying geometry-guided distance attenuation and phase correction, and reconstructing the waveform. Although purely audio-driven, AudioGS outperforms state-of-the-art vision-dependent methods on the Replay-NVAS dataset, reducing the magnitude reconstruction error (MAG) by over 14% and the DPAM perceptual distance (lower is better) by approximately 25% relative to the best-performing visual-guided method.
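To make the "dual SH coefficients" idea concrete, here is a minimal sketch of how per-ear directional gains could be evaluated for a batch of time-frequency bins. It is an illustration under stated assumptions, not the authors' implementation: the SH degree (1), the basis ordering, and the names `sh_basis_deg1` and `directional_gains` are all hypothetical, and the summary does not say which SH degree or sign convention AudioGS uses.

```python
import numpy as np

# Standard real spherical-harmonic constants for degrees 0 and 1.
SH_C0 = 0.28209479177387814  # Y_0^0
SH_C1 = 0.4886025119029199   # common scale of the three degree-1 terms


def sh_basis_deg1(direction):
    """Real SH basis up to degree 1 for a 3D direction (normalised here).

    Assumed ordering: [Y00, Y1-1 (y), Y10 (z), Y11 (x)].
    """
    d = np.asarray(direction, dtype=float)
    x, y, z = d / np.linalg.norm(d)
    return np.array([SH_C0, SH_C1 * y, SH_C1 * z, SH_C1 * x])


def directional_gains(sh_left, sh_right, direction):
    """Evaluate left/right directional gains for every bin.

    sh_left, sh_right: arrays of shape (n_bins, 4), one SH coefficient
    vector per time-frequency bin and per ear (the assumed "dual" sets).
    Returns two arrays of shape (n_bins,).
    """
    basis = sh_basis_deg1(direction)
    return sh_left @ basis, sh_right @ basis


if __name__ == "__main__":
    # Two hypothetical bins: one omnidirectional, one favouring +x.
    sh_l = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])
    sh_r = np.zeros_like(sh_l)
    left, right = directional_gains(sh_l, sh_r, direction=[1.0, 0.0, 0.0])
    print(left, right)
```

Keeping a separate coefficient set per ear lets the same bin respond differently at the left and right channels, which is one plausible reading of how a single Audio Gaussian can encode binaural directionality.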

📝 Abstract

Spatial audio is fundamental to immersive virtual experiences, yet synthesizing high-fidelity binaural audio from sparse observations remains a significant challenge. Existing methods typically rely on implicit neural representations conditioned on visual priors, which often struggle to capture fine-grained acoustic structures. Inspired by 3D Gaussian Splatting (3DGS), we introduce AudioGS, a novel visual-free framework that explicitly encodes the sound field as a set of Audio Gaussians based on spectrograms. AudioGS associates each time-frequency bin with an Audio Gaussian equipped with dual Spherical Harmonic (SH) coefficients and a decay coefficient. For a target pose, we render binaural audio by evaluating the SH field to capture directionality, incorporating geometry-guided distance attenuation and phase correction, and reconstructing the waveform. Experiments on the Replay-NVAS dataset demonstrate that AudioGS successfully captures complex spatial cues and outperforms state-of-the-art visual-dependent baselines. Specifically, AudioGS reduces the magnitude reconstruction error (MAG) by over 14% and lowers the perceptual quality metric (DPAM) by approximately 25% compared to the best-performing visual-guided method.
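The attenuation and phase-correction step of the rendering described above can be sketched for a single complex spectrogram bin as follows. The abstract does not give the formulas, so everything here is an assumption chosen for illustration: the exponential-times-inverse-distance attenuation law, the propagation-delay phase term, the fixed speed of sound, and the function name `render_bin` are all hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed constant for this sketch


def render_bin(source_bin, gain, decay, distance, freq_hz):
    """Render one complex time-frequency bin at a listener position.

    source_bin: complex STFT value stored for this bin.
    gain:       directional gain (e.g. from an SH evaluation).
    decay:      per-bin decay coefficient (assumed exponential law).
    distance:   source-to-listener distance in metres.
    freq_hz:    centre frequency of the bin in Hz.
    """
    # Assumed attenuation: learned exponential decay times 1/d spreading,
    # clamped to avoid division by zero very close to the source.
    attenuation = np.exp(-decay * distance) / max(distance, 1e-3)
    # Assumed phase correction: delay of distance / c applied as a
    # frequency-dependent phase rotation.
    phase = np.exp(-2j * np.pi * freq_hz * distance / SPEED_OF_SOUND)
    return source_bin * gain * attenuation * phase


if __name__ == "__main__":
    # Doubling the distance with zero decay halves the magnitude.
    near = render_bin(1 + 0j, gain=1.0, decay=0.0, distance=1.0, freq_hz=500.0)
    far = render_bin(1 + 0j, gain=1.0, decay=0.0, distance=2.0, freq_hz=500.0)
    print(abs(near), abs(far))
```

Applying this per ear and per bin, then inverting the resulting complex spectrograms with an inverse STFT, would give the two output waveforms; the actual reconstruction procedure used by AudioGS is not specified in the abstract.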

Problem

Research questions and friction points this paper is trying to address.

spatial audio

binaural audio synthesis

sound field reconstruction

high-fidelity audio

sparse observations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio Gaussian Splatting

spectrogram-based representation

visual-free sound field reconstruction