🤖 AI Summary
This work addresses vision-guided 3D acoustic modeling: synthesizing realistic impact sounds from an object’s 3D geometry and appearance, and inversely localizing the 3D position of an impact from the sound it produces. To this end, the authors introduce the first visual-acoustic paired dataset aligned in 3D, collected with a custom-built audio-visual acquisition pipeline. They propose a unified framework that combines a feature-augmented 3D Gaussian Splatting representation with a conditional diffusion model, enabling end-to-end co-optimization of sound synthesis and sound-source localization and supporting cross-modal inference in both directions. Evaluated on the new dataset, the approach generates high-fidelity impact sounds and achieves centimeter-level 3D impact localization accuracy, substantially outperforming existing 2D or unidirectional methods.
📝 Abstract
Objects produce different sounds when hit, and humans can intuitively infer how an object might sound based on its appearance and material properties. Inspired by this intuition, we propose Visual Acoustic Fields, a framework that bridges hitting sounds and visual signals within a 3D space using 3D Gaussian Splatting (3DGS). Our approach features two key modules: sound generation and sound localization. The sound generation module leverages a conditional diffusion model, which takes multiscale features rendered from a feature-augmented 3DGS to generate realistic hitting sounds. Meanwhile, the sound localization module enables querying the 3D scene, represented by the feature-augmented 3DGS, to localize hitting positions based on the sound sources. To support this framework, we introduce a novel pipeline for collecting scene-level visual-sound sample pairs, achieving alignment between captured images, impact locations, and corresponding sounds. To the best of our knowledge, this is the first dataset to connect visual and acoustic signals in a 3D context. Extensive experiments on our dataset demonstrate the effectiveness of Visual Acoustic Fields in generating plausible impact sounds and accurately localizing impact sources. Our project page is at https://yuelei0428.github.io/projects/Visual-Acoustic-Fields/.
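To make the sound localization idea concrete, here is a minimal, hypothetical sketch of querying a feature-augmented 3DGS scene with a sound embedding. All names, shapes, and the cosine-similarity matching rule are illustrative assumptions, not the paper's actual implementation: each Gaussian is assumed to carry a 3D center and a learned feature vector, and the predicted impact location is the center of the Gaussian whose feature best matches the sound embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a feature-augmented 3DGS scene: each Gaussian
# has a 3D center and a unit-norm feature vector (shapes are assumptions).
num_gaussians, feat_dim = 500, 32
centers = rng.normal(size=(num_gaussians, 3))          # Gaussian 3D positions
features = rng.normal(size=(num_gaussians, feat_dim))  # per-Gaussian features
features /= np.linalg.norm(features, axis=1, keepdims=True)

def localize_impact(sound_embedding: np.ndarray) -> np.ndarray:
    """Return the 3D center of the Gaussian whose feature is most
    similar (cosine similarity) to the query sound embedding."""
    q = sound_embedding / np.linalg.norm(sound_embedding)
    scores = features @ q          # cosine similarity against every Gaussian
    return centers[np.argmax(scores)]

# Toy query: an embedding aligned with Gaussian 123's feature should
# localize to that Gaussian's center.
pred = localize_impact(features[123])
assert np.allclose(pred, centers[123])
```

In this toy setup the query embedding is a copy of one Gaussian's feature, so the nearest-feature lookup recovers that Gaussian's position exactly; in the actual framework the embedding would come from a learned audio encoder and the scene features from training the feature-augmented 3DGS.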