🤖 AI Summary
Existing methods struggle to synthesize physically plausible spatial audio from novel viewpoints, primarily because they neglect critical acoustic factors such as global geometric structure and material semantics. This work proposes the first physics-aware framework for novel-view acoustic synthesis: it reconstructs a 3D acoustic environment from multi-view images and integrates physical semantic priors, such as material properties and scene layout, extracted via vision-language models. By jointly leveraging geometric and semantic cues, the method drives audio generation from both signals and significantly outperforms current approaches on the RWAVS dataset, improving both the perceptual realism and the physical consistency of the generated binaural audio.
📝 Abstract
Spatial audio is essential for immersive experiences, yet novel-view acoustic synthesis (NVAS) remains challenging due to complex physical phenomena such as reflection, diffraction, and material absorption. Existing methods based on single-view or panoramic inputs improve spatial fidelity but fail to capture global geometry and semantic cues such as object layout and material properties. To address this, we propose Phys-NVAS, the first physics-aware NVAS framework that integrates spatial geometry modeling with vision-language semantic priors. A global 3D acoustic environment is reconstructed from multi-view images and depth maps to estimate room size and shape, enhancing spatial awareness of sound propagation. Meanwhile, a vision-language model extracts physics-aware priors over objects, layouts, and materials, capturing absorption and reflection effects that geometry alone cannot. An acoustic feature fusion adapter unifies these cues into a physics-aware representation for binaural generation. Experiments on the RWAVS dataset demonstrate that Phys-NVAS yields binaural audio with improved realism and physical consistency.
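To make the fusion step concrete, here is a minimal, purely illustrative sketch of what an "acoustic feature fusion adapter" could look like: a geometric embedding (room size and layout cues) and a semantic embedding (material absorption and reflection cues) are blended into one physics-aware feature vector. The function name, the embeddings, and the gated-average fusion rule are all assumptions for illustration; the paper's actual adapter architecture is not specified in the abstract.

```python
# Hypothetical sketch of a feature-fusion adapter (NOT the paper's
# implementation): blend a geometry embedding and a semantic embedding
# into a single physics-aware representation via a gated average.

def fuse_acoustic_features(geom_feat, sem_feat, alpha=0.5):
    """Blend geometry and semantic feature vectors of equal dimension.

    alpha controls the relative weight of the geometric cues
    (alpha=1.0 uses geometry only, alpha=0.0 semantics only).
    """
    assert len(geom_feat) == len(sem_feat), "embeddings must share a dimension"
    return [alpha * g + (1.0 - alpha) * s for g, s in zip(geom_feat, sem_feat)]

# Toy embeddings standing in for encoded room-shape and material cues.
geom = [1.0, 0.0, 0.5]   # e.g. encoded room size / layout
sem = [0.0, 1.0, 0.5]    # e.g. encoded material absorption
fused = fuse_acoustic_features(geom, sem, alpha=0.5)
print(fused)  # [0.5, 0.5, 0.5]
```

In a real system the fused vector would condition the binaural audio generator; here the equal-weight blend simply shows how the two cue streams can share one representation.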