🤖 AI Summary
This work proposes the Image2AVScene task, which aims to generate explorable 3D audiovisual scenes from a single image to enable immersive experiences through joint visual and spatial audio rendering. The method integrates image outpainting, 3D scene reconstruction, language-guided sound anchoring, and Ambisonics-based audio rendering, achieving, for the first time, synchronized free-viewpoint audiovisual synthesis driven solely by a single input image. Evaluated on a newly curated real-world dataset, the approach demonstrates strong performance, with user studies confirming high perceptual quality. Furthermore, the framework unlocks novel applications such as one-shot acoustic learning and audiovisual source separation, highlighting the potential and versatility of 3D audiovisual scene generation.
📝 Abstract
Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360° panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation. Project website: https://humathe.github.io/sonoworld/
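The abstract does not detail how the ambisonic rendering is implemented, but the core idea of encoding a point source into spatial audio can be sketched with standard first-order Ambisonics. The snippet below is a generic illustration, not the paper's method: it encodes a mono signal into the four AmbiX channels (ACN ordering W, Y, Z, X with SN3D normalization) given the source direction relative to the listener; the function name `encode_foa` is hypothetical.

```python
import numpy as np

def encode_foa(signal, azimuth_deg, elevation_deg):
    """Encode a mono point source into first-order Ambisonics.

    Uses the AmbiX convention: ACN channel order (W, Y, Z, X)
    with SN3D normalization. Azimuth is measured counterclockwise
    from straight ahead; elevation upward from the horizontal plane.
    """
    az = np.deg2rad(azimuth_deg)
    el = np.deg2rad(elevation_deg)
    gains = np.array([
        1.0,                      # W: omnidirectional component
        np.sin(az) * np.cos(el),  # Y: left-right
        np.sin(el),               # Z: up-down
        np.cos(az) * np.cos(el),  # X: front-back
    ])
    # Broadcast the direction-dependent gains over the mono signal:
    # result has shape (4, num_samples).
    return gains[:, None] * np.asarray(signal, dtype=float)[None, :]
```

For a source directly ahead (azimuth 0°, elevation 0°), only the W and X channels carry the signal; as the listener moves through the scene, re-encoding with the updated direction yields the free-viewpoint spatial audio the paper describes. Areal and ambient sources would be handled differently (e.g. by spreading or diffuse-field encoding), which this point-source sketch does not cover.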