🤖 AI Summary
This work proposes the Image2AVScene task, which aims to generate explorable 3D audiovisual scenes from a single image to enable immersive experiences through joint visual and spatial audio rendering. The method integrates image outpainting, 3D scene reconstruction, language-guided sound anchoring, and Ambisonics-based audio rendering, achieving, for the first time, synchronized free-viewpoint audiovisual synthesis driven solely by a single input image. Evaluated on a newly curated real-world dataset, the approach demonstrates strong performance, with user studies confirming high perceptual quality. Furthermore, the framework unlocks novel applications such as one-shot acoustic learning and audiovisual source separation, highlighting the potential and versatility of 3D audiovisual scene generation.
📝 Abstract
Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360° panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation. Project website: https://humathe.github.io/sonoworld/
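The abstract does not detail how the ambisonic rendering is implemented, but the core idea of encoding a point source into spatial audio can be sketched with standard first-order Ambisonics. The snippet below is a generic illustration, not the paper's method: it encodes a mono signal into the four AmbiX channels (ACN ordering W, Y, Z, X with SN3D normalization) given the source direction relative to the listener; the function name `encode_foa` is hypothetical.

```python
import numpy as np

def encode_foa(signal, azimuth_deg, elevation_deg):
    """Encode a mono point source into first-order Ambisonics.

    Uses the AmbiX convention: ACN channel order (W, Y, Z, X)
    with SN3D normalization. Azimuth is measured counterclockwise
    from straight ahead; elevation upward from the horizontal plane.
    """
    az = np.deg2rad(azimuth_deg)
    el = np.deg2rad(elevation_deg)
    gains = np.array([
        1.0,                      # W: omnidirectional component
        np.sin(az) * np.cos(el),  # Y: left-right
        np.sin(el),               # Z: up-down
        np.cos(az) * np.cos(el),  # X: front-back
    ])
    # Broadcast the direction-dependent gains over the mono signal:
    # result has shape (4, num_samples).
    return gains[:, None] * np.asarray(signal, dtype=float)[None, :]
```

For a source directly ahead (azimuth 0°, elevation 0°), only the W and X channels carry the signal; as the listener moves through the scene, re-encoding with the updated direction yields the free-viewpoint spatial audio the paper describes. Areal and ambient sources would be handled differently (e.g. by spreading or diffuse-field encoding), which this point-source sketch does not cover.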