Sonic4D: Spatial Audio Generation for Immersive 4D Scene Exploration

📅 2025-06-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing 4D generation methods focus primarily on visual reconstruction, largely neglecting spatial audio synthesis temporally and geometrically aligned with dynamic 3D scenes—thus limiting immersive audiovisual experiences. This paper introduces the first training-free, end-to-end 4D audiovisual co-generation framework: it reconstructs a dynamic 4D scene from monocular video and extracts mono audio, enables pixel-accurate visual grounding for 3D sound-source trajectory tracking, and synthesizes physically grounded spatial audio via head-related transfer functions (HRTFs) and geometric acoustics. Key contributions include: (1) zero-shot, pixel-level visual sound-source localization; and (2) multi-view-consistent spatial audio generation with strict spatiotemporal alignment. Experiments demonstrate significant improvements over baselines in audio fidelity, spatial consistency, and scene–audio alignment. The method supports real-time, interactive 4D immersive experiences.

Technology Category

Application Category

📝 Abstract
Recent advancements in 4D generation have demonstrated its remarkable capability in synthesizing photorealistic renderings of dynamic 3D scenes. However, despite achieving impressive visual performance, almost all existing methods overlook the generation of spatial audio aligned with the corresponding 4D scenes, posing a significant limitation to truly immersive audiovisual experiences. To mitigate this issue, we propose Sonic4D, a novel framework that enables spatial audio generation for immersive exploration of 4D scenes. Specifically, our method is composed of three stages: 1) To capture both the dynamic visual content and raw auditory information from a monocular video, we first employ pre-trained expert models to generate the 4D scene and its corresponding monaural audio. 2) Subsequently, to transform the monaural audio into spatial audio, we localize and track the sound sources within the 4D scene, where their 3D spatial coordinates at different timestamps are estimated via a pixel-level visual grounding strategy. 3) Based on the estimated sound source locations, we further synthesize plausible spatial audio that varies across different viewpoints and timestamps using physics-based simulation. Extensive experiments have demonstrated that our proposed method generates realistic spatial audio consistent with the synthesized 4D scene in a training-free manner, significantly enhancing the immersive experience for users. Generated audio and video examples are available at https://x-drunker.github.io/Sonic4D-project-page.
Problem

Research questions and friction points this paper is trying to address.

Generates spatial audio for 4D scenes
Transforms monaural audio into spatial audio
Enhances immersive audiovisual experience
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates spatial audio for 4D scenes
Uses pixel-level visual grounding strategy
Employs physics-based audio simulation
🔎 Similar Papers
S
Siyi Xie
University of Science and Technology of China
Hanxin Zhu
Hanxin Zhu
Phd Student of University of Science and Technology of China
3D/4D Reconstruction3D/4D Generation3D/4D Understanding
Tianyu He
Tianyu He
Microsoft Research
machine learninggenerative modelsworld models
X
Xin Li
University of Science and Technology of China
Z
Zhibo Chen
University of Science and Technology of China