Controllable Audio-Visual Viewpoint Generation from 360° Spatial Information

📅 2025-10-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing methods struggle to achieve fine-grained, viewpoint-controllable audio-visual generation in 360° immersive environments; in particular, they fail to model the influence of off-screen events on the target viewpoint. To address this, we propose the first controllable audio-visual generation framework tailored to 360° scenes. Our method uses a panoramic saliency map, a bounding-box-aware signed distance map, and a textual description of the scene as spatially aware conditioning inputs that guide a diffusion model in synthesizing viewpoint-aligned audio-visual content. The joint geometric and semantic conditioning encodes contextual influences from unseen regions during generation. The audio-visual examples indicate that the approach follows the control signals while keeping the generated viewpoint consistent with the surrounding scene, improving both realism and controllability in immersive audio-visual experiences.

📝 Abstract

The generation of sounding videos has advanced significantly with the advent of diffusion models. However, existing methods often lack the fine-grained control needed to generate viewpoint-specific content from larger, immersive 360-degree environments. This limitation restricts the creation of audio-visual experiences that are aware of off-camera events. To the best of our knowledge, this is the first work to introduce a framework for controllable audio-visual generation that addresses this unexplored gap. Specifically, we propose a diffusion model conditioned on a set of signals derived from the full 360-degree space: a panoramic saliency map to identify regions of interest, a bounding-box-aware signed distance map to define the target viewpoint, and a descriptive caption of the entire scene. By integrating these controls, our model generates spatially aware viewpoint video and audio that are coherently influenced by the broader, unseen environmental context, providing the strong controllability that is essential for realistic and immersive audio-visual generation. We show audio-visual examples demonstrating the effectiveness of our framework.
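To make the "bounding-box-aware signed distance map" concrete, here is a minimal sketch of one plausible construction. The abstract does not specify how the map is computed, so everything below is an assumption: the function name `box_signed_distance_map`, the pixel-space box format `(x_min, y_min, x_max, y_max)`, the sign convention (negative inside the box, positive outside), and the use of plain Euclidean distance on the equirectangular image. It ignores horizontal wrap-around and spherical distortion, which a real 360° pipeline would likely need to handle.

```python
import numpy as np


def box_signed_distance_map(height, width, box):
    """Signed distance from every pixel to an axis-aligned box.

    Hypothetical helper, not the paper's implementation.
    box: (x_min, y_min, x_max, y_max) in pixel coordinates.
    Returns an (height, width) float array that is negative inside
    the box, zero on its boundary, and positive outside.
    """
    x_min, y_min, x_max, y_max = box
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float32)

    # Box centre and half-extents.
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    hx, hy = (x_max - x_min) / 2.0, (y_max - y_min) / 2.0

    # Per-axis distance to the box faces (negative when inside along that axis).
    dx = np.abs(xs - cx) - hx
    dy = np.abs(ys - cy) - hy

    # Standard box SDF: Euclidean distance outside, largest face distance inside.
    outside = np.sqrt(np.maximum(dx, 0.0) ** 2 + np.maximum(dy, 0.0) ** 2)
    inside = np.minimum(np.maximum(dx, dy), 0.0)
    return outside + inside


sdf = box_signed_distance_map(8, 16, (4, 2, 10, 6))
```

A map like this tells each panorama pixel both whether it falls inside the target viewpoint and how far away it is, which is one way a model could weigh off-camera content by its proximity to the view.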

Problem

Research questions and friction points this paper is trying to address.

Generating viewpoint-specific content from 360-degree environments

Lack of fine-grained control in audio-visual generation

Creating immersive audio-visual experiences with spatial awareness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses panoramic saliency maps to identify regions of interest

Employs bounding-box-aware signed distance maps to define the target viewpoint

Integrates descriptive scene captions for context coherence
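As an illustration of how the two spatial signals listed above might be combined, the sketch below normalizes a saliency map and a signed distance map and stacks them into a single conditioning array. The source does not describe how the signals are fed to the diffusion model, so channel stacking, the normalization choices, and the name `build_spatial_condition` are all assumptions. The scene caption would typically go through a separate text encoder and is left out here.

```python
import numpy as np


def build_spatial_condition(saliency, sdf):
    """Stack a saliency map and a signed distance map into a (2, H, W) array.

    Hypothetical helper, not the paper's implementation.
    saliency: (H, W) array of non-normalized saliency scores.
    sdf:      (H, W) signed distance map for the target viewpoint.
    """
    if saliency.shape != sdf.shape:
        raise ValueError("saliency and sdf must have the same shape")

    # Min-max scale saliency to [0, 1]; epsilon avoids division by zero.
    s_min, s_max = saliency.min(), saliency.max()
    sal = (saliency - s_min) / (s_max - s_min + 1e-8)

    # Scale the signed distance to [-1, 1], preserving its sign.
    sdf_norm = sdf / (np.abs(sdf).max() + 1e-8)

    return np.stack([sal, sdf_norm], axis=0).astype(np.float32)


rng = np.random.default_rng(0)
cond = build_spatial_condition(rng.random((4, 8)), rng.normal(size=(4, 8)))
```

Keeping both maps on a shared pixel grid means a convolutional or patch-based encoder can read them together, but other designs (separate encoders, cross-attention) are equally plausible.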

🔎 Similar Papers

SEE-2-SOUND: Zero-Shot Spatial Environment-to-Spatial Sound