StereoSync: Spatially-Aware Stereo Audio Generation from Video

📅 2025-10-07
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing video-to-audio generation methods focus on temporal alignment while neglecting spatial structure, yielding audio that lacks spatial immersion and scene consistency. To address this, we propose the first spatially-aware framework for the task, built on a diffusion model with dual-modality alignment. Our method extracts 3D spatial cues, including depth maps and target bounding boxes, and uses them as cross-attention conditioning to guide stereo audio generation, enabling joint spatiotemporal alignment between audio and video. By leveraging pretrained visual foundation models, it implicitly models dynamic sound-source localization without requiring spatial audio annotations. Evaluated on the Walking The Maps dataset, our approach significantly improves audio realism and immersive quality. Both qualitative and quantitative evaluations demonstrate state-of-the-art performance, outperforming prior methods on metrics including Fréchet Audio Distance (FAD), stereo perceptual similarity, and human preference scores.
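To make the conditioning mechanism concrete, here is a minimal sketch of how depth- and bounding-box-derived tokens could be injected into a diffusion denoiser via cross-attention. The module names, dimensions, and token layout are illustrative assumptions, not the paper's implementation.

```python
# Sketch: audio latents (queries) attend to spatial-cue tokens (keys/values).
# Dimensions and layout are assumptions for illustration only.
import torch
import torch.nn as nn

class SpatialCrossAttention(nn.Module):
    def __init__(self, audio_dim=512, cue_dim=256, heads=8):
        super().__init__()
        self.proj = nn.Linear(cue_dim, audio_dim)  # map cue tokens to audio width
        self.attn = nn.MultiheadAttention(audio_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)

    def forward(self, audio_tokens, cue_tokens):
        # audio_tokens: (B, T, audio_dim) latent audio sequence in the denoiser
        # cue_tokens:   (B, N, cue_dim)   depth-map / bounding-box embeddings
        cues = self.proj(cue_tokens)
        out, _ = self.attn(query=audio_tokens, key=cues, value=cues)
        return self.norm(audio_tokens + out)  # residual connection

# Example: 100 latent audio frames attending to 64 spatial-cue tokens
block = SpatialCrossAttention()
audio = torch.randn(2, 100, 512)
cues = torch.randn(2, 64, 256)
print(block(audio, cues).shape)  # torch.Size([2, 100, 512])
```

In this layout the audio stream queries the visual cues, so each audio frame can weight the spatial evidence (e.g., where a sound source sits in the frame) most relevant to it.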

📝 Abstract
Although audio generation has been widely studied over recent years, video-aligned audio generation remains a relatively unexplored frontier. To address this gap, we introduce StereoSync, a novel and efficient model designed to generate audio that is both temporally synchronized with a reference video and spatially aligned with its visual context. StereoSync also achieves efficiency by leveraging pretrained foundation models, reducing the need for extensive training while maintaining high-quality synthesis. Unlike existing methods that focus primarily on temporal synchronization, StereoSync introduces a significant advancement by incorporating spatial awareness into video-aligned audio generation. Given an input video, our approach extracts spatial cues from depth maps and bounding boxes and uses them as cross-attention conditioning in a diffusion-based audio generation model. This allows StereoSync to go beyond simple synchronization, producing stereo audio that dynamically adapts to the spatial structure and movement of a video scene. We evaluate StereoSync on Walking The Maps, a curated dataset of video game footage featuring animated characters walking through diverse environments. Experimental results demonstrate that StereoSync achieves both temporal and spatial alignment, advancing the state of the art in video-to-audio generation and yielding a significantly more immersive and realistic audio experience.
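As a concrete illustration of the cue extraction described in the abstract, the sketch below converts a per-frame depth map and a detection box into a compact spatial-cue vector (normalized box geometry plus mean depth in the box). The 5-dimensional layout and the pooling scheme are assumptions for illustration; the paper does not specify its exact feature format.

```python
# Sketch: turn a depth map and a bounding box into a spatial-cue vector.
# The depth map and box would come from pretrained estimators/detectors.
import torch

def cue_vector(depth, box, img_w, img_h):
    """depth: (H, W) depth map; box: (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    region = depth[y1:y2, x1:x2]
    mean_depth = float(region.mean()) if region.numel() else float(depth.mean())
    cx = (x1 + x2) / (2 * img_w)   # horizontal position -> stereo panning cue
    cy = (y1 + y2) / (2 * img_h)
    w, h = (x2 - x1) / img_w, (y2 - y1) / img_h
    return torch.tensor([cx, cy, w, h, mean_depth])

depth = torch.rand(480, 640)            # stand-in for a pretrained depth estimate
box = (100, 200, 180, 460)              # stand-in for a detector's bounding box
print(cue_vector(depth, box, 640, 480)) # tensor of shape (5,)
```

The horizontal box center is the cue most directly tied to stereo output (left/right placement), while depth plausibly modulates loudness and reverberance.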
Problem

Research questions and friction points this paper is trying to address.

Generating spatially-aware stereo audio synchronized with video
Incorporating spatial cues from depth maps and bounding boxes
Advancing video-to-audio generation beyond temporal synchronization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates stereo audio spatially aligned with the video scene
Uses depth maps and bounding boxes as cross-attention conditioning
Leverages pretrained foundation models for efficient, high-quality synthesis (see the end-to-end sampling sketch below)
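For completeness, a hedged end-to-end sketch: a generic DDPM-style sampling loop that threads the spatial-cue tokens through a conditional denoiser to produce a stereo signal. The `denoiser` stand-in, noise schedule, and output shape are assumptions; StereoSync's actual latent space, sampler, and decoder are not specified here.

```python
# Sketch: generic conditional DDPM sampling with spatial-cue conditioning.
import torch

@torch.no_grad()
def sample(denoiser, cue_tokens, steps=50, shape=(1, 2, 16000)):
    """cue_tokens: (B, N, D) spatial conditioning; shape: (B, 2 channels, samples)."""
    x = torch.randn(shape)                      # start from Gaussian noise
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    for t in reversed(range(steps)):
        eps = denoiser(x, t, cue_tokens)        # predict noise, conditioned on cues
        x = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:                               # add noise except at the final step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x                                    # stereo audio (or latents to decode)

# Dummy denoiser so the sketch runs end to end; a real model would use
# cross-attention over the cue tokens, as in the earlier sketch.
dummy = lambda x, t, cues: torch.zeros_like(x)
audio = sample(dummy, cue_tokens=torch.randn(1, 64, 5), steps=10)
print(audio.shape)  # torch.Size([1, 2, 16000])
```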