🤖 AI Summary
Existing video-to-audio generation methods are predominantly monaural and lack object-aware stereo modeling, primarily due to the absence of spatially accurate, semantically rich stereo audio-visual paired datasets. To address this, we propose the first end-to-end object-aware stereo audio generation framework. Our method introduces a synthetic data pipeline that integrates object tracking with dynamic panning-based localization to construct high-fidelity video–stereo audio pairs. We further design a distance-aware loudness control module and a dynamic panning mechanism to achieve spatially precise audio synthesis. Additionally, we define a stereo object-awareness metric and validate its perceptual relevance through psychoacoustic evaluation. The base model achieves state-of-the-art performance in semantic and temporal alignment; fine-tuning on our synthetic dataset significantly improves object–audio spatial correspondence. Human listening experiments confirm substantial gains in stereo realism and spatial fidelity.
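The summary's synthetic pipeline pairs tracked object positions with dynamic panning and distance-based loudness. As a rough illustration of the kind of signal processing this implies, here is a minimal Python sketch assuming a constant-power pan law and inverse-distance attenuation; the `spatialize` function, its parameters, and the specific laws are assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def spatialize(mono: np.ndarray, pan: np.ndarray, dist: np.ndarray,
               ref_dist: float = 1.0) -> np.ndarray:
    """Render a mono source to stereo with per-sample pan and distance gain.

    pan  -- horizontal position per sample in [0, 1] (0 = left, 1 = right)
    dist -- source distance per sample; loudness follows an inverse law
    """
    theta = pan * (np.pi / 2)                      # constant-power pan law
    gain = ref_dist / np.maximum(dist, ref_dist)   # inverse-distance attenuation
    left = mono * np.cos(theta) * gain
    right = mono * np.sin(theta) * gain
    return np.stack([left, right], axis=-1)        # shape: (samples, 2)

# Example: a 440 Hz source sweeping left-to-right while receding from the camera
sr = 48_000
t = np.linspace(0, 2, 2 * sr, endpoint=False)
mono = 0.5 * np.sin(2 * np.pi * 440 * t)
pan = np.linspace(0.0, 1.0, t.size)    # tracked x-position, normalized
dist = np.linspace(1.0, 4.0, t.size)   # object moving away
stereo = spatialize(mono, pan, dist)
```

A constant-power law keeps perceived loudness roughly stable as a source sweeps across the stereo field, which is why it is a common default in mixing tools; whether the paper uses this exact law is not stated here.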
📝 Abstract
We present StereoFoley, a video-to-audio generation framework that produces semantically aligned, temporally synchronized, and spatially accurate stereo sound at 48 kHz. While recent generative video-to-audio models achieve strong semantic and temporal fidelity, they remain largely limited to mono or fail to deliver object-aware stereo imaging, constrained by the lack of professionally mixed, spatially accurate video-to-audio datasets. First, we develop and train a base model that generates stereo audio from video, achieving state-of-the-art performance in both semantic accuracy and synchronization. Next, to overcome dataset limitations, we introduce a synthetic data generation pipeline that combines video analysis, object tracking, and audio synthesis with dynamic panning and distance-based loudness controls, enabling spatially accurate, object-aware sound. Finally, we fine-tune the base model on this synthetic dataset, yielding clear object–audio correspondence. Since no established metrics exist for this task, we introduce stereo object-awareness measures and validate them through a human listening study, showing strong correlation with human perception. This work establishes the first end-to-end framework for stereo object-aware video-to-audio generation, addressing a critical gap and setting a new benchmark in the field.
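The stereo object-awareness measures themselves are not detailed in the abstract. Purely as a hypothetical sketch of what such a measure could look like, the Python below correlates a pan trajectory estimated from channel energies with the tracked object's normalized horizontal position; every function name, frame size, and design choice here is an assumption, not the paper's actual metric:

```python
import numpy as np

def estimated_pan(stereo: np.ndarray, frame: int = 960) -> np.ndarray:
    """Per-frame pan estimate in [0, 1] from the left/right energy ratio."""
    n = stereo.shape[0] // frame
    x = stereo[: n * frame].reshape(n, frame, 2)
    e = np.sqrt((x ** 2).mean(axis=1)) + 1e-12    # RMS energy per channel
    return e[:, 1] / (e[:, 0] + e[:, 1])          # 0 = hard left, 1 = hard right

def object_awareness_score(stereo: np.ndarray, obj_x: np.ndarray) -> float:
    """Correlate the audio pan trajectory with the tracked object's
    normalized horizontal position, resampled to a common length."""
    pan = estimated_pan(stereo)
    obj = np.interp(np.linspace(0, 1, pan.size),
                    np.linspace(0, 1, obj_x.size), obj_x)
    return float(np.corrcoef(pan, obj)[0, 1])     # Pearson correlation
```

Under this hypothetical formulation, a score near 1 would indicate that the stereo image tracks the on-screen object; the paper's actual measures and their psychoacoustic validation may differ substantially.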