🤖 AI Summary
Existing video-to-audio generation methods are predominantly monaural and lack object-aware stereo modeling, primarily due to the absence of spatially accurate, semantically rich stereo audio-visual paired datasets. To address this, we propose the first end-to-end object-aware stereo audio generation framework. Our method introduces a synthetic data pipeline that integrates object tracking with dynamic panning-based localization to construct high-fidelity video–stereo audio pairs. We further design a distance-aware loudness control module and a dynamic panning mechanism to achieve spatially precise audio synthesis. Additionally, we define a stereo object-awareness metric and validate its perceptual relevance through psychoacoustic evaluation. The base model achieves state-of-the-art performance in semantic and temporal alignment; fine-tuning on our synthetic dataset significantly improves object–audio spatial correspondence. Human listening experiments confirm substantial gains in stereo realism and spatial fidelity.
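The summary's synthetic pipeline pairs tracked object positions with dynamic panning and distance-based loudness. As a rough illustration of the kind of signal processing this implies, here is a minimal Python sketch assuming a constant-power pan law and inverse-distance attenuation; the `spatialize` function, its parameters, and the specific laws are assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def spatialize(mono: np.ndarray, pan: np.ndarray, dist: np.ndarray,
               ref_dist: float = 1.0) -> np.ndarray:
    """Render a mono source to stereo with per-sample pan and distance gain.

    pan  -- horizontal position per sample in [0, 1] (0 = left, 1 = right)
    dist -- source distance per sample; loudness follows an inverse law
    """
    theta = pan * (np.pi / 2)                      # constant-power pan law
    gain = ref_dist / np.maximum(dist, ref_dist)   # inverse-distance attenuation
    left = mono * np.cos(theta) * gain
    right = mono * np.sin(theta) * gain
    return np.stack([left, right], axis=-1)        # shape: (samples, 2)

# Example: a 440 Hz source sweeping left-to-right while receding from the camera
sr = 48_000
t = np.linspace(0, 2, 2 * sr, endpoint=False)
mono = 0.5 * np.sin(2 * np.pi * 440 * t)
pan = np.linspace(0.0, 1.0, t.size)    # tracked x-position, normalized
dist = np.linspace(1.0, 4.0, t.size)   # object moving away
stereo = spatialize(mono, pan, dist)
```

A constant-power law keeps perceived loudness roughly stable as a source sweeps across the stereo field, which is why it is a common default in mixing tools; whether the paper uses this exact law is not stated here.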
📝 Abstract
We present StereoFoley, a video-to-audio generation framework that produces semantically aligned, temporally synchronized, and spatially accurate stereo sound at 48 kHz. While recent generative video-to-audio models achieve strong semantic and temporal fidelity, they remain largely limited to mono or fail to deliver object-aware stereo imaging, constrained by the lack of professionally mixed, spatially accurate video-to-audio datasets. First, we develop and train a base model that generates stereo audio from video, achieving state-of-the-art performance in both semantic accuracy and synchronization. Next, to overcome dataset limitations, we introduce a synthetic data generation pipeline that combines video analysis, object tracking, and audio synthesis with dynamic panning and distance-based loudness controls, enabling spatially accurate, object-aware sound. Finally, we fine-tune the base model on this synthetic dataset, yielding clear object–audio correspondence. Since no established metrics exist for this task, we introduce stereo object-awareness measures and validate them through a human listening study, showing strong correlation with human perception. This work establishes the first end-to-end framework for stereo object-aware video-to-audio generation, addressing a critical gap and setting a new benchmark in the field.
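The stereo object-awareness measures themselves are not detailed in the abstract. Purely as a hypothetical sketch of what such a measure could look like, the Python below correlates a pan trajectory estimated from channel energies with the tracked object's normalized horizontal position; every function name, frame size, and design choice here is an assumption, not the paper's actual metric:

```python
import numpy as np

def estimated_pan(stereo: np.ndarray, frame: int = 960) -> np.ndarray:
    """Per-frame pan estimate in [0, 1] from the left/right energy ratio."""
    n = stereo.shape[0] // frame
    x = stereo[: n * frame].reshape(n, frame, 2)
    e = np.sqrt((x ** 2).mean(axis=1)) + 1e-12    # RMS energy per channel
    return e[:, 1] / (e[:, 0] + e[:, 1])          # 0 = hard left, 1 = hard right

def object_awareness_score(stereo: np.ndarray, obj_x: np.ndarray) -> float:
    """Correlate the audio pan trajectory with the tracked object's
    normalized horizontal position, resampled to a common length."""
    pan = estimated_pan(stereo)
    obj = np.interp(np.linspace(0, 1, pan.size),
                    np.linspace(0, 1, obj_x.size), obj_x)
    return float(np.corrcoef(pan, obj)[0, 1])     # Pearson correlation
```

Under this hypothetical formulation, a score near 1 would indicate that the stereo image tracks the on-screen object; the paper's actual measures and their psychoacoustic validation may differ substantially.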