StereoFoley: Object-Aware Stereo Audio Generation from Video

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video-to-audio generation methods are predominantly monaural and lack object-aware stereo modeling, primarily due to the absence of spatially accurate, semantically rich stereo audio-visual paired datasets. To address this, we propose the first end-to-end object-aware stereo audio generation framework. Our method introduces a synthetic data pipeline integrating object tracking and dynamic panning localization to construct high-fidelity video–stereo audio pairs. We further design a distance-aware loudness control module and a dynamic panning mechanism to achieve spatially precise audio synthesis. Additionally, we define a stereo object-awareness metric and validate its perceptual relevance through psychoacoustic evaluation. The base model achieves state-of-the-art performance in semantic and temporal alignment; fine-tuning on our synthetic dataset significantly improves object–audio spatial correspondence. Human listening experiments confirm substantial gains in stereo realism and spatial fidelity.

📝 Abstract
We present StereoFoley, a video-to-audio generation framework that produces semantically aligned, temporally synchronized, and spatially accurate stereo sound at 48 kHz. While recent generative video-to-audio models achieve strong semantic and temporal fidelity, they largely remain limited to mono or fail to deliver object-aware stereo imaging, constrained by the lack of professionally mixed, spatially accurate video-to-audio datasets. First, we develop and train a base model that generates stereo audio from video, achieving state-of-the-art results in both semantic accuracy and synchronization. Next, to overcome dataset limitations, we introduce a synthetic data generation pipeline that combines video analysis, object tracking, and audio synthesis with dynamic panning and distance-based loudness controls, enabling spatially accurate, object-aware sound. Finally, we fine-tune the base model on this synthetic dataset, yielding clear object-audio correspondence. Since no established metrics exist, we introduce stereo object-awareness measures and validate them through a human listening study, showing strong correlation with perception. This work establishes the first end-to-end framework for stereo object-aware video-to-audio generation, addressing a critical gap and setting a new benchmark in the field.
Problem

Research questions and friction points this paper is trying to address.

Generating spatially accurate stereo audio from video input
Overcoming limitations of existing mono audio generation models
Establishing object-aware sound correspondence without professional datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Base model that generates 48 kHz stereo audio directly from video
Synthetic data pipeline with dynamic panning and distance-based loudness controls
Fine-tuning on the synthetic dataset for object-aware stereo sound correspondence
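The dynamic panning and distance-based loudness controls in the synthetic data pipeline can be illustrated with a minimal sketch. The paper does not publish its exact formulation, so the snippet below assumes a standard constant-power pan law and simple inverse-distance attenuation; the function name and parameters are hypothetical.

```python
import numpy as np

def pan_and_attenuate(mono, pan, distance, ref_distance=1.0):
    """Render a mono source to stereo with per-sample spatial controls.

    mono: 1-D array of audio samples.
    pan: per-sample pan position in [-1, 1] (-1 = hard left, +1 = hard right),
        e.g. derived from a tracked object's horizontal screen position.
    distance: per-sample object distance, same units as ref_distance.
    Returns an (n, 2) stereo array.
    """
    pan = np.asarray(pan, dtype=float)
    distance = np.asarray(distance, dtype=float)
    # Constant-power pan law: theta sweeps 0..pi/2 as pan goes -1..+1,
    # so left^2 + right^2 = 1 at every position.
    theta = (pan + 1.0) * np.pi / 4.0
    left_gain = np.cos(theta)
    right_gain = np.sin(theta)
    # Inverse-distance attenuation, clamped at the reference distance
    # so nearby objects are not boosted above unity gain.
    gain = ref_distance / np.maximum(distance, ref_distance)
    sig = mono * gain
    return np.stack([sig * left_gain, sig * right_gain], axis=-1)
```

Driving `pan` and `distance` from an object tracker's trajectory, frame by frame, is what turns a dry mono Foley clip into a spatially accurate video-stereo training pair in a pipeline of this kind.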
Authors

Tornike Karchkhadze (UC San Diego)
Kuan-Lin Chen (Apple)
Mojtaba Heydari (Apple)
Robert Henzel (Apple)
Alessandro Toso (Apple)
Mehrez Souden (Sr. Manager, Apple Inc.; Audio and Speech Processing, Machine Learning, Signal Processing)
Joshua Atkins (Senior Manager, Audio Algorithms, Apple Inc.; Signal Processing for Audio and Speech)