🤖 AI Summary
Existing binaural audio generation models over-rely on room-acoustics priors, which costs spatial detail and limits volume/timbre controllability and azimuth estimation accuracy. This work addresses visually guided monaural-to-binaural generation. We propose: (1) a novel audio-visual conditional normalization layer that enables dynamic cross-modal feature alignment; (2) contrastive learning that mines negatives from shuffled visual features to sharpen sensitivity to subtle spatial disparities; and (3) a lightweight test-time video augmentation strategy that improves generalization. Our method is end-to-end trainable and requires no explicit room-parameter estimation. Evaluated on the FAIR-Play and MUSIC-Stereo benchmarks, it achieves state-of-the-art performance, significantly improving the spatial fidelity and audio-visual semantic consistency of the generated binaural audio.
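To make contribution (1) concrete, here is a minimal PyTorch sketch of what a visually conditioned normalization layer could look like, in the spirit of FiLM/AdaIN-style conditioning: audio features are normalized, then rescaled and shifted by parameters predicted from a visual embedding, so the visual context sets their mean and variance. The class name, shapes, and pooling are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class AudioVisualCondNorm(nn.Module):
    """Sketch of a visually conditioned normalization layer (hypothetical).

    Normalizes audio (spectrogram) features, then re-modulates them with a
    per-channel scale and shift predicted from visual features, so the
    visual context controls the features' mean and variance.
    """

    def __init__(self, audio_channels: int, visual_dim: int):
        super().__init__()
        # Parameter-free normalization; the affine part comes from vision.
        self.norm = nn.InstanceNorm2d(audio_channels, affine=False)
        self.to_gamma = nn.Linear(visual_dim, audio_channels)  # scale
        self.to_beta = nn.Linear(visual_dim, audio_channels)   # shift

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor):
        # audio_feat: (B, C, F, T) spectrogram features
        # visual_feat: (B, D) pooled visual embedding
        normed = self.norm(audio_feat)
        gamma = self.to_gamma(visual_feat)[:, :, None, None]  # (B, C, 1, 1)
        beta = self.to_beta(visual_feat)[:, :, None, None]
        return (1 + gamma) * normed + beta

# Toy usage: 64-channel audio features conditioned on a 512-d visual embedding.
layer = AudioVisualCondNorm(audio_channels=64, visual_dim=512)
out = layer(torch.randn(2, 64, 128, 32), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 64, 128, 32])
```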
📝 Abstract
Binaural audio generation (BAG) aims to convert monaural audio to stereo audio using visual prompts, which requires a deep understanding of spatial and semantic information. However, current models risk overfitting to room environments and losing fine-grained spatial details. In this paper, we propose a new audio-visual binaural generation model that incorporates an audio-visual conditional normalisation layer, which dynamically aligns the mean and variance of the target difference-audio features using visual context, together with a new contrastive learning method that enhances spatial sensitivity by mining negative samples from shuffled visual features. We also introduce a cost-efficient way to apply test-time augmentation on video data to improve performance. Our approach achieves state-of-the-art generation accuracy on the FAIR-Play and MUSIC-Stereo benchmarks.
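For the contrastive component, a standard way to realise "mining negative samples from shuffled visual features" is an InfoNCE-style loss: each audio embedding is pulled toward its matched visual embedding while visual embeddings shuffled across the batch serve as negatives. The sketch below assumes pooled audio and visual embeddings of equal dimension, and is not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def shuffled_visual_contrastive_loss(audio_emb: torch.Tensor,
                                     visual_emb: torch.Tensor,
                                     temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style sketch (assumed formulation): the matched visual
    embedding is the positive; visual embeddings from other (i.e. shuffled)
    batch positions act as negatives."""
    a = F.normalize(audio_emb, dim=-1)   # (B, D)
    v = F.normalize(visual_emb, dim=-1)  # (B, D)
    logits = a @ v.t() / temperature     # (B, B); diagonal = matched pairs
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)
```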
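The test-time augmentation is described only as a cost-efficient scheme on video data. One plausible instance, offered purely as an assumption, exploits the left/right symmetry of binaural audio: run the model once on the original frames and once on horizontally flipped frames, swap the flipped prediction's channels back, and average the two. The `model` signature below is hypothetical.

```python
import torch

def tta_binaural(model, mono_audio: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
    """Hypothetical flip-based test-time augmentation for binaural generation.

    Assumes model(mono_audio, frames) -> (B, 2, T) stereo waveform and
    video frames whose last dimension is width.
    """
    with torch.no_grad():
        pred = model(mono_audio, frames)             # (B, 2, T)
        flipped = torch.flip(frames, dims=[-1])      # mirror the scene horizontally
        pred_flip = model(mono_audio, flipped)
        pred_flip = torch.flip(pred_flip, dims=[1])  # swap L/R channels back
    return 0.5 * (pred + pred_flip)
```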