🤖 AI Summary
Existing binaural audio generation models over-rely on room-acoustics priors, which costs spatial detail and limits volume/timbre controllability and azimuth estimation accuracy. This work addresses visually guided monaural-to-binaural generation. We propose: (1) a novel audio-visual conditional normalization layer that enables dynamic cross-modal feature alignment; (2) contrastive learning that mines negatives from shuffled visual features to sharpen sensitivity to subtle spatial disparities; and (3) a lightweight test-time video augmentation strategy that improves generalization. Our method is end-to-end trainable and requires no explicit room-parameter estimation. Evaluated on the FAIR-Play and MUSIC-Stereo benchmarks, it achieves state-of-the-art performance, significantly improving the spatial fidelity and audio-visual semantic consistency of the generated binaural audio.
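To make contribution (1) concrete, here is a minimal PyTorch sketch of what a visually conditioned normalization layer could look like, in the spirit of FiLM/AdaIN-style conditioning: audio features are normalized, then rescaled and shifted by parameters predicted from a visual embedding, so the visual context sets their mean and variance. The class name, shapes, and pooling are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class AudioVisualCondNorm(nn.Module):
    """Sketch of a visually conditioned normalization layer (hypothetical).

    Normalizes audio (spectrogram) features, then re-modulates them with a
    per-channel scale and shift predicted from visual features, so the
    visual context controls the features' mean and variance.
    """

    def __init__(self, audio_channels: int, visual_dim: int):
        super().__init__()
        # Parameter-free normalization; the affine part comes from vision.
        self.norm = nn.InstanceNorm2d(audio_channels, affine=False)
        self.to_gamma = nn.Linear(visual_dim, audio_channels)  # scale
        self.to_beta = nn.Linear(visual_dim, audio_channels)   # shift

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor):
        # audio_feat: (B, C, F, T) spectrogram features
        # visual_feat: (B, D) pooled visual embedding
        normed = self.norm(audio_feat)
        gamma = self.to_gamma(visual_feat)[:, :, None, None]  # (B, C, 1, 1)
        beta = self.to_beta(visual_feat)[:, :, None, None]
        return (1 + gamma) * normed + beta

# Toy usage: 64-channel audio features conditioned on a 512-d visual embedding.
layer = AudioVisualCondNorm(audio_channels=64, visual_dim=512)
out = layer(torch.randn(2, 64, 128, 32), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 64, 128, 32])
```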
📝 Abstract
Binaural audio generation (BAG) aims to convert monaural audio to stereo audio using visual prompts, which requires a deep understanding of spatial and semantic information. However, current models risk overfitting to room environments and losing fine-grained spatial details. In this paper, we propose a new audio-visual binaural generation model that incorporates an audio-visual conditional normalisation layer, which dynamically aligns the mean and variance of the target difference-audio features using visual context, together with a new contrastive learning method that enhances spatial sensitivity by mining negative samples from shuffled visual features. We also introduce a cost-efficient way to apply test-time augmentation on video data to improve performance. Our approach achieves state-of-the-art generation accuracy on the FAIR-Play and MUSIC-Stereo benchmarks.
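For the contrastive component, a standard way to realise "mining negative samples from shuffled visual features" is an InfoNCE-style loss: each audio embedding is pulled toward its matched visual embedding while visual embeddings shuffled across the batch serve as negatives. The sketch below assumes pooled audio and visual embeddings of equal dimension, and is not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def shuffled_visual_contrastive_loss(audio_emb: torch.Tensor,
                                     visual_emb: torch.Tensor,
                                     temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style sketch (assumed formulation): the matched visual
    embedding is the positive; visual embeddings from other (i.e. shuffled)
    batch positions act as negatives."""
    a = F.normalize(audio_emb, dim=-1)   # (B, D)
    v = F.normalize(visual_emb, dim=-1)  # (B, D)
    logits = a @ v.t() / temperature     # (B, B); diagonal = matched pairs
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)
```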
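The test-time augmentation is described only as a cost-efficient scheme on video data. One plausible instance, offered purely as an assumption, exploits the left/right symmetry of binaural audio: run the model once on the original frames and once on horizontally flipped frames, swap the flipped prediction's channels back, and average the two. The `model` signature below is hypothetical.

```python
import torch

def tta_binaural(model, mono_audio: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
    """Hypothetical flip-based test-time augmentation for binaural generation.

    Assumes model(mono_audio, frames) -> (B, 2, T) stereo waveform and
    video frames whose last dimension is width.
    """
    with torch.no_grad():
        pred = model(mono_audio, frames)             # (B, 2, T)
        flipped = torch.flip(frames, dims=[-1])      # mirror the scene horizontally
        pred_flip = model(mono_audio, flipped)
        pred_flip = torch.flip(pred_flip, dims=[1])  # swap L/R channels back
    return 0.5 * (pred + pred_flip)
```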