AI Summary
This work addresses the limited spatial awareness and immersion of existing video-to-audio generation methods, a limitation caused primarily by the absence of large-scale datasets containing binaural spatial audio. To overcome this, we introduce BinauralVGGSound, the first large-scale video-binaural audio dataset, and propose an end-to-end vision-guided framework for spatial audio synthesis. Our approach explicitly models spatial cues and enforces cross-modal alignment to generate high-fidelity audio with realistic directional and depth perception. Experimental results show that the proposed method significantly outperforms state-of-the-art models in spatial fidelity while maintaining strong semantic and temporal consistency, thereby substantially enhancing auditory immersion.
Abstract
While video-to-audio generation has achieved remarkable progress in semantic and temporal alignment, most existing studies focus solely on these aspects and pay limited attention to the spatial perception and immersive quality of the synthesized audio. This limitation stems largely from current models' reliance on mono audio datasets, which lack the binaural spatial information needed to learn visual-to-spatial audio mappings. To address this gap, we make two key contributions: first, we construct BinauralVGGSound, the first large-scale video-binaural audio dataset designed to support spatially aware video-to-audio generation; second, we propose an end-to-end spatial audio generation framework guided by visual cues, which explicitly models spatial features. Our framework incorporates a visually guided audio spatialization module that ensures the generated audio exhibits realistic spatial attributes and layered spatial depth while maintaining semantic and temporal alignment. Experiments show that our approach substantially outperforms state-of-the-art models in spatial fidelity and delivers a more immersive auditory experience, without sacrificing temporal or semantic consistency. The demo page is available at https://github.com/renlinjie868-web/SpatialV2A.