SpatialV2A: Visual-Guided High-fidelity Spatial Audio Generation

๐Ÿ“… 2026-01-21
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
This work addresses the limited spatial awareness and immersion in existing video-to-audio generation methods, primarily due to the absence of large-scale datasets containing binaural spatial audio. To overcome this, we introduce BinauralVGGSound, the first large-scale videoโ€“binaural audio dataset, and propose an end-to-end vision-guided framework for spatial audio synthesis. Our approach explicitly models spatial cues and enforces cross-modal alignment to generate high-fidelity audio with realistic directional and depth perception. Experimental results demonstrate that the proposed method significantly outperforms state-of-the-art models in spatial fidelity while maintaining strong semantic and temporal consistency, thereby substantially enhancing auditory immersion.

๐Ÿ“ Abstract
While video-to-audio generation has achieved remarkable progress in semantic and temporal alignment, most existing studies focus solely on these aspects, paying limited attention to the spatial perception and immersive quality of the synthesized audio. This limitation stems largely from current models' reliance on mono audio datasets, which lack the binaural spatial information needed to learn visual-to-spatial audio mappings. To address this gap, we introduce two key contributions: we construct BinauralVGGSound, the first large-scale video-binaural audio dataset designed to support spatially aware video-to-audio generation; and we propose an end-to-end spatial audio generation framework guided by visual cues, which explicitly models spatial features. Our framework incorporates a visual-guided audio spatialization module that ensures the generated audio exhibits realistic spatial attributes and layered spatial depth while maintaining semantic and temporal alignment. Experiments show that our approach substantially outperforms state-of-the-art models in spatial fidelity and delivers a more immersive auditory experience, without sacrificing temporal or semantic consistency. The demo page can be accessed at https://github.com/renlinjie868-web/SpatialV2A.
Problem

Research questions and friction points this paper is trying to address.

spatial audio generation
binaural audio
video-to-audio
spatial perception
immersive audio
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial audio generation
binaural audio
visual-guided synthesis
video-to-audio
audio spatialization
Yanan Wang
School of Computer Science and Technology, Shandong University
Linjie Ren
School of Computer Science and Technology, Shandong University
Zihao Li
China University of Geosciences, Wuhan
Computer Vision · Remote Sensing · Deep Learning
Junyi Wang
University of Electronic Science and Technology of China
Image Registration · MRI
Tian Gan
School of Computer Science and Technology, Shandong University