Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation

📅 2024-10-14
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses language-driven, physically plausible binaural audio generation. It proposes the first spatial audio synthesis framework to support multi-source dynamic sound fields with precise spatial control. The approach introduces a spatially aware encoder and an azimuth state matrix to guide latent diffusion models; constructs BEWO-1M, the first GPT-augmented, simulation-driven, million-scale tri-modal dataset pairing spatial audio with text and images; and combines multimodal retrieval alignment with GPT-assisted data synthesis. Experiments show that the method significantly outperforms existing approaches in both objective and subjective evaluations, and that the generated audio adheres to acoustic physics, enabling high-fidelity, accurately localized, trajectory-controllable immersive spatial audio. The framework targets two key bottlenecks: weak multi-source spatial modeling and the scarcity of high-quality spatial audio training data.
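
A minimal sketch of the azimuth-state-matrix idea described above, assuming one plausible format: a time-by-azimuth-bin matrix that traces a (possibly moving) source and can be fed to a latent diffusion model as spatial conditioning. The bin count, frame count, Gaussian smoothing, and the name azimuth_state_matrix are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical azimuth state matrix: rows are time frames, columns are
# azimuth bins; each row holds a soft (Gaussian) bump at the source's
# current direction. The exact format in SpatialSonic is not given here.
import numpy as np

def azimuth_state_matrix(azimuths_deg, num_frames=128, num_bins=36, sigma=1.5):
    """Encode an azimuth trajectory as a (num_frames, num_bins) matrix.

    azimuths_deg: azimuth keypoints in [-90, 90] degrees (negative = left),
    linearly interpolated over time to model a moving source.
    """
    keypoints = np.asarray(azimuths_deg, dtype=float)
    t_key = np.linspace(0.0, 1.0, len(keypoints))
    t_all = np.linspace(0.0, 1.0, num_frames)
    trajectory = np.interp(t_all, t_key, keypoints)      # degrees per frame

    bin_width = 180.0 / (num_bins - 1)                   # degrees per bin
    frame_pos = (trajectory + 90.0) / bin_width          # fractional bin index
    bins = np.arange(num_bins)

    # Soft one-hot per frame: Gaussian bump centred on the active bin.
    matrix = np.exp(-0.5 * ((bins[None, :] - frame_pos[:, None]) / sigma) ** 2)
    return matrix / matrix.sum(axis=1, keepdims=True)    # rows sum to 1

# Example: one source sweeping from hard left to hard right.
M = azimuth_state_matrix([-90.0, 90.0])
print(M.shape)  # (128, 36)
```

A multi-source scene would stack one such matrix per source (or sum them), which is one way the "multi-source dynamic sound fields" above could be encoded.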

πŸ“ Abstract
Recently, diffusion models have achieved great success in mono-channel audio generation. Stereo audio generation is harder: soundscapes often form complex scenes with multiple objects and directions, and controlling stereo audio with spatial context remains challenging due to high data costs and unstable generative models. To the best of our knowledge, this work represents the first attempt to address these issues. We first construct BEWO-1M, a large-scale, simulation-based, GPT-assisted dataset with abundant soundscapes and descriptions, including moving and multiple sources. Beyond the text modality, we also acquire images and rationally paired stereo audio through retrieval to advance multimodal generation. Existing audio generation models tend to produce spatially random and indistinct audio. To provide accurate guidance for latent diffusion models, we introduce the SpatialSonic model, which uses spatial-aware encoders and azimuth state matrices to produce reliable spatial guidance. Leveraging this guidance, our model not only generates immersive and controllable spatial audio from text but also extends, as a pioneering attempt, to other modalities. Finally, under fair settings, we conduct subjective and objective evaluations on simulated and real-world data to compare our approach with prevailing methods. The results demonstrate the effectiveness of our method and highlight its capability to generate spatial audio that adheres to physical rules.
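
To make the abstract's "adheres to physical rules" concrete, below is a minimal sketch of physics-based stereo panning of the kind a simulation-driven dataset such as BEWO-1M could build on: placing a mono source at a given azimuth via interaural time and level differences (ITD/ILD). The Woodworth ITD approximation and constant-power level panning used here are textbook simplifications chosen for illustration; the paper's actual simulator is not described in this card.

```python
# Hedged sketch: mono -> stereo placement at a fixed azimuth using
# ITD (Woodworth spherical-head approximation) and a constant-power ILD.
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
HEAD_RADIUS = 0.0875     # m, average human head

def render_binaural(mono, azimuth_deg, sr=16000):
    """Return a (2, N) stereo array from mono audio and an azimuth in
    [-90, 90] degrees (negative = left of the listener)."""
    theta = np.deg2rad(azimuth_deg)
    # Woodworth ITD: r/c * (theta + sin(theta)) for a spherical head.
    itd = HEAD_RADIUS / SPEED_OF_SOUND * (abs(theta) + np.sin(abs(theta)))
    shift = int(round(itd * sr))                       # delay in samples

    # Constant-power level panning: the far ear is attenuated.
    gain_near = np.sqrt(0.5 * (1.0 + abs(np.sin(theta))))
    gain_far = np.sqrt(0.5 * (1.0 - abs(np.sin(theta))))

    delayed = np.concatenate([np.zeros(shift), mono])[: len(mono)]
    if azimuth_deg >= 0:   # source on the right: left ear is far and delayed
        left, right = gain_far * delayed, gain_near * mono
    else:                  # source on the left: right ear is far and delayed
        left, right = gain_near * mono, gain_far * delayed
    return np.stack([left, right])

# Example: a 1-second 440 Hz tone placed 45 degrees to the right.
sr = 16000
t = np.arange(sr) / sr
stereo = render_binaural(np.sin(2 * np.pi * 440 * t), azimuth_deg=45.0, sr=sr)
print(stereo.shape)  # (2, 16000)
```

A moving source, as supported by BEWO-1M's descriptions, would interpolate azimuth_deg over time and render frame by frame; a full pipeline would use HRTFs or a room simulator instead of these closed-form approximations.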
Problem

Research questions and friction points this paper is trying to address.

Generating immersive spatial audio from text
Overcoming high data costs in stereo audio
Ensuring spatial accuracy in audio generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPT-assisted dataset construction
SpatialSonic model with azimuth matrices
Multimodal spatial audio generation
🔎 Similar Papers
No similar papers found.
Peiwen Sun
Multimedia Lab, The Chinese University of Hong Kong
multimodal learning
Sitong Cheng
Hong Kong University of Science and Technology
Xiangtai Li
Research Scientist, TikTok, SG; MMLab@NTU
Generative AI, Computer Vision
Zhen Ye
Hong Kong University of Science and Technology
Huadai Liu
Zhejiang University
Honggang Zhang
Beijing University of Posts and Telecommunications
Wei Xue
Hong Kong University of Science and Technology
Yike Guo
Hong Kong University of Science and Technology