OmniAudio: Generating Spatial Audio from 360-Degree Video

📅 2025-04-21
📈 Citations: 0 · Influential: 0
📄 PDF
🤖 AI Summary
Traditional video-to-audio generation methods lack explicit 3D modeling of sound sources. To address this, we introduce the task of "360-degree Video-to-Spatial Audio" (360V2SA): synthesizing first-order Ambisonics (FOA) spatial audio from omnidirectional 360-degree video. We present Sphere360, the first large-scale paired dataset of 360-degree video and spatial audio, built with a semi-automated pipeline for collecting and cleaning video-audio pairs. Our framework, OmniAudio, combines a dual-branch video encoder that fuses panoramic and field-of-view (FoV) features with joint self-supervised pre-training on spatial (FOA) and large-scale non-spatial audio. Evaluated on Sphere360, OmniAudio achieves state-of-the-art performance on both objective and subjective metrics. Code and the Sphere360 dataset will be released.
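The summary assumes familiarity with FOA: a four-channel format with one omnidirectional component (W) plus three orthogonal directional components. As a minimal sketch of what the model must produce, the snippet below encodes a mono source at a given direction into FOA, assuming the common ACN channel order (W, Y, Z, X) with SN3D normalization; the paper does not state which convention it uses, and `encode_foa` is an illustrative helper, not the authors' code.

```python
import numpy as np

def encode_foa(mono, azimuth, elevation):
    """Encode a mono signal into first-order Ambisonics (FOA).

    Assumes ACN channel order (W, Y, Z, X) with SN3D normalization,
    one common FOA convention; angles are in radians.
    """
    w = mono                                         # omnidirectional component
    y = mono * np.sin(azimuth) * np.cos(elevation)   # left-right axis
    z = mono * np.sin(elevation)                     # up-down axis
    x = mono * np.cos(azimuth) * np.cos(elevation)   # front-back axis
    return np.stack([w, y, z, x], axis=0)            # shape (4, num_samples)

# Example: a 440 Hz tone placed 90 degrees to the left, at ear level.
sr = 16000
t = np.arange(sr) / sr
foa = encode_foa(np.sin(2 * np.pi * 440 * t), azimuth=np.pi / 2, elevation=0.0)
print(foa.shape)  # (4, 16000)
```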

📝 Abstract
Traditional video-to-audio generation techniques primarily focus on field-of-view (FoV) video and non-spatial audio, often missing the spatial cues necessary for accurately representing sound sources in 3D environments. To address this limitation, we introduce a novel task, 360V2SA, to generate spatial audio from 360-degree videos, specifically producing First-order Ambisonics (FOA) audio - a standard format for representing 3D spatial audio that captures sound directionality and enables realistic 3D audio reproduction. We first create Sphere360, a novel dataset tailored for this task that is curated from real-world data. We also design an efficient semi-automated pipeline for collecting and cleaning paired video-audio data. To generate spatial audio from 360-degree video, we propose a novel framework OmniAudio, which leverages self-supervised pre-training using both spatial audio data (in FOA format) and large-scale non-spatial data. Furthermore, OmniAudio features a dual-branch framework that utilizes both panoramic and FoV video inputs to capture comprehensive local and global information from 360-degree videos. Experimental results demonstrate that OmniAudio achieves state-of-the-art performance across both objective and subjective metrics on Sphere360. Code and datasets will be released at https://github.com/liuhuadai/OmniAudio. The demo page is available at https://OmniAudio-360V2SA.github.io.
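The abstract mentions the semi-automated collection and cleaning pipeline without detailing its criteria. Purely as an assumed illustration of one plausible filtering stage, a cleaner might reject clips whose audio is near-silent or whose directional channels carry no energy (i.e., the recording is effectively non-spatial); the thresholds and the `looks_spatial` helper below are hypothetical, not the paper's actual rules.

```python
import numpy as np

def looks_spatial(foa, silence_rms=1e-4, min_directional_ratio=0.01):
    """Heuristic filter for candidate FOA clips (hypothetical sketch).

    foa: array of shape (4, num_samples) in ACN order (W, Y, Z, X).
    Rejects near-silent clips and clips whose directional channels
    (Y, Z, X) hold almost no energy relative to the omni channel W,
    which suggests the audio is effectively mono, not spatial.
    """
    w_rms = np.sqrt(np.mean(foa[0] ** 2))
    if w_rms < silence_rms:                        # near-silent clip
        return False
    dir_rms = np.sqrt(np.mean(foa[1:] ** 2))       # energy in Y/Z/X
    return dir_rms / w_rms > min_directional_ratio
```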
Problem

Research questions and friction points this paper is trying to address.

Existing video-to-audio methods handle only FoV video and non-spatial audio, missing the cues needed to localize sound sources in 3D
Generating First-order Ambisonics (FOA) audio directly from 360-degree video
Scarcity of paired 360-degree video and spatial audio data for training
Innovation

Methods, ideas, or system contributions that make the work stand out.

OmniAudio framework that generates FOA spatial audio from 360-degree video
Joint self-supervised pre-training on spatial (FOA) and large-scale non-spatial audio
Dual-branch design that fuses panoramic and FoV video inputs (see the sketch after this list)
Sphere360 dataset built with a semi-automated collection and cleaning pipeline
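The paper's architectural details are not reproduced on this page, so the following is only a minimal sketch of the dual-branch idea under stated assumptions: one branch encodes panoramic (equirectangular) frame features for global context, another encodes FoV-crop features for local detail, and the two are fused by concatenation. The module names, dimensions, and fusion choice are all assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualBranchVideoEncoder(nn.Module):
    """Hypothetical sketch of a dual-branch 360-degree video encoder."""

    def __init__(self, feat_dim=512):
        super().__init__()
        # Placeholder projections; a real system would use pretrained video backbones.
        self.pano_branch = nn.Sequential(nn.LazyLinear(feat_dim), nn.ReLU())
        self.fov_branch = nn.Sequential(nn.LazyLinear(feat_dim), nn.ReLU())
        self.fusion = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, pano_feats, fov_feats):
        # pano_feats / fov_feats: (batch, time, in_dim) frame-level features.
        g = self.pano_branch(pano_feats)  # global panoramic context
        l = self.fov_branch(fov_feats)    # local FoV detail
        return self.fusion(torch.cat([g, l], dim=-1))

# Example: fuse 16 frames of panoramic and FoV features per clip.
pano = torch.randn(2, 16, 1024)
fov = torch.randn(2, 16, 768)
fused = DualBranchVideoEncoder()(pano, fov)
print(fused.shape)  # torch.Size([2, 16, 512])
```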
👥 Authors
Huadai Liu · Zhejiang University
Tianyi Luo · Zhejiang University
Qikai Jiang · Zhejiang University
Kaicheng Luo · Zhejiang University
Peiwen Sun · Multimedia Lab, The Chinese University of Hong Kong (multimodal learning)
Jialei Wan · Zhejiang University
Rongjie Huang · FAIR; Zhejiang University (Multimedia Computing, Speech, Natural Language Processing)
Qian Chen · Tongyi Lab, Alibaba Group
Wen Wang · Tongyi Lab, Alibaba Group
Xiangtai Li · Research Scientist, TikTok, SG; MMLab@NTU (Generative AI, Computer Vision)
Shiliang Zhang · Department of Computer Science, School of EECS, Peking University (Multimedia Information Retrieval, Multimedia Systems, Visual Search)
Zhijie Yan · Tongyi Lab, Alibaba Group
Zhou Zhao · Zhejiang University (Machine Learning, Data Mining, Multimedia Computing)
Wei Xue · Hong Kong University of Science and Technology