OmniAudio: Generating Spatial Audio from 360-Degree Video

📅 2025-04-21
📈 Citations: 0 · Influential: 0
📄 PDF
🤖 AI Summary
Traditional video-to-audio generation methods lack explicit 3D modeling of sound sources. To address this, we introduce the task of "360-degree Video-to-Spatial Audio" (360V2SA): synthesizing first-order Ambisonics (FOA) spatial audio from omnidirectional 360-degree video. We present Sphere360, the first large-scale paired dataset of 360-degree video and spatial audio, built with a semi-automated pipeline for collecting and cleaning video-audio pairs. Our framework, OmniAudio, combines a dual-branch video encoder that fuses panoramic and field-of-view (FoV) features with joint self-supervised pre-training on spatial (FOA) and large-scale non-spatial audio. Evaluated on Sphere360, OmniAudio achieves state-of-the-art performance on both objective and subjective metrics. Code and the Sphere360 dataset will be released.
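The summary assumes familiarity with FOA: a four-channel format with one omnidirectional component (W) plus three orthogonal directional components. As a minimal sketch of what the model must produce, the snippet below encodes a mono source at a given direction into FOA, assuming the common ACN channel order (W, Y, Z, X) with SN3D normalization; the paper does not state which convention it uses, and `encode_foa` is an illustrative helper, not the authors' code.

```python
import numpy as np

def encode_foa(mono, azimuth, elevation):
    """Encode a mono signal into first-order Ambisonics (FOA).

    Assumes ACN channel order (W, Y, Z, X) with SN3D normalization,
    one common FOA convention; angles are in radians.
    """
    w = mono                                         # omnidirectional component
    y = mono * np.sin(azimuth) * np.cos(elevation)   # left-right axis
    z = mono * np.sin(elevation)                     # up-down axis
    x = mono * np.cos(azimuth) * np.cos(elevation)   # front-back axis
    return np.stack([w, y, z, x], axis=0)            # shape (4, num_samples)

# Example: a 440 Hz tone placed 90 degrees to the left, at ear level.
sr = 16000
t = np.arange(sr) / sr
foa = encode_foa(np.sin(2 * np.pi * 440 * t), azimuth=np.pi / 2, elevation=0.0)
print(foa.shape)  # (4, 16000)
```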

📝 Abstract
Traditional video-to-audio generation techniques primarily focus on field-of-view (FoV) video and non-spatial audio, often missing the spatial cues necessary for accurately representing sound sources in 3D environments. To address this limitation, we introduce a novel task, 360V2SA, to generate spatial audio from 360-degree videos, specifically producing First-order Ambisonics (FOA) audio - a standard format for representing 3D spatial audio that captures sound directionality and enables realistic 3D audio reproduction. We first create Sphere360, a novel dataset tailored for this task that is curated from real-world data. We also design an efficient semi-automated pipeline for collecting and cleaning paired video-audio data. To generate spatial audio from 360-degree video, we propose a novel framework OmniAudio, which leverages self-supervised pre-training using both spatial audio data (in FOA format) and large-scale non-spatial data. Furthermore, OmniAudio features a dual-branch framework that utilizes both panoramic and FoV video inputs to capture comprehensive local and global information from 360-degree videos. Experimental results demonstrate that OmniAudio achieves state-of-the-art performance across both objective and subjective metrics on Sphere360. Code and datasets will be released at https://github.com/liuhuadai/OmniAudio. The demo page is available at https://OmniAudio-360V2SA.github.io.
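The abstract mentions the semi-automated collection and cleaning pipeline without detailing its criteria. Purely as an assumed illustration of one plausible filtering stage, a cleaner might reject clips whose audio is near-silent or whose directional channels carry no energy (i.e., the recording is effectively non-spatial); the thresholds and the `looks_spatial` helper below are hypothetical, not the paper's actual rules.

```python
import numpy as np

def looks_spatial(foa, silence_rms=1e-4, min_directional_ratio=0.01):
    """Heuristic filter for candidate FOA clips (hypothetical sketch).

    foa: array of shape (4, num_samples) in ACN order (W, Y, Z, X).
    Rejects near-silent clips and clips whose directional channels
    (Y, Z, X) hold almost no energy relative to the omni channel W,
    which suggests the audio is effectively mono, not spatial.
    """
    w_rms = np.sqrt(np.mean(foa[0] ** 2))
    if w_rms < silence_rms:                        # near-silent clip
        return False
    dir_rms = np.sqrt(np.mean(foa[1:] ** 2))       # energy in Y/Z/X
    return dir_rms / w_rms > min_directional_ratio
```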
Problem

Research questions and friction points this paper is trying to address.

Existing video-to-audio methods handle only FoV video and non-spatial audio, missing the cues needed to localize sound sources in 3D
Generating First-order Ambisonics (FOA) audio directly from 360-degree video
Scarcity of paired 360-degree video and spatial audio data for training
Innovation

Methods, ideas, or system contributions that make the work stand out.

OmniAudio framework that generates FOA spatial audio from 360-degree video
Joint self-supervised pre-training on spatial (FOA) and large-scale non-spatial audio
Dual-branch design that fuses panoramic and FoV video inputs (see the sketch after this list)
Sphere360 dataset built with a semi-automated collection and cleaning pipeline
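The paper's architectural details are not reproduced on this page, so the following is only a minimal sketch of the dual-branch idea under stated assumptions: one branch encodes panoramic (equirectangular) frame features for global context, another encodes FoV-crop features for local detail, and the two are fused by concatenation. The module names, dimensions, and fusion choice are all assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualBranchVideoEncoder(nn.Module):
    """Hypothetical sketch of a dual-branch 360-degree video encoder."""

    def __init__(self, feat_dim=512):
        super().__init__()
        # Placeholder projections; a real system would use pretrained video backbones.
        self.pano_branch = nn.Sequential(nn.LazyLinear(feat_dim), nn.ReLU())
        self.fov_branch = nn.Sequential(nn.LazyLinear(feat_dim), nn.ReLU())
        self.fusion = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, pano_feats, fov_feats):
        # pano_feats / fov_feats: (batch, time, in_dim) frame-level features.
        g = self.pano_branch(pano_feats)  # global panoramic context
        l = self.fov_branch(fov_feats)    # local FoV detail
        return self.fusion(torch.cat([g, l], dim=-1))

# Example: fuse 16 frames of panoramic and FoV features per clip.
pano = torch.randn(2, 16, 1024)
fov = torch.randn(2, 16, 768)
fused = DualBranchVideoEncoder()(pano, fov)
print(fused.shape)  # torch.Size([2, 16, 512])
```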
👥 Authors
Huadai Liu · Zhejiang University
Tianyi Luo · Zhejiang University
Qikai Jiang · Zhejiang University
Kaicheng Luo · Zhejiang University
Peiwen Sun · Multimedia Lab, The Chinese University of Hong Kong (multimodal learning)
Jialei Wan · Zhejiang University
Rongjie Huang · FAIR; Zhejiang University (Multimedia Computing, Speech, Natural Language Processing)
Qian Chen · Tongyi Lab, Alibaba Group
Wen Wang · Tongyi Lab, Alibaba Group
Xiangtai Li · Research Scientist, TikTok, SG; MMLab@NTU (Generative AI, Computer Vision)
Shiliang Zhang · Department of Computer Science, School of EECS, Peking University (Multimedia Information Retrieval, Multimedia Systems, Visual Search)
Zhijie Yan · Tongyi Lab, Alibaba Group
Zhou Zhao · Zhejiang University (Machine Learning, Data Mining, Multimedia Computing)
Wei Xue · Hong Kong University of Science and Technology