ViSAGe: Video-to-Spatial Audio Generation

📅 2025-06-13
🏛️ International Conference on Learning Representations
📈 Citations: 1
Influential: 1
🤖 AI Summary
This work tackles end-to-end generation of first-order ambisonics (FOA), a widely used spatial audio format, directly from silent videos, lowering the barrier to immersive audiovisual production. The authors introduce YT-Ambigen, a large-scale dataset of 102K five-second YouTube clips paired with corresponding FOA audio, and propose ViSAGe, an autoregressive neural audio codec framework that conditions on CLIP visual features with both directional and visual guidance to synthesize FOA from video frames in a single stage. They also design a new evaluation protocol based on audio energy maps and saliency metrics, which jointly assesses spatial fidelity and consistency with the dynamic visual scene. Experiments show that the method outperforms two-stage baselines (video-to-audio generation followed by spatialization) on both objective metrics and subjective listening tests, with clear gains in sound-source localization accuracy and adaptability to viewpoint changes.

📝 Abstract
Spatial audio is essential for enhancing the immersiveness of audio-visual experiences, yet its production typically demands complex recording systems and specialized expertise. In this work, we address the novel problem of generating first-order ambisonics, a widely used spatial audio format, directly from silent videos. To support this task, we introduce YT-Ambigen, a dataset comprising 102K 5-second YouTube video clips paired with corresponding first-order ambisonics. We also propose new evaluation metrics to assess the spatial aspect of generated audio based on audio energy maps and saliency metrics. Furthermore, we present Video-to-Spatial Audio Generation (ViSAGe), an end-to-end framework that generates first-order ambisonics from silent video frames by leveraging CLIP visual features and autoregressive neural audio codec modeling with both directional and visual guidance. Experimental results demonstrate that ViSAGe produces plausible and coherent first-order ambisonics, outperforming two-stage approaches consisting of video-to-audio generation and audio spatialization. Qualitative examples further illustrate that ViSAGe generates temporally aligned, high-quality spatial audio that adapts to viewpoint changes.
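For readers unfamiliar with the target format: first-order ambisonics represents a sound field with four channels, an omnidirectional component W plus three orthogonal figure-of-eight components X, Y, and Z. A mono source arriving from a given direction is encoded by weighting the signal with that direction's spherical-harmonic gains. The sketch below illustrates this using the ambiX convention (ACN channel order, SN3D normalization), which YouTube spatial audio uses; the function name and sample rate are illustrative, not from the paper.

```python
import numpy as np

def encode_foa(mono: np.ndarray, azimuth: float, elevation: float) -> np.ndarray:
    """Encode a mono signal into first-order ambisonics (ambiX: ACN order, SN3D).

    azimuth: radians, counter-clockwise from front; elevation: radians, up positive.
    Returns an array of shape (4, n_samples) with channels [W, Y, Z, X].
    """
    w = mono                                        # omnidirectional component
    y = mono * np.sin(azimuth) * np.cos(elevation)  # left-right figure-of-eight
    z = mono * np.sin(elevation)                    # up-down figure-of-eight
    x = mono * np.cos(azimuth) * np.cos(elevation)  # front-back figure-of-eight
    return np.stack([w, y, z, x])

# Example: a 440 Hz tone placed 45 degrees to the left, at ear level.
sr = 16000
t = np.arange(sr) / sr
foa = encode_foa(np.sin(2 * np.pi * 440 * t), azimuth=np.radians(45), elevation=0.0)
print(foa.shape)  # (4, 16000)
```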
Problem

Research questions and friction points this paper is trying to address.

Generating convincing spatial audio from silent videos
Producing first-order ambisonics without complex recording systems or specialized expertise
Evaluating the spatial quality of generated audio, for which established metrics are lacking (see the energy-map sketch after this list)
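The proposed spatial metrics compare audio energy maps against visual saliency. One simple way to obtain such a map is to steer a first-order beam toward each direction on an azimuth-elevation grid and average the beamformed power over time. The minimal sketch below does exactly that, assuming the ambiX channel layout from the previous snippet, and scores agreement with a saliency map via Pearson correlation; the exact beam weights, grid resolution, and matching score used in the paper may differ.

```python
import numpy as np

def energy_map(foa: np.ndarray, n_az: int = 32, n_el: int = 16) -> np.ndarray:
    """Directional energy map from FOA (channels [W, Y, Z, X], ambiX).

    Steers a simple first-order beam toward each direction on an
    azimuth x elevation grid and averages the beamformed power over time.
    Returns an (n_el, n_az) map normalized to sum to 1.
    """
    w, y, z, x = foa
    az = np.linspace(-np.pi, np.pi, n_az, endpoint=False)
    el = np.linspace(-np.pi / 2, np.pi / 2, n_el)
    emap = np.empty((n_el, n_az))
    for i, phi in enumerate(el):
        for j, theta in enumerate(az):
            beam = (w
                    + y * np.sin(theta) * np.cos(phi)
                    + z * np.sin(phi)
                    + x * np.cos(theta) * np.cos(phi))
            emap[i, j] = np.mean(beam ** 2)
    return emap / emap.sum()

def saliency_match(audio_map: np.ndarray, saliency_map: np.ndarray) -> float:
    """Pearson correlation between aligned audio-energy and visual-saliency maps
    (one plausible matching score; not necessarily the paper's)."""
    a = audio_map.ravel()
    s = saliency_map.ravel()
    return float(np.corrcoef(a, s)[0, 1])
```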
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end generation of first-order ambisonics from silent video frames
CLIP visual features conditioning an autoregressive neural audio codec with directional and visual guidance (sketched after this list)
The YT-Ambigen dataset (102K clips) and new energy-map and saliency-based evaluation metrics
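At a high level, the pipeline encodes video frames with CLIP, autoregressively predicts discrete neural-audio-codec tokens conditioned on those features, and finally decodes the tokens to FOA waveforms with the codec's decoder. The sketch below is a minimal stand-in for that loop: layer sizes, vocabulary size, greedy decoding, and the single token stream are all assumptions for illustration, and positional encodings plus the codec decoder are omitted for brevity; the actual model additionally injects directional guidance and produces token streams covering all four FOA channels.

```python
import torch
import torch.nn as nn

class ViSAGeSketch(nn.Module):
    """Minimal sketch of a ViSAGe-style generator (sizes/names are assumptions):
    CLIP frame features condition a transformer decoder that autoregressively
    predicts discrete neural-audio-codec tokens; a codec decoder (not shown)
    would turn those tokens into FOA waveforms."""

    def __init__(self, n_tokens: int = 1024, d: int = 512):
        super().__init__()
        self.visual_proj = nn.Linear(512, d)            # project CLIP features (512-d for ViT-B/32)
        self.token_emb = nn.Embedding(n_tokens + 1, d)  # +1 slot for a BOS token
        layer = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(d, n_tokens)
        self.bos = n_tokens

    @torch.no_grad()
    def generate(self, clip_feats: torch.Tensor, steps: int = 100) -> torch.Tensor:
        """clip_feats: (B, T_frames, 512). Returns codec token ids of shape (B, steps)."""
        memory = self.visual_proj(clip_feats)           # visual conditioning via cross-attention
        tokens = torch.full((clip_feats.shape[0], 1), self.bos, dtype=torch.long)
        for _ in range(steps):
            h = self.decoder(self.token_emb(tokens), memory)
            next_tok = self.head(h[:, -1]).argmax(-1, keepdim=True)  # greedy for brevity
            tokens = torch.cat([tokens, next_tok], dim=1)
        return tokens[:, 1:]                            # drop the BOS token

model = ViSAGeSketch()
video_feats = torch.randn(1, 40, 512)   # stand-in for CLIP features of 40 frames
codec_tokens = model.generate(video_feats, steps=20)
print(codec_tokens.shape)  # torch.Size([1, 20])
```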