SoundReactor: Frame-level Online Video-to-Audio Generation

📅 2025-10-02
🤖 AI Summary
Existing video-to-audio (V2A) models operate offline, which rules out real-time interactive applications. This work introduces the first end-to-end causal framework for frame-level online V2A generation: a decoder-only causal Transformer models continuous audio latents, DINOv2 grid features are aggregated into a single visual token per frame for conditioning, and diffusion pre-training followed by consistency fine-tuning accelerates inference while preserving strict temporal causality and low per-frame latency. On gameplay videos from AAA titles, the method generates semantically and temporally aligned, high-quality full-band stereo audio, validated by both objective metrics and human evaluations. It attains a per-frame waveform-level latency of 26.3 ms (head NFE = 1), enabling real-time inference on 480p, 30 FPS videos with a single H100 GPU, and points toward applications in generative world models and live-stream content creation.
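As a quick sanity check on the real-time claim, the reported per-frame latencies can be compared against the time budget one video frame allows at 30 FPS:

```python
# Back-of-the-envelope check: does per-frame latency fit the 30 FPS budget?
fps = 30
frame_budget_ms = 1000 / fps  # ~33.3 ms available per frame

latency_nfe1_ms = 26.3  # reported, diffusion head NFE = 1
latency_nfe4_ms = 31.5  # reported, diffusion head NFE = 4

# Both settings finish within one frame interval, i.e. real-time at 30 FPS.
assert latency_nfe1_ms < frame_budget_ms
assert latency_nfe4_ms < frame_budget_ms
print(frame_budget_ms - latency_nfe1_ms)  # remaining headroom in ms
```

Note the NFE = 4 setting leaves under 2 ms of headroom per frame, so the consistency fine-tuning that makes NFE = 1 viable is what gives the system comfortable real-time margin.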

📝 Abstract
Prevailing Video-to-Audio (V2A) generation models operate offline, assuming an entire video sequence or chunks of frames are available beforehand. This critically limits their use in interactive applications such as live content creation and emerging generative world models. To address this gap, we introduce the novel task of frame-level online V2A generation, where a model autoregressively generates audio from video without access to future video frames. Furthermore, we propose SoundReactor, which, to the best of our knowledge, is the first simple yet effective framework explicitly tailored for this task. Our design enforces end-to-end causality and targets low per-frame latency with audio-visual synchronization. Our model's backbone is a decoder-only causal transformer over continuous audio latents. For vision conditioning, it leverages grid (patch) features extracted from the smallest variant of the DINOv2 vision encoder, which are aggregated into a single token per frame to maintain end-to-end causality and efficiency. The model is trained through a diffusion pre-training followed by consistency fine-tuning to accelerate the diffusion head decoding. On a benchmark of diverse gameplay videos from AAA titles, our model successfully generates semantically and temporally aligned, high-quality full-band stereo audio, validated by both objective and human evaluations. Furthermore, our model achieves low per-frame waveform-level latency (26.3ms with the head NFE=1, 31.5ms with NFE=4) on 30FPS, 480p videos using a single H100. Demo samples are available at https://koichi-saito-sony.github.io/soundreactor/.
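The abstract's per-frame visual conditioning (grid features from DINOv2-small collapsed into a single token per frame) can be sketched as follows. Mean pooling is used here purely as a placeholder, since the summary does not specify the actual aggregation mechanism; the patch-grid size is also illustrative:

```python
import numpy as np

def frame_token(grid_features: np.ndarray) -> np.ndarray:
    """Collapse one frame's DINOv2 grid (patch) features into a single visual token.

    grid_features: (num_patches, dim) array for a single video frame.
    Mean pooling is a stand-in; the paper's aggregation may differ.
    """
    return grid_features.mean(axis=0)

# DINOv2-small (ViT-S/14) produces 384-dim patch embeddings; the patch count
# here (16 x 16) is an arbitrary example, not the real 480p grid size.
grid = np.random.randn(16 * 16, 384)
token = frame_token(grid)
print(token.shape)  # (384,)
```

Collapsing each frame to one token keeps the transformer's sequence length linear in the number of frames, which matters for both causality and per-frame latency.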
Problem

Research questions and friction points this paper is trying to address.

Offline V2A models assume future frames are available, blocking interactive use
Audio must be generated frame by frame, conditioned only on past video frames
Low per-frame latency and audio-visual synchronization must hold simultaneously
Innovation

Methods, ideas, or system contributions that make the work stand out.

Frame-level online video-to-audio generation
Causal transformer with DINOv2 vision features
Diffusion pre-training with consistency fine-tuning