OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text

📅 2026-04-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video-driven audio generation methods struggle to jointly model in-frame and off-frame environmental sounds alongside human speech, lacking the capability to synthesize complete auditory scenes. This work introduces the UniHAGen task and the OmniSonic framework, which for the first time enables unified generation of these three sound categories. Built upon a flow-matching diffusion model, OmniSonic incorporates a TriAttn-DiT architecture with triple cross-attention and a Mixture-of-Experts (MoE) gating mechanism to adaptively fuse multimodal video and text conditions. The study also establishes UniHAGen-Bench, the first comprehensive benchmark encompassing both speech and environmental sounds. Experimental results demonstrate that the proposed method significantly outperforms existing approaches in both objective metrics and human evaluations, setting a strong baseline for full-scene audio generation.
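The summary notes that OmniSonic is built on a flow-matching diffusion model. As a minimal NumPy sketch of the general flow-matching (rectified-flow-style) training objective — not the paper's actual implementation; the toy linear "velocity network" and all shapes are illustrative assumptions — the model is trained to regress the constant velocity along a straight line from noise to data:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # latent dimension (illustrative)

# Toy "velocity network": a linear map on [x_t, t].
# A stand-in for the paper's DiT backbone, purely for illustration.
W = rng.normal(size=(d + 1, d)) * 0.1

def v_theta(x_t, t):
    """Predict velocity given the interpolated latent and the time step."""
    inp = np.concatenate([x_t, np.full((len(x_t), 1), t)], axis=1)
    return inp @ W

def flow_matching_loss(x1, t):
    """Flow-matching objective on the linear path noise -> data."""
    x0 = rng.normal(size=x1.shape)   # noise sample
    x_t = (1 - t) * x0 + t * x1      # point on the interpolation path
    target = x1 - x0                 # constant velocity along that path
    pred = v_theta(x_t, t)
    return np.mean((pred - target) ** 2)

x1 = rng.normal(size=(32, d))        # batch of "data" latents
loss = flow_matching_loss(x1, t=0.5)
```

At inference, the learned velocity field is integrated from a noise sample toward a data sample, conditioned (in OmniSonic's case) on video and text.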
📝 Abstract
In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. Recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sound, but they are limited to non-speech sounds, lacking the ability to generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen-Bench, a new benchmark with over one thousand samples covering three representative on/off-screen speech-environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Project page: https://weiguopian.github.io/OmniSonic_webpage/
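The abstract describes a TriAttn-DiT block: three parallel cross-attention branches (on-screen environment, off-screen environment, speech) whose outputs are fused by an MoE-style gate. The following NumPy sketch illustrates that fusion pattern only; it is not the paper's architecture, and all shapes, stream names, and the single-head attention are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv, wq, wk, wv):
    """Single-head cross-attention: latent queries attend to one condition stream."""
    Q, K, V = q @ wq, kv @ wk, kv @ wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
d = 16                                     # hidden size (illustrative)
latent = rng.normal(size=(10, d))          # audio latent tokens
conds = {                                  # three hypothetical condition streams
    "onscreen_env": rng.normal(size=(6, d)),
    "offscreen_env": rng.normal(size=(4, d)),
    "speech": rng.normal(size=(5, d)),
}

# One cross-attention branch per condition stream (the "triple" in TriAttn).
branch_out = []
for c in conds.values():
    wq, wk, wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
    branch_out.append(cross_attention(latent, c, wq, wk, wv))
branch_out = np.stack(branch_out, axis=0)  # (3, tokens, d)

# MoE-style gate: per-token softmax weights over the three branches,
# so each latent token adaptively mixes the three condition signals.
w_gate = rng.normal(size=(d, 3)) * 0.1
gates = softmax(latent @ w_gate, axis=-1)  # (tokens, 3), rows sum to 1
fused = np.einsum("tb,btd->td", gates, branch_out)
```

Because the gate is computed per token, a frame dominated by visible events can weight the on-screen branch heavily while speech-bearing segments lean on the speech branch, which matches the adaptive balancing the abstract describes.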
Problem

Research questions and friction points this paper is trying to address.

audio generation
on-screen sound
off-screen sound
speech synthesis
video-to-audio
Innovation

Methods, ideas, or system contributions that make the work stand out.

flow-matching diffusion
TriAttn-DiT
Mixture-of-Experts
holistic audio generation
text-video-to-audio