🤖 AI Summary
Existing video-driven audio generation methods struggle to jointly model on-screen and off-screen environmental sounds alongside human speech, lacking the capability to synthesize complete auditory scenes. This work introduces the UniHAGen task and the OmniSonic framework, which for the first time enables unified generation of these three sound categories. Built on a flow-matching diffusion model, OmniSonic incorporates a TriAttn-DiT architecture with triple cross-attention and a Mixture-of-Experts (MoE) gating mechanism to adaptively fuse multimodal video and text conditions. The study also establishes UniHAGen-Bench, the first comprehensive benchmark encompassing both speech and environmental sounds. Experimental results demonstrate that the proposed method significantly outperforms existing approaches on both objective metrics and human evaluations, setting a strong baseline for full-scene audio generation.
📝 Abstract
In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. Recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sounds, but they are limited to non-speech audio, lacking the ability to generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen-Bench, a new benchmark with over one thousand samples covering three representative on/off-screen speech-environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Project page: https://weiguopian.github.io/OmniSonic_webpage/
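To make the TriAttn-DiT idea concrete, the following is a minimal NumPy sketch of how three parallel cross-attention streams (on-screen environment, off-screen environment, speech) could be fused by a per-token MoE-style gate. All function names, shapes, and the single-head simplification are illustrative assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens, Wq, Wk, Wv):
    # single-head scaled dot-product cross-attention:
    # audio latents attend to one condition stream
    Q, K, V = q_tokens @ Wq, kv_tokens @ Wk, kv_tokens @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def tri_attn_moe(x, conds, attn_params, Wg):
    """x: (T, d) noisy audio latent tokens.
    conds: list of 3 condition matrices (on-screen env, off-screen env, speech).
    attn_params: per-stream (Wq, Wk, Wv) projections; Wg: (d, 3) gate weights."""
    # run the three cross-attention "experts" in parallel -> (T, d, 3)
    outs = np.stack(
        [cross_attention(x, c, *p) for c, p in zip(conds, attn_params)],
        axis=-1,
    )
    # per-token softmax gate adaptively weights the three streams -> (T, 3)
    gate = softmax(x @ Wg, axis=-1)
    # gated residual fusion back into the audio latents
    return x + (outs * gate[:, None, :]).sum(-1)
```

In a full DiT block this fusion would sit between self-attention and the feed-forward layer, with layer norms and multi-head attention; the key point is that the gate lets each latent token decide how much each condition stream contributes.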