🤖 AI Summary
Existing methods for generating audio from silent videos do not output fine-grained sound event labels, such as event type and onset time, that are temporally aligned with the visual content; obtaining such labels typically requires a post-processing step that introduces additional errors. This work proposes MMAudio-LABEL, a unified framework that jointly models audio generation and sound event annotation by introducing an event-aware mechanism in the latent space, enabling end-to-end multi-task training that produces audio and frame-aligned event labels simultaneously. Evaluated on the Greatest Hits dataset, the approach improves sound onset detection accuracy from 46.7% to 75.0% and 17-class material classification accuracy from 40.6% to 61.0% over baselines, suggesting that the generated audio becomes more interpretable and more practical to use.
📝 Abstract
Recent advances in multimodal generation have enabled high-quality audio generation from silent videos. Practical applications, such as sound production, demand not only the generated audio but also explicit sound event labels that specify the type and timing of each sound. One straightforward approach is to apply a standard sound event detection model to the generated audio. However, this post-hoc pipeline is inherently limited because errors accumulate across its stages. To address this limitation, we propose MMAudio-LABEL (LAtent-Based Event Labeling), an event-aware audio generation framework that uses a foundational audio generation model as its backbone and jointly generates audio and frame-aligned sound event predictions from silent videos. We evaluate our method on the Greatest Hits dataset for onset detection and 17-class material classification. Our approach improves onset-detection accuracy from 46.7% to 75.0% and material-classification accuracy from 40.6% to 61.0% over baselines. These results suggest that jointly learning audio generation and event prediction enables more interpretable and practical video-to-audio synthesis.
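To make "frame-aligned sound event predictions" concrete, the following is a minimal sketch of how per-frame outputs could be turned into a list of labeled events with a type and a time. It is not the authors' implementation: the function name `decode_events`, the peak-picking rule, the threshold of 0.5, and the frame rate used in the example are all assumptions made for illustration. Only the number of material classes (17) comes from the abstract.

```python
from typing import List, Sequence, Tuple

NUM_MATERIALS = 17  # from the abstract: 17-class material classification


def decode_events(
    onset_probs: Sequence[float],
    material_scores: Sequence[Sequence[float]],
    frame_rate: float,
    threshold: float = 0.5,
) -> List[Tuple[float, int]]:
    """Turn per-frame predictions into (onset_time_seconds, material_class) events.

    Hypothetical decoding rule (an assumption, not taken from the paper):
    a frame counts as an onset if its probability reaches `threshold` and
    it is a local maximum among its immediate neighbours.
    """
    if len(onset_probs) != len(material_scores):
        raise ValueError("need one onset probability and one score vector per frame")

    events: List[Tuple[float, int]] = []
    n = len(onset_probs)
    for t in range(n):
        p = onset_probs[t]
        if p < threshold:
            continue
        left = onset_probs[t - 1] if t > 0 else float("-inf")
        right = onset_probs[t + 1] if t + 1 < n else float("-inf")
        # Strict on the left, non-strict on the right: a plateau yields one event.
        if p > left and p >= right:
            scores = material_scores[t]
            # Event type = highest-scoring material class at the onset frame.
            material = max(range(len(scores)), key=scores.__getitem__)
            events.append((t / frame_rate, material))
    return events


def one_hot(k: int) -> List[float]:
    """Toy score vector that puts all mass on class k."""
    return [1.0 if i == k else 0.0 for i in range(NUM_MATERIALS)]


# Toy example: 7 frames at an assumed 10 frames per second.
onset_probs = [0.1, 0.8, 0.3, 0.2, 0.6, 0.9, 0.4]
material_scores = [one_hot(0) for _ in onset_probs]
material_scores[1] = one_hot(3)
material_scores[5] = one_hot(12)

events = decode_events(onset_probs, material_scores, frame_rate=10.0)
print(events)
```

In this toy input, frame 4 exceeds the threshold but is not reported because frame 5 has a higher probability; such a rule is one simple way to avoid emitting duplicate onsets for a single hit.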