🤖 AI Summary
To address temporal misalignment, semantic inconsistency, and the reliance on costly manual timestamp annotations in video-driven Foley sound synthesis, this paper proposes Video-Foley, a fully end-to-end self-supervised framework. Methodologically, it introduces RMS intensity envelopes as annotation-free temporal event cues, enabling fine-grained control via RMS discretization and a novel RMS-ControlNet conditioned on semantic prompts given as text or audio. The architecture adopts a two-stage pipeline, Video2RMS followed by RMS2Sound, integrating self-supervised learning, ControlNet-guided generation on top of a pretrained text-to-audio (T2A) model, and an explicit RMS feature representation. Extensive evaluation demonstrates state-of-the-art performance in audio-visual alignment, impact timing accuracy, dynamic intensity modeling, timbral fidelity, and controllability over sound details. The code, pretrained models, and an interactive demo are publicly released.
📝 Abstract
Foley sound synthesis is crucial for multimedia production, enhancing user experience by synchronizing audio and video both temporally and semantically. Recent studies on automating this labor-intensive process through video-to-sound generation face significant challenges. Systems lacking explicit temporal features suffer from poor alignment and controllability, while timestamp-based models require costly and subjective human annotation. We propose Video-Foley, a video-to-sound system using Root Mean Square (RMS) as an intuitive condition alongside semantic timbre prompts (audio or text). RMS, a frame-level intensity envelope closely related to audio semantics, acts as a temporal event feature to guide audio generation from video. The annotation-free self-supervised learning framework consists of two stages, Video2RMS and RMS2Sound, incorporating novel ideas including RMS discretization and RMS-ControlNet with a pretrained text-to-audio model. Our extensive evaluation shows that Video-Foley achieves state-of-the-art performance in audio-visual alignment and controllability over sound timing, intensity, timbre, and nuance. Source code, model weights, and demos are available on our companion website: https://jnwnlee.github.io/video-foley-demo
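To make the RMS condition concrete, the sketch below computes a frame-level RMS intensity envelope from a waveform and quantizes it into discrete bins. The frame size, hop size, bin count, and mu-law companding are illustrative assumptions for this sketch, not necessarily the exact settings used by Video-Foley.

```python
import numpy as np

def rms_envelope(audio, frame_length=512, hop_length=128):
    """Frame-level RMS intensity envelope of a mono waveform.

    Frame and hop sizes here are illustrative defaults, not the
    paper's exact configuration.
    """
    # View the signal as overlapping frames, then take per-frame RMS.
    frames = np.lib.stride_tricks.sliding_window_view(
        audio, frame_length)[::hop_length]
    return np.sqrt(np.mean(frames ** 2, axis=1))

def discretize_rms(rms, n_bins=64, mu=255):
    """Quantize the envelope into discrete bins.

    Mu-law companding before uniform binning is an assumed choice;
    it allocates more resolution to quiet intensities.
    """
    companded = np.log1p(mu * np.clip(rms, 0.0, 1.0)) / np.log1p(mu)
    return np.clip((companded * n_bins).astype(int), 0, n_bins - 1)
```

A rising-amplitude input should yield a rising envelope and rising bin indices, which is exactly the kind of temporal intensity cue the framework extracts from video.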