Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound

📅 2024-08-21
🏛️ arXiv.org
📈 Citations: 6
Influential: 1
🤖 AI Summary
To address temporal misalignment, semantic inconsistency, and reliance on costly manual timestamp annotations in video-driven Foley sound synthesis, this paper proposes Video-Foley, an annotation-free, self-supervised framework. Methodologically, it introduces RMS intensity envelopes as unsupervised temporal event cues, enabling fine-grained control via RMS discretization and a novel RMS-ControlNet that jointly incorporates textual and audio-semantic prompts. The architecture adopts a two-stage pipeline, Video2RMS followed by RMS2Sound, integrating self-supervised learning, ControlNet-guided diffusion modeling, pretrained text-to-audio (T2A) priors, and an explicit RMS feature representation. Extensive evaluation demonstrates state-of-the-art performance in audio-visual alignment, impact timing accuracy, dynamic intensity modeling, timbral fidelity, and controllability over sound details. The code, pretrained models, and an interactive demo are publicly released.
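
As a concrete illustration of the RMS temporal condition described in the summary, the sketch below computes a frame-level RMS envelope from a mono waveform and quantizes it into integer bins. The frame length, hop length, bin count, and the uniform quantization scheme are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def frame_rms(wave: np.ndarray, frame_length: int = 1024, hop_length: int = 512) -> np.ndarray:
    """Frame-level RMS intensity envelope of a mono waveform."""
    n_frames = 1 + max(0, len(wave) - frame_length) // hop_length
    rms = np.empty(n_frames)
    for i in range(n_frames):
        frame = wave[i * hop_length : i * hop_length + frame_length]
        rms[i] = np.sqrt(np.mean(frame ** 2))
    return rms

def discretize_rms(rms: np.ndarray, num_bins: int = 64) -> np.ndarray:
    """Map the continuous envelope to integer bin indices (uniform quantization;
    the paper may use a different scheme)."""
    norm = rms / (rms.max() + 1e-8)            # normalize to [0, 1]
    return np.minimum((norm * num_bins).astype(int), num_bins - 1)

# Example: a 1-second synthetic impact with exponential decay at 16 kHz
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
wave = np.exp(-5 * t) * np.sin(2 * np.pi * 440 * t)
bins = discretize_rms(frame_rms(wave))         # coarse temporal event curve
print(bins[:10])
```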

📝 Abstract
Foley sound synthesis is crucial for multimedia production, enhancing user experience by synchronizing audio and video both temporally and semantically. Recent studies on automating this labor-intensive process through video-to-sound generation face significant challenges. Systems lacking explicit temporal features suffer from poor alignment and controllability, while timestamp-based models require costly and subjective human annotation. We propose Video-Foley, a video-to-sound system using Root Mean Square (RMS) as an intuitive condition with semantic timbre prompts (audio or text). RMS, a frame-level intensity envelope closely related to audio semantics, acts as a temporal event feature to guide audio generation from video. The annotation-free self-supervised learning framework consists of two stages, Video2RMS and RMS2Sound, incorporating novel ideas including RMS discretization and RMS-ControlNet with a pretrained text-to-audio model. Our extensive evaluation shows that Video-Foley achieves state-of-the-art performance in audio-visual alignment and controllability for sound timing, intensity, timbre, and nuance. Source code, model weights and demos are available on our companion website. (https://jnwnlee.github.io/video-foley-demo)
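
To make the two-stage structure in the abstract concrete, here is a schematic of how inference could be wired together. The class names, method signatures, and the text-prompt interface are hypothetical placeholders (the stage bodies are toy stand-ins), not the released models or API.

```python
import numpy as np

class Video2RMS:
    """Stage 1: predict a discretized RMS envelope from video frames."""
    def predict(self, video_frames: np.ndarray, num_bins: int = 64) -> np.ndarray:
        # Placeholder: frame-to-frame motion energy as a crude proxy for sound intensity
        motion = np.abs(np.diff(video_frames.astype(float), axis=0)).mean(axis=(1, 2, 3))
        norm = motion / (motion.max() + 1e-8)
        return np.minimum((norm * num_bins).astype(int), num_bins - 1)

class RMS2Sound:
    """Stage 2: generate audio conditioned on the RMS curve plus a timbre prompt
    (text or reference audio), e.g. via RMS-ControlNet over a pretrained T2A model."""
    def generate(self, rms_bins: np.ndarray, prompt: str, sr: int = 16000) -> np.ndarray:
        # Placeholder: shape white noise with the upsampled RMS envelope; prompt is ignored here
        env = np.repeat(rms_bins / rms_bins.max(), sr // len(rms_bins) + 1)[:sr]
        return env * np.random.randn(sr)

video = np.random.randint(0, 255, size=(30, 64, 64, 3), dtype=np.uint8)  # 30 dummy frames
rms_curve = Video2RMS().predict(video)
audio = RMS2Sound().generate(rms_curve, prompt="glass shattering on concrete")
```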
Problem

Research questions and friction points this paper is trying to address.

Foley sound synthesis for multimedia production is labor-intensive and calls for automation.
Video-to-sound systems without explicit temporal features suffer from poor alignment and limited controllability.
Timestamp-based models require costly, subjective human annotation of sound timing.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses frame-level RMS intensity envelopes as a temporal event feature
Annotation-free, self-supervised two-stage framework (Video2RMS and RMS2Sound)
Introduces RMS discretization and RMS-ControlNet on top of a pretrained text-to-audio model (see the sketch below)
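
The RMS-ControlNet idea in the last bullet is, conceptually, a trainable control branch attached to a frozen text-to-audio generator. Below is a minimal sketch of that ControlNet-style pattern, assuming a discretized RMS input and zero-initialized projections; the dimensions, bin count, and module layout are illustrative, not the authors' architecture.

```python
import torch
import torch.nn as nn

class RMSControlBranch(nn.Module):
    """Encodes a discretized RMS curve and injects it into a frozen generator's
    hidden states via a zero-initialized projection, so training starts from an
    identity mapping (the ControlNet trick)."""
    def __init__(self, num_rms_bins: int = 64, hidden_dim: int = 256):
        super().__init__()
        self.rms_embed = nn.Embedding(num_rms_bins, hidden_dim)
        self.encoder = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1)
        # Zero-initialized 1x1 conv: the control signal has no effect at initialization.
        self.zero_proj = nn.Conv1d(hidden_dim, hidden_dim, kernel_size=1)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, rms_bins: torch.Tensor, hidden_states: torch.Tensor) -> torch.Tensor:
        # rms_bins: (batch, frames) integer bin indices
        # hidden_states: (batch, hidden_dim, frames) from the frozen T2A backbone
        ctrl = self.rms_embed(rms_bins).transpose(1, 2)   # (B, H, T)
        ctrl = torch.relu(self.encoder(ctrl))
        return hidden_states + self.zero_proj(ctrl)

branch = RMSControlBranch()
rms_bins = torch.randint(0, 64, (2, 100))    # batch of discretized RMS curves
hidden = torch.randn(2, 256, 100)            # frozen backbone activations
out = branch(rms_bins, hidden)               # equals `hidden` at initialization
```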
Authors

Junwon Lee (KAIST): Controllable Audio Generation, Multimodal Learning, Music Information Retrieval
Jae-Yeol Im (Graduate School of Culture Technology, KAIST, Republic of Korea)
Dabin Kim (Graduate School of Culture Technology, KAIST, Republic of Korea)
Juhan Nam (KAIST): Music Technology, Music Information Retrieval, Audio Signal Processing, Music Processing