Video-Guided Foley Sound Generation with Multimodal Controls

📅 2024-11-26
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
๐Ÿ“„ PDF
🤖 AI Summary
To address the challenge of generating high-fidelity, temporally aligned, and controllable creative sound effects for videos (e.g., cleaned-up sounds or stylized morphing sounds), this paper introduces the first video-driven multimodal sound effect generation framework. Methodologically, it unifies training on large-scale web-sourced video datasets with low-quality audio and on professional SFX recordings, building a multimodal diffusion-based architecture that integrates audio-video joint representation learning, a full-bandwidth 48 kHz neural vocoder, and cross-domain mixed training. The framework supports generation conditioned on three modalities: text, reference audio, and input video. Extensive experiments demonstrate significant improvements over prior methods in both automated metrics and human evaluations: generated sounds exhibit precise temporal alignment with visual events, high fidelity (48 kHz), and strong artistic expressiveness. This work establishes the first flexible, cross-modal sound redesign capability tailored to creative sound design.
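Multi-condition diffusion models of this kind are commonly trained by independently dropping each condition, so the denoiser also learns the unconditional distribution needed for classifier-free guidance. The sketch below illustrates that generic recipe in NumPy; all function names, embedding shapes, and drop probabilities are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def fuse_conditions(text_emb, audio_emb, video_emb,
                    drop_probs=(0.1, 0.1, 0.1), rng=None):
    """Fuse per-modality condition embeddings for a diffusion denoiser.

    During training, each modality is independently replaced by a zero
    "null" embedding with some probability -- the usual classifier-free
    guidance recipe for multi-condition models. Hypothetical sketch,
    not MultiFoley's actual implementation.
    """
    rng = rng or np.random.default_rng()
    fused = []
    for emb, p in zip((text_emb, audio_emb, video_emb), drop_probs):
        kept = rng.random() >= p           # drop the whole modality at once
        fused.append(emb if kept else np.zeros_like(emb))
    return np.concatenate(fused, axis=-1)  # the denoiser conditions on this
```

With all drop probabilities at zero this reduces to plain concatenation of the three condition embeddings; with all at one it yields the fully unconditional input.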

๐Ÿ“ Abstract
Generating sound effects for videos often requires creating artistic sound effects that diverge significantly from real-life sources and flexible control in the sound design. To address this problem, we introduce MultiFoley, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels spinning without wind noise) or more whimsical sounds (e.g., making a lion's roar sound like a cat's meow). MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning. A key novelty of our model lies in its joint training on both internet video datasets with low-quality audio and professional SFX recordings, enabling high-quality, full-bandwidth (48kHz) audio generation. Through automated evaluations and human studies, we demonstrate that MultiFoley successfully generates synchronized high-quality sounds across varied conditional inputs and outperforms existing methods. Please see our project page for video results: https://ificl.github.io/MultiFoley/
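At sampling time, conditional diffusion models of the kind the abstract describes typically combine conditional and unconditional noise predictions via classifier-free guidance. A minimal sketch follows; the function name and default scale are assumptions for illustration, not values from the paper.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, scale=4.0):
    """Extrapolate from the unconditional toward the conditional noise
    prediction. scale > 1 strengthens adherence to the text/audio/video
    conditions at the cost of sample diversity; scale = 1 recovers the
    plain conditional prediction."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

In practice this step runs at every denoising iteration, with `eps_cond` computed from the fused conditions and `eps_uncond` from the null (all-dropped) conditioning.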
Problem

Research questions and friction points this paper is trying to address.

Video Production
Audio Effects
Control and Adjustment
Innovation

Methods, ideas, or system contributions that make the work stand out.

MultiFoley
High-Fidelity Sound Generation
Audio-Visual Synchronization