🤖 AI Summary
To address the challenge of generating high-fidelity, temporally aligned, and controllable creative sound effects for videos (e.g., clean sounds or stylized morphing sounds), this paper introduces a video-driven multimodal sound effect generation framework. Methodologically, it unifies training on large-scale internet video datasets with low-quality audio and professional SFX recordings, building a multimodal diffusion-based architecture that combines audio-video joint representation learning, full-bandwidth (48 kHz) audio generation, and cross-domain mixed training. The framework supports generation conditioned on three modalities: text, reference audio, and input video. Extensive experiments demonstrate significant improvements over prior methods in both automated metrics and human evaluations: generated sounds exhibit precise temporal alignment with visual events, high fidelity (48 kHz), and strong artistic expressiveness. This work establishes a flexible, cross-modal sound redesign capability tailored to creative sound design.
📄 Abstract
Generating sound effects for videos often requires creating artistic sounds that diverge significantly from real-life sources, as well as flexible control over the sound design. To address this problem, we introduce MultiFoley, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, MultiFoley allows users to create clean sounds (e.g., skateboard wheels spinning without wind noise) or more whimsical sounds (e.g., making a lion's roar sound like a cat's meow). MultiFoley also allows users to choose reference audio from sound effects (SFX) libraries or partial videos for conditioning. A key novelty of our model lies in its joint training on both internet video datasets with low-quality audio and professional SFX recordings, enabling high-quality, full-bandwidth (48 kHz) audio generation. Through automated evaluations and human studies, we demonstrate that MultiFoley successfully generates synchronized high-quality sounds across varied conditional inputs and outperforms existing methods. Please see our project page for video results: https://ificl.github.io/MultiFoley/